diff --git a/main.tex b/main.tex
index bbb3d5469011496364ba8561adaa971d6e8918be..5fe6a94f4d84d453ab2afc208deb6c06a1041535 100644
--- a/main.tex
+++ b/main.tex
@@ -49,7 +49,7 @@ \maketitle % typeset the header of the contribution
%
\begin{abstract}
-The convergence speed of machine learning models trained with Federated Learning is significantly affected by non-independent and identically distributed (non-IID) data partitions, even more so in a fully decentralized setting without a central server. In this paper, we show that the impact can be significantly reduced by carefully designing the communication topology. We present D-Cliques, a novel topology that reduces gradient bias by grouping nodes in cliques such that their local joint distribution is representative of the global class distribution. We show how D-Cliques can be used to successfully implement momentum, which is critical to efficiently train deep convolutional networks but otherwise detrimental in a non-IID setting. We then present an extensive experimental study on MNIST and CIFAR10 and demonstrate that D-Cliques provides similar convergence speed as a fully-connected topology with a significant reduction in the number of required edges and messages. In a 1000-node topology, D-Cliques requires 98\% less edges and 96\% less total messages to achieve a similar accuracy. Our study suggests that maintaining full connectivity within cliques is necessary for fast convergence, but we show that we can reduce inter-clique connectivity by relying on a small-world topology with a logarithmic number of edges without damaging the convergence speed. D-cliques with a small-world topology leads to a further 22\% reduction in the number of edges in a 1000-node network, with even bigger gains expected at larger scales.
+The convergence speed of machine learning models trained with Federated Learning is significantly affected by non-independent and identically distributed (non-IID) data partitions, even more so in a fully decentralized setting without a central server. In this paper, we show that the impact can be significantly reduced by carefully designing the communication topology. We present D-Cliques, a novel topology that reduces gradient bias by grouping nodes in cliques such that their local joint distribution is representative of the global class distribution. We refine D-Cliques with additional bias-reducing mechanisms, tested on MNIST and CIFAR10, and demonstrate that D-Cliques provides a convergence speed similar to that of a fully-connected topology with a significant reduction in the number of required edges and messages. In a 1000-node topology, D-Cliques requires 98\% fewer edges and 96\% fewer total messages to achieve a similar accuracy, with further possible gains using a small-world topology.

\keywords{Decentralized Learning \and Federated Learning \and Topology \and
Non-IID Data \and Stochastic Gradient Descent}
@@ -64,21 +64,21 @@
% 3/ In this paper, we show that the effect of topology is very significant for non-iid data. Unlike for centralized FL approaches, this happens even when nodes perform a single local update before averaging. We propose an approach to design a sparse data-aware topology which recovers the convergence speed of a centralized approach.
% 4/ An originality of our approach is to work at the topology level without changing the original efficient and simple D-SGD algorithm.
Other work to mitigate the effect of non-iid on decentralized algorithms are based on performing modified updates (eg with variance reduction) or multiple averaging steps. -Machine learning is currently shifting from the classic \emph{centralized} +Machine learning is currently shifting from a \emph{centralized} paradigm, in which models are trained on data located on a single machine or in a data center, to more \emph{decentralized} ones. Indeed, data is often inherently decentralized as it is collected by several -parties (such as different hospitals, companies, personal devices...). +parties (hospitals, companies, personal devices...). Federated Learning (FL) allows a set of data owners to collaboratively train machine learning models on their joint data while keeping it decentralized, thereby avoiding the costs of moving data as well as mitigating privacy and confidentiality concerns -\cite{kairouz2019advances}. -Due to the decentralized nature of data collection, the local datasets of +\cite{kairouz2019advances}. +Because the data is decentralized, the local datasets of participants can be very different in size and distribution: they are \emph{not} independent and identically distributed -(non-IID). In particular, the class distributions may vary a lot +(non-IID). In particular, the class distributions may significantly vary across local datasets \cite{quagmire}. Therefore, one of the key challenges in FL is to design algorithms that can efficiently deal with such non-IID data @@ -91,7 +91,7 @@ iteratively aggregates model updates received from the participants (\emph{clients}) and sends them back the aggregated model \cite{mcmahan2016communication}. In contrast, fully decentralized FL algorithms operate over an arbitrary topology where -participants communicate in a peer-to-peer fashion with their direct neighbors +participants communicate only with their direct neighbors in the network graph. A classic example of such algorithms is Decentralized SGD (D-SGD) \cite{lian2017d-psgd}, in which participants alternate between local SGD updates and model averaging with neighboring nodes. @@ -105,52 +105,39 @@ enough such that all participants have small (constant or logarithmic) degree \cite{lian2017d-psgd}. Recent work has shown both empirically \cite{lian2017d-psgd,Lian2018} and theoretically \cite{neglia2020} that sparse topologies like rings or grids do not significantly affect the convergence -rate compared to using denser topologies when data is IID. +rate compared to using denser topologies with IID data. % We also note that full decentralization can also provide benefits in terms of % privacy protection \cite{amp_dec}. -In contrast to the IID case, we show in this work that \emph{the impact of +In contrast to the IID case, we show that \emph{the impact of topology is very significant for non-IID data}. This phenomenon is illustrated -in Figure~\ref{fig:iid-vs-non-iid-problem}, where we see that using a ring or -grid topology completely jeopardize the convergence rate in practice when +in Figure~\ref{fig:iid-vs-non-iid-problem}: using a ring or +grid topology completely jeopardizes the convergence rate when classes are imbalanced across participants. -We stress that the fact unlike for centralized FL approaches +We stress that, unlike for centralized FL \cite{kairouz2019advances,scaffold,quagmire}, this happens even when nodes perform a single local update before averaging the -mode with their neighbors. We thus study the following question: +model with their neighbors. 
We thus study the following question: % \textit{Are there regular topologies, i.e. where all nodes have similar or the same number of neighbours, with less connections than a fully-connected graph that retain a similar convergence speed and non-IID behaviour?} -\textit{Are there sparse topologies that allow a similar convergence +\textit{Are there sparse topologies with similar convergence speed as the fully connected graph under a large number of participants with non-IID class distribution?} We answer this question in the affirmative by proposing \textsc{D-Cliques}, an approach to design \emph{a sparse data-aware topology which allows to recover the convergence -speed of a centralized (or IID) approach}. Our proposal includes a simple +speed of a centralized (or IID) approach}. Our proposal includes Clique Averaging, a simple modification of the standard D-SGD algorithm which ensures that gradients are -unbiased with respect to the class distribution. +unbiased with respect to the class distribution. Clique Averaging can be used to implement optimization +techniques, such as momentum, that otherwise rely on an IID assumption on mini-batches. + We empirically evaluate our approach on MNIST and CIFAR10 datasets using -logistic -regression and deep convolutional models with up to 1000 participants. This is +logistic regression and deep convolutional models with up to 1000 participants. This is in contrast to most previous work on fully decentralized algorithms -considering only a few tens of participants \cite{tang18a,more_refs}, which -fall short of -giving a realistic view of the performance of these algorithms in actual -applications. -\aurelien{TODO: complete above paragraph with more details and highlighting -other contributions as needed} - -To summarize, our contributions are as follows: -\begin{enumerate} - \item we show the significant impact of topology on convergence speed in the presence of non-IID data in decentralized learning; - \item we propose the D-Cliques topology to remove the impact of non-IID data on convergence speed, similar to a fully-connected topology. At a scale of 1000 nodes, this represents a 98\% reduction in the number of edges ($18.9$ vs $999$ edges per node on average) and a 96\% reduction in the total number of required messages; - \item we show how to leverage cliques to: (1) remove gradient bias that originate from inter-clique edges; - (2) implement momentum, a critical optimization technique to quickly train convolutional networks, that otherwise significantly \textit{decreases} convergence speed in the presence of non-IID data; - \item we show that, among the many possible choices of inter-clique topologies, a smallworld topology provides a convergence speed close to fully-connecting all cliques pairwise, but requires only $O(n + log(n))$ instead of $O(n^2)$ edges where $n$ is the number of nodes. At a scale of 1000 nodes, this represents a further 22\% reduction in the number of edges compared to fully-connecting cliques ($14.6$ vs $18.9$ edges per node on average) and suggests possible bigger gains at larger scales. -\end{enumerate} - -The rest of this paper is organized as follows. \aurelien{TO COMPLETE} +considering only a few tens of participants \cite{tang18a,more_refs}. 
With 1000 participants, the resulting design requires 98\% fewer edges ($18.9$ vs $999$ edges per participant on average) and a 96\% reduction in the total number of required messages (37.8 messages per round per node on average instead of 999) to obtain a convergence speed similar to that of a fully-connected topology. Furthermore, an additional 22\% improvement (14.5 edges per node on average instead of 18.9) is possible when using a small-world inter-clique topology, with further potential gains at larger scales because its number of edges grows only in $O(n + log(n))$ with the number of nodes.
+
+The rest of this paper is organized as follows. We first present the problem, as well as our assumptions (Section~\ref{section:problem}). We then explain how to construct D-Cliques and show their benefits (Section~\ref{section:d-cliques}). We show how to further reduce bias with Clique Averaging (Section~\ref{section:clique-averaging}) and how to use Clique Averaging to implement momentum (Section~\ref{section:momentum}). Having shown the effectiveness of D-Cliques, we evaluate whether they are actually necessary: first by comparing them to similar non-clustered topologies (Section~\ref{section:non-clustered}), and then by showing the impact of breaking full intra-clique connectivity (Section~\ref{section:intra-clique-connectivity}). Having established the design, we then study how best to scale it (Section~\ref{section:interclique-topologies}). We conclude with a comparison to related work (Section~\ref{section:related-work}) and a brief summary of the paper (Section~\ref{section:conclusion}).

\begin{figure}
    \centering
@@ -173,10 +160,7 @@
    \end{subfigure}
    \caption{IID vs non-IID Convergence Speed. Thin lines are the minimum
    and maximum accuracy of individual nodes. Bold lines are the average
-    accuracy across all nodes.\protect \footnotemark \aurelien{TODO: make
-    the figure more self-contained (adding all information to
-    understand the setting) and perhaps add a fig to show the
-    effect of D-Cliques}}
+    accuracy across all nodes.\protect \footnotemark{} The blue curve shows convergence in the IID case: the topology has limited effect. The orange curve shows convergence in the non-IID case: the topology has a significant effect. When fully connected, the IID and non-IID cases converge similarly.}
    \label{fig:iid-vs-non-iid-problem}
\end{figure}

@@ -209,7 +193,7 @@ To focus our study while still retaining the core aspects of the problem, we mak

\subsection{Learning Algorithm}

-We use the Decentralized-Parallel Stochastic Gradient Descent, aka D-PSGD~\cite{lian2017d-psgd}, illustrated in Algorithm~\ref{Algorithm:D-PSGD}. A single step consists of sampling the local distribution $D_i$, computing and applying a stochastic gradient descent (SGD) step with regard to that sample, and averaging the model with its neighbours. Both outgoing and incoming weights of $W$ must sum to 1, i.e. $W$ is doubly stochastic ($\sum_{j \in N} W_{ij} = 1$ and $\sum_{j \in N} W_{ji} = 1$), and communication is symmetric, i.e. $W_{ij} = W_{ji}$.
+We use Decentralized Parallel Stochastic Gradient Descent (D-PSGD)~\cite{lian2017d-psgd}, illustrated in Algorithm~\ref{Algorithm:D-PSGD}. A single step consists of sampling the local distribution $D_i$, computing and applying a stochastic gradient descent (SGD) step on that sample, and averaging the model with its neighbours. Both outgoing and incoming weights of $W$ must sum to 1, i.e.
$W$ is doubly stochastic ($\sum_{j \in N} W_{ij} = 1$ and $\sum_{j \in N} W_{ji} = 1$), and communication is symmetric, i.e. $W_{ij} = W_{ji}$. \begin{algorithm}[h] \caption{D-PSGD, Node $i$} @@ -236,7 +220,7 @@ D-PSGD can be used with a variety of models, including deep learning networks. I \section{D-Cliques: Creating Locally Representative Cliques} \label{section:d-cliques} -To have a preliminary intuition of the impact of non-IID data on convergence speed, examine the local neighbourhood of a single node in a grid similar to that used to obtain results in Figure~\ref{fig:grid-IID-vs-non-IID}. As illustrated in Figure~\ref{fig:grid-iid-vs-non-iid-neighbourhood}, the color of a node, represented as a circle, corresponds to one of the 10 available classes in the dataset. In this IID setting (Figure~\ref{fig:grid-iid-neighbourhood}), each node has examples of all ten classes in equal proportions. In the other (rather extreme) non-IID case (Figure~\ref{fig:grid-non-iid-neighbourhood}), each node has examples of only a single class and nodes are distributed randomly in the grid: this particular example illustrates a neighbourhood with two nodes with examples of the same class adjacent to each other. +To have a preliminary intuition of the impact of non-IID data on convergence speed, examine the local neighbourhood of a single node in a grid similar to that of Figure~\ref{fig:grid-IID-vs-non-IID}. As illustrated in Figure~\ref{fig:grid-iid-vs-non-iid-neighbourhood}, the color of a node, represented as a circle, corresponds to one of the 10 available classes in the dataset. In the IID setting (Figure~\ref{fig:grid-iid-neighbourhood}), each node has examples of all ten classes in equal proportions. In the non-IID setting (Figure~\ref{fig:grid-non-iid-neighbourhood}), each node has examples of only a single class and nodes are distributed randomly in the grid. \begin{figure} \centering @@ -254,11 +238,11 @@ To have a preliminary intuition of the impact of non-IID data on convergence spe \label{fig:grid-iid-vs-non-iid-neighbourhood} \end{figure} -For the sake of the argument, assume all nodes are initialized with the same model weights, which is not critical for quick convergence but makes the comparison easier. A single training step, from the point of view of the middle node of Figure~\ref{fig:grid-IID-vs-non-IID}, is equivalent to sampling a mini-batch five times larger from the union of the local distributions of the five illustrated nodes. +For the sake of the argument, assume all nodes are initialized with the same model, which is not critical for quick convergence but makes the comparison easier. A single training step, from the point of view of the middle node of Figure~\ref{fig:grid-IID-vs-non-IID}, is equivalent to sampling a mini-batch five times larger from the union of the local distributions of the five illustrated nodes. -In the IID case, since gradients are computed from examples of all classes, the resulting average gradient will point in a direction that lowers the loss for all classes. This is the case because the components of the gradient that would only improve the loss on a subset of the classes to the detriment of others are cancelled by similar but opposite components from other classes. Therefore only the components that improve the loss for all classes remain. 
+In the IID case, since gradients are computed from examples of all classes, the resulting average gradient will point in a direction that lowers the loss for all classes: the components of the gradient that would only improve the loss on a subset of the classes to the detriment of others are cancelled by similar but opposite components from other classes. Therefore only the components that improve the loss for all classes remain. -However, in the (rather extreme) non-IID case illustrated, there are not enough nodes in the neighbourhood to remove the bias of the classes represented. Even if all nodes start from the same model weights, they will diverge from one another according to the classes represented in their neighbourhood, more than they would have had in the IID case. Moreover, as the distributed averaging algorithm takes several steps to converge, this variance is never fully resolved and the variance remains between steps.\footnote{It is possible, but impractical, to compensate for this effect by averaging multiple times before the next gradient computation. In effect, this trades connectivity (number of edges) for latency to give the same convergence speed, in number of gradients computed, as a fully connected graph.} This additional variance biases subsequent gradient computations as the gradients are computed further away from the global average. As shown in Figure~\ref{fig:ring-IID-vs-non-IID} and \ref{fig:grid-IID-vs-non-IID}, this significantly slows down convergence speed to the point of making parallel optimization impractical. +However, in the non-IID case illustrated, there are not enough nodes in the neighbourhood to remove the bias of the classes represented. Even if all nodes start from the same model, they will diverge from one another according to the classes represented in their neighbourhood, more than in the IID case. Moreover, as the distributed averaging algorithm takes several steps to converge, this variance persists between steps.\footnote{It is possible, but impractical, to compensate for this effect by averaging multiple times before the next gradient computation. In effect, this trades connectivity (number of edges) for latency to give the same convergence speed, in number of gradients computed, as a fully connected graph.} This additional variance biases subsequent gradient computations as the gradients are computed further away from the global average. As shown in Figure~\ref{fig:ring-IID-vs-non-IID} and \ref{fig:grid-IID-vs-non-IID}, this significantly slows down convergence speed to the point of making parallel optimization impractical. \begin{algorithm}[h] \caption{D-Clique Construction} @@ -286,7 +270,7 @@ However, in the (rather extreme) non-IID case illustrated, there are not enough Under our non-IID assumptions (Section~\ref{section:non-iid-assumptions}), a balanced representation of classes, similar to that of the IID case, can be recovered by modifying the topology such that each node has direct neighbours of all classes. Moreover, as we shall show in the next sections, there are benefits in ensuring the clustering of neighbours into a \textit{clique}, such that, within a clique, neighbours of a node are also directly connected. To ensure all cliques still converge to a single model, a number of inter-clique connections are introduced, established directly between nodes that are part of cliques. 
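To make this construction concrete, the following minimal Python sketch (an illustration only, not the exact procedure of Algorithm~\ref{Algorithm:D-Clique-Construction} nor our simulator code) greedily groups nodes into cliques with one node per class, under the extreme partition used in our experiments where every node holds examples of a single class and all classes are equally represented; the helper names \texttt{build\_cliques} and \texttt{intra\_clique\_edges} are hypothetical.
\begin{verbatim}
# Illustrative sketch: group nodes into cliques so that each clique
# contains one node per class, then fully connect each clique.
# Assumes every node holds a single class and classes are balanced.
from collections import defaultdict
import itertools

def build_cliques(node_classes):
    # node_classes: dict mapping node id -> class label
    by_class = defaultdict(list)
    for node, label in node_classes.items():
        by_class[label].append(node)
    cliques = []
    # keep forming cliques while every class still has an unused node
    while by_class and all(members for members in by_class.values()):
        cliques.append([members.pop() for members in by_class.values()])
    return cliques

def intra_clique_edges(cliques):
    # fully connect every clique internally
    return [frozenset(pair)
            for clique in cliques
            for pair in itertools.combinations(clique, 2)]

# Example: 100 nodes, 10 classes, 10 nodes per class -> 10 cliques of 10
node_classes = {i: i % 10 for i in range(100)}
cliques = build_cliques(node_classes)
print(len(cliques), len(intra_clique_edges(cliques)))  # 10, 450
\end{verbatim}
Inter-clique edges are then added on top of these fully-connected cliques.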
Because the joint location distributions $D_{\textit{clique}} = \sum_{i \in \textit{clique}} D_i$ is representative of the global distribution, similar to the IID case, a sparse topology can be used between cliques, significantly reducing the total number of edges required to obtain quick convergence. And because the number of connections required per node is low and even, this approach is well suited to decentralized federated learning.
-The construction of the resulting \textit{decentralized cliques} (d-cliques) topology can be performed with Algorithm~\ref{Algorithm:D-Clique-Construction}. Essentially, each clique $C$ is constructed one at a time by selecting nodes with differing classes. Once all cliques are constructed, intra-clique and inter-clique edges are added.
+The resulting \textit{decentralized cliques} (\textsc{D-Cliques}) topology can be constructed with Algorithm~\ref{Algorithm:D-Clique-Construction}. Essentially, each clique $C$ is constructed one at a time by selecting nodes with differing classes. Once all cliques are constructed, intra-clique and inter-clique edges are added.

Finally, weights are assigned to edges to ensure quick convergence. For this study we use Metropolis-Hasting (CITE), which while not offering optimal convergence speed in the general case, provides good convergence by taking into account the degree of immediate neighbours:
  \end{cases}
\end{equation}

-In this paper, we focus on showing the convergence benefits of such a topology for decentralized federated learning. Algorithm~\ref{Algorithm:D-Clique-Construction} therefore centrally generates the topology, which is then tested in a simulator. We expect this algorithm should be straightforward to adapt for a decentralized execution: the computation of the classes globally present, $L$, could be computed PushSum (CITE), and the selection of neighbours done with PeerSampling (CITE).
+In this paper, we focus on showing the convergence benefits of such a topology for decentralized federated learning. Algorithm~\ref{Algorithm:D-Clique-Construction} therefore centrally generates the topology, which is then tested in a simulator. We expect this algorithm to be straightforward to adapt for decentralized execution: the set of classes present globally, $L$, could be computed using PushSum (CITE), and the selection of neighbours done with PeerSampling (CITE).

\begin{figure}[htbp]
    \centering
@@ -334,7 +318,7 @@ Using Algorithm~\ref{Algorithm:D-Clique-Construction} on a network of 100 nodes

\section{Removing Gradient Bias from Inter-Clique Edges}
\label{section:clique-averaging}

-Inter-clique connections create sources of bias. By averaging models after a gradient step, D-PSGD effectively gives a different weight to the gradient of neighbours.
+Inter-clique connections create sources of bias. By averaging models after a gradient step, D-PSGD effectively gives different weights to neighbours' gradients.

\begin{figure}[htbp]
    \centering
@@ -346,7 +330,7 @@ Figure~\ref{fig:connected-cliques-bias} illustrates the problem with the simples

Node A will have a weight of $\frac{12}{110}$ while all of A's neighbours will have a weight of $\frac{11}{110}$, except the green node connected to B, that will have a weight of $\frac{10}{110}$. This weight assignment therefore biases the gradient towards A's class and away from the green class.
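The weights above can be checked numerically with the short Python sketch below, assuming the Metropolis-Hastings rule takes its usual form $W_{ij} = \frac{1}{1+\max(\text{degree}(i),\text{degree}(j))}$ for an edge $(i,j)$, with each node keeping the remaining mass as its own weight (a sanity check only, not the simulator code):
\begin{verbatim}
# Check of the weights quoted above. Assumed setting: two cliques of
# 10 nodes; the "green" node of A's clique has one extra inter-clique
# edge to node B. Assumed rule: W_ij = 1 / (1 + max(deg_i, deg_j)),
# with the leftover mass kept as the node's own weight.
from fractions import Fraction

def mh_weight(deg_i, deg_j):
    return Fraction(1, 1 + max(deg_i, deg_j))

deg_A = 9        # node A: 9 intra-clique neighbours, no inter-clique edge
deg_green = 10   # green node: 9 intra-clique neighbours + 1 edge to B

w_green = mh_weight(deg_A, deg_green)   # weight A gives the green node
w_other = mh_weight(deg_A, 9)           # weight of each of the 8 other members
w_self = 1 - w_green - 8 * w_other      # weight A keeps for itself

# prints 6/55 1/10 1/11, i.e. 12/110, 11/110 and 10/110 respectively
print(w_self, w_other, w_green)
\end{verbatim}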
The same analysis holds for all other nodes without inter-clique edges. For node B, all neighbours and B will have weights of $\frac{1}{11}$. However, the green class is represented twice while all other classes are represented only once. This biases the gradient toward the green class. The combined effect of these two sources of bias is to increase the variance between models after a D-PSGD step of training. \begin{algorithm}[h] - \caption{Clique-Unbiased D-PSGD, Node $i$} + \caption{D-PSGD with Clique Averaging, Node $i$} \label{Algorithm:Clique-Unbiased-D-PSGD} \begin{algorithmic}[1] \State \textbf{Require} initial model parameters $x_i^{(0)}$, learning rate $\gamma$, mixing weights $W$, number of steps $K$, loss function $F$ @@ -359,7 +343,7 @@ Node A will have a weight of $\frac{12}{110}$ while all of A's neighbours will h \end{algorithmic} \end{algorithm} -We solve this problem with a clique-unbiased version of D-PSGD, listed in Algorithm~\ref{Algorithm:Clique-Unbiased-D-PSGD}: gradient averaging is decoupled from weight averaging by sending each in separate rounds of messages. Only the gradients of neighbours within the same clique are used to compute the average gradient, which provides an equal representation to all classes in the computation of the average gradient. But all models of neighbours, including those across inter-clique edges, are used for computing the distributed average of models as in the original version. +We solve this problem by adding Clique Averaging to D-PSGD (Algorithm~\ref{Algorithm:Clique-Unbiased-D-PSGD}): gradient averaging is decoupled from model averaging by sending each in separate rounds of messages. Only the gradients of neighbours within the same clique are used to compute the average gradient, providing an equal representation to all classes. But all models of neighbours, including those across inter-clique edges, participate in the model averaging as in the original version. % To regenerate figure, from results/mnist: % python ../../../learn-topology/tools/plot_convergence.py fully-connected/all/2021-03-10-09:25:19-CET no-init-no-clique-avg/fully-connected-cliques/all/2021-03-12-11:12:49-CET no-init/fully-connected-cliques/all/2021-03-12-11:12:01-CET --add-min-max --yaxis test-accuracy --labels '100 nodes non-IID fully-connected' '100 nodes non-IID d-cliques w/o clique avg.' '100 nodes non-IID w/ clique avg.' --legend 'lower right' --ymin 89 --ymax 92.5 --font-size 13 --save-figure ../../figures/d-clique-mnist-clique-avg.png @@ -369,13 +353,15 @@ We solve this problem with a clique-unbiased version of D-PSGD, listed in Algori \caption{\label{fig:d-clique-mnist-clique-avg} Effect of Clique Averaging on MNIST. Y-axis starts at 89.} \end{figure} -As illustrated in Figure~\ref{fig:d-clique-mnist-clique-avg}, this significantly reduces variance between nodes and accelerates convergence speed: the node with lowest accuracy performs as well as nodes on average when not using clique averaging. The convergence speed is now essentially identical to that obtained when fully connecting all nodes. These benefits are obtained at a higher messaging cost, double to that without clique averaging, and increases latency of a single training step by requiring two rounds of messages. Nonetheless, compared to fully connecting all nodes, the total number of messages is reduced by $\approx 80\%$. MNIST and a Linear model are relatively simple, so the next section shows to work with a harder dataset and a higher capacity model. 
+As illustrated in Figure~\ref{fig:d-clique-mnist-clique-avg}, this significantly reduces variance between nodes and accelerates convergence speed. The convergence speed is now essentially identical to that obtained when fully connecting all nodes. The trade-off is a higher messaging cost, double that of D-PSGD without Clique Averaging, and the increased latency of a single training step, which now requires two rounds of messages. Nonetheless, compared to fully connecting all nodes, the total number of messages is reduced by $\approx 80\%$. MNIST and a linear model are relatively simple, so the next section shows how the approach works with a harder dataset and a higher-capacity model.

\section{Implementing Momentum with Clique Averaging}
+\label{section:momentum}

-Training higher capacity models, such as a deep convolutional network, on harder datasets, such as CIFAR10, is usually done with additional optimization techniques to accelerate convergence speed in centralized settings. But sometimes, these techniques rely on an IID assumption in local distributions which does not hold in more general cases. We show here how Clique Averaging (Section~\ref{section:clique-averaging}) easily enables the implementation of these optimization techniques in the more general non-IID setting with D-Cliques.
+Quickly training higher-capacity models, such as a deep convolutional network, on harder datasets, such as CIFAR10, usually requires additional optimization techniques. We show here how Clique Averaging (Section~\ref{section:clique-averaging}) easily enables, in the more general non-IID setting, optimization techniques that otherwise rely on IID mini-batches.

-In particular, we implement momentum (CITE), which increases the magnitude of the components of the gradient that are shared between several consecutive steps. Momentum is critical for making deep convolutional networks, such as LeNet, converge quickly. However, a simpler application of momentum in a non-IID setting can actually be detrimental. As illustrated in Figure~\ref{fig:d-cliques-cifar10-momentum-non-iid-effect}, the convergence of LeNet on CIFAR10 with momentum with 100 nodes using the d-cliques topology is so bad that the network actually fails to converge. To put things in context, we compare the convergence speed to a single centralized IID node performing the same number of updates per epoch, therefore using a batch size 100 times larger: this is essentially equivalent to completely removing the impact of the topology, non-IIDness, and decentralized averaging on the convergence speed. As shown, not using momentum gives a better convergence speed, but this is still far off from the one that would be obtained with a single centralized IID node, so momentum is actually necessary.
+In particular, we implement momentum (CITE), which increases the magnitude of the components of the gradient that are shared between several consecutive steps. Momentum is critical for making deep convolutional networks, such as LeNet (CITE), converge quickly. However, a naive application of momentum in a non-IID setting can actually be detrimental. As illustrated in Figure~\ref{fig:d-cliques-cifar10-momentum-non-iid-effect}, LeNet trained with momentum on CIFAR10 over 100 nodes arranged in a D-Cliques topology fails to converge.
To put things in context, we compare the convergence speed to a single centralized IID node performing the same number of updates per epoch, therefore using a batch size 100 times larger: this is essentially equivalent to completely removing the impact of the topology, non-IIDness, and decentralized averaging on the convergence speed. As shown, not using momentum gives a better convergence speed, but there is still a significant gap. \begin{figure}[htbp] \centering @@ -397,7 +383,7 @@ In particular, we implement momentum (CITE), which increases the magnitude of th \caption{\label{fig:cifar10-momentum} Non-IID Effect of Momentum on CIFAR10 with LeNet} \end{figure} -Using D-Cliques (Section~\ref{section:d-cliques}) and Clique Averaging (Section~\ref{section:clique-averaging}), unbiased momentum can be calculated from the clique-unbiased average gradient $g_i^{(k)}$ of Algorithm~\ref{Algorithm:Clique-Unbiased-D-PSGD}: +Using Clique Averaging (Section~\ref{section:clique-averaging}), unbiased momentum can be calculated from the unbiased average gradient $g_i^{(k)}$ of Algorithm~\ref{Algorithm:Clique-Unbiased-D-PSGD}: \begin{equation} v_i^{(k)} \leftarrow m v_i^{(k-1)} + g_i^{(k)} \end{equation} @@ -406,11 +392,12 @@ It then suffices to modify the original gradient step to use momentum: x_i^{(k-\frac{1}{2})} \leftarrow x_i^{(k-1)} - \gamma v_i^{(k)} \end{equation} -Using momentum closes the gap, with a slightly lower convergence speed in the first 20 epochs, as illustrated in Figure~\ref{fig:d-cliques-cifar10-momentum-non-iid-clique-avg-effect}. We expect a similar approach could also enable other optimization techniques (CITE) to be usable in non-IID settings. +Using momentum closes the gap, with a slightly lower convergence speed in the first 20 epochs, as illustrated in Figure~\ref{fig:d-cliques-cifar10-momentum-non-iid-clique-avg-effect}. We expect a similar approach could enable other optimization techniques (CITE) in non-IID settings. \section{Comparison to Similar Non-Clustered Topologies} + \label{section:non-clustered} -We have previously shown that D-Cliques, can effectively provide similar convergence speed as a fully-connected topology and even a single IID node. We now show, in this section and the next, that the particular structure of D-Cliques is necessary. In particular, we show that similar results may not necessarily be obtained from a similar number of edges chosen at random. We therefore compare D-Cliques, with and without Clique Averaging, to a random topology on 100 nodes chosen such that each node has exactly 10 edges, which is similar and even slightly higher than the 9.9 edges on average of the previous D-Clique example (Fig.~\ref{fig:d-cliques-figure}). To better understand the effect of clustering, we also compare to a similar random topology where edges are chosen such that each node has neighbours of all possible classes but without them forming a clique. We finally also compare with an analogous of Clique Averaging, where all nodes de-bias their gradient with that of their neighbours, but since nodes do not form a clique, no node actually compute the same resulting average gradient. +We have previously shown that D-Cliques can effectively provide similar convergence speed as a fully-connected topology and even a single IID node. We now show, in this section and the next, that the particular structure of D-Cliques is necessary. In particular, we show that similar results may not necessarily be obtained from a similar number of edges chosen at random. 
We therefore compare D-Cliques, with and without Clique Averaging, to a random topology on 100 nodes chosen such that each node has exactly 10 edges, which is similar to, and even slightly higher than, the 9.9 edges on average of the previous D-Cliques example (Fig.~\ref{fig:d-cliques-figure}). To better understand the effect of clustering, we also compare to a similar random topology where edges are chosen such that each node has neighbours of all possible classes but without them forming a clique. We finally also compare with an analogue of Clique Averaging, where all nodes de-bias their gradient with that of their neighbours; but since nodes do not form a clique, no two nodes actually compute the same average gradient.

Results for MNIST and CIFAR10 are shown in Figure~\ref{fig:d-cliques-comparison-to-non-clustered-topologies}. For MNIST, a random topology has higher variance and lower convergence speed than D-Cliques, with or without Clique Averaging. However, a random topology with enforced diversity performs as well and even slightly better than D-Cliques without Clique Averaging. Suprisingly, a random topology with unbiased gradient performs worse than without, but only marginally, so this does not seem quite significant. Nonetheless, the D-Cliques topology with Clique Averaging performs better than any other random topology so it seems that clustering in this case has a small but significant effect.
@@ -437,6 +424,7 @@
For CIFAR10, the result is more dramatic, as Clique Averaging is critical for convergence (with momentum). All random topologies fail to converge, except when combining both node diversity and unbiased gradient, but in any case D-Cliques with Clique Averaging converges significantly faster. This suggests clustering helps reducing variance between nodes and therefore helps with convergence speed. We have tried to use LeNet on MNIST to see if the difference between MNIST and CIFAR10 could be attributed to the capacity difference between the Linear and Convolutional networks, whose optimization may benefit from clustering (see Appendix). The difference is less dramatic than for CIFAR10, so it must be that the dataset also has an impact. The exact nature of it is still an open question.

\section{Importance of Intra-Clique Full Connectivity}
+\label{section:intra-clique-connectivity}

Having established that clustering, i.e. the creation of cliques, has a significant effect, we evaluate the necessity of intra-clique full connectivity. Figure~\ref{fig:d-cliques-intra-connectivity} shows the convergence speed of D-Cliques with respectively 1 and 5 edges randomly removed, out of 45 (2 and 10 out of 90 if counting both direction separately), as well as with and without Clique Averaging (resulting in a biased average gradient within cliques). In all cases, both for MNIST and CIFAR10, it has significant effect on the convergence speed. In the case of CIFAR10, it also negates the benefits of D-Cliques. Full-connectivity within cliques is therefore necessary.

@@ -465,13 +453,13 @@

\section{Scaling with Different Inter-Clique Topologies}
\label{section:interclique-topologies}

-We finally evaluate the effect of the inter-clique topology on convergence speed on a larger network of 1000 nodes, dividing the batch size by 10, so the number of updates per epoch remains constant compared to the previous results for 100 nodes.
We compare the scalability and resulting convergence speed of different scheme based around D-Cliques, and therefore all using $O(nc)$ edges to create cliques as a foundation, where $n$ is the number of nodes and $c$ is the size of a clique. +We finally evaluate the effect of the inter-clique topology on convergence speed on a larger network of 1000 nodes, dividing the batch size by 10 so the number of updates per epoch remains constant compared to the previous results for 100 nodes. We compare the scalability and convergence speed of different scheme based around D-Cliques, and therefore all using $O(nc)$ edges to create cliques as a foundation, where $n$ is the number of nodes and $c$ is the size of a clique. First, the scheme that uses the fewest (almost\footnote{A path uses one less edge at significantly slower convergence speed and is therefore never really used in practice.}) number of extra edges is a \textit{ring}. A ring adds $\frac{n}{c} - 1$ inter-clique edges and therefore scales linearly in $O(n)$. -Second, surprisingly (to us), another scheme also scales linearly with a logarithmic bound on the averaging shortest number of hops between nodes, which we call "\textit{fractal}". In this scheme, when the number of nodes keeps growing, cliques are assembled in larger groups of $c$ cliques that are connected internally with one edge per pair of cliques, but with only one edge between pairs of larger groups. The scheme is recursive such that $c$ groups will themselves form a larger group the next level up. This scheme results in at most $nc$ edges per node if edges are evenly distributed, and therefore also scales linearly in the number of nodes. +Second, surprisingly (to us), another scheme also scales linearly with a logarithmic bound on the averaging shortest number of hops between nodes, which we call "\textit{fractal}". In this scheme, as nodes are added, cliques are assembled in larger groups of $c$ cliques that are connected internally with one edge per pair of cliques, but with only one edge between pairs of larger groups. The scheme is recursive such that $c$ groups will themselves form a larger group the next level up. This scheme results in at most $nc$ edges per node if edges are evenly distributed, and therefore also scales linearly in the number of nodes. -Third, cliques may also be connected in a smallworld-like~\cite{watts2000small} topology, that may be reminiscent of distributed-hash table designs such as Chord (CITE). In this scheme, cliques are first arranged in a ring as in the first scheme. Then each clique add symmetrically one edge, both clockwise and counter-clockwise on the ring, to the $k$ closest cliques in sets of cliques that are exponentially bigger the further they are on the ring, as detailed in Algorithm~\ref{Algorithm:Smallworld}. This ensures good clustering with other cliques that are close on the ring, while still keeping the average shortest path small (including nodes further on the ring). This scheme adds a $2klog(\frac{n}{c})$ inter-clique edges and therefore grows in the order of $O(n + log(n))$ with the number of nodes. +Third, cliques may also be connected in a smallworld-like~\cite{watts2000small} topology, that may be reminiscent of distributed-hash table designs such as Chord (CITE). In this scheme, cliques are first arranged in a ring. 
Then each clique add symmetric edges, both clockwise and counter-clockwise on the ring, to the $ns$ closest cliques in sets of cliques that are exponentially bigger the further they are on the ring, as detailed in Algorithm~\ref{Algorithm:Smallworld}. This ensures good clustering with other cliques that are close on the ring, while still keeping the average shortest path small (including nodes further on the ring). This scheme adds a $2(ns)log(\frac{n}{c})$ inter-clique edges and therefore grows in the order of $O(n + log(n))$ with the number of nodes. \begin{algorithm}[h] \caption{$\textit{smallworld}(DC)$: adds $O(\# N + log(\# N))$ edges} @@ -499,9 +487,9 @@ Third, cliques may also be connected in a smallworld-like~\cite{watts2000small} \end{algorithmic} \end{algorithm} -Finally, we can also fully connect cliques together, which bounds the average shortest path to $2$ between any pair of nodes. This adds $\frac{n}{c}(\frac{n}{c} - 1)$ edges, which scales quadratically in the number of nodes in $O(\frac{n^2}{c^2})$, which can be significant at larger scales when $n$ is large compared to $c$. +Finally, we also fully connect cliques together, which bounds the average shortest path to $2$ between any pair of nodes. This adds $\frac{n}{c}(\frac{n}{c} - 1)$ edges, which scales quadratically in the number of nodes, in $O(\frac{n^2}{c^2})$, which can be significant at larger scales when $n$ is large compared to $c$. -Figure~\ref{fig:d-cliques-cifar10-convolutional} shows convergence speeds for all schemes, both on MNIST and CIFAR10, compared to a single node IID performing the same number of updates per epoch (showing the faster convergence speed achievable if topology had no impact). A ring converges but is much slower. Our "fractal" scheme helps significantly, while still scaling linearly in the number of nodes. But the sweet spot really seems to be with a smallworld topology, as the convergence speed is almost the same to a fully-connected topology, but uses 22\% less edges at that scale (14.5 edges on average instead of 18.9), and seems to have potential to have larger benefits at larger scales. Nonetheless, even the fully-connected topology offers significant benefits with 1000 nodes, as it represents a 98\% reduction in the number of edges compared to fully connecting individual nodes (18.9 edges on average instead of 999) and a 96\% reduction in the number of messages (37.8 messages per round per node on average instead of 999). +Figure~\ref{fig:d-cliques-cifar10-convolutional} shows convergence speeds for all schemes, both on MNIST and CIFAR10, compared to a single node IID performing the same number of updates per epoch (showing the faster convergence speed achievable if topology had no impact). A ring converges but is much slower. Our "fractal" scheme helps significantly. But the sweet spot really seems to be with a smallworld topology, as the convergence speed is almost the same to a fully-connected topology, but uses 22\% less edges at that scale (14.5 edges on average instead of 18.9), and seems to have potential to have larger benefits at larger scales. Nonetheless, even the fully-connected topology offers significant benefits with 1000 nodes, as it represents a 98\% reduction in the number of edges compared to fully connecting individual nodes (18.9 edges on average instead of 999) and a 96\% reduction in the number of messages (37.8 messages per round per node on average instead of 999). 
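These counts can be re-derived in a few lines of Python (a sketch for the stated setting of $n=1000$ nodes and cliques of size $c=10$, assuming Clique Averaging doubles the number of messages per round; the 14.5 edges per node of the small-world variant is an empirical figure and is not re-derived here):
\begin{verbatim}
# Sanity check of the edge and message counts quoted above,
# for n = 1000 nodes grouped into 100 cliques of size c = 10.
n, c = 1000, 10
num_cliques = n // c

intra = num_cliques * c * (c - 1) // 2              # 4500 intra-clique edges
inter_full = num_cliques * (num_cliques - 1) // 2   # 4950 edges when fully
                                                    # connecting cliques pairwise

avg_degree = 2 * (intra + inter_full) / n   # 18.9 edges per node on average
messages = 2 * avg_degree                   # 37.8 messages per round per node,
                                            # assuming Clique Averaging sends
                                            # gradients and models separately

print(avg_degree, messages)        # 18.9 37.8
print(1 - avg_degree / (n - 1))    # ~0.981: 98% fewer edges than the 999
                                   # of a fully-connected graph
print(1 - messages / (n - 1))      # ~0.962: 96% fewer messages per round
\end{verbatim}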
\begin{figure}[htbp]
    \centering
@@ -525,9 +513,7 @@ Figure~\ref{fig:d-cliques-cifar10-convolutional} shows convergence speeds for al
\end{figure}

\section{Related Work}
-
-\aurelien{not sure yet if it is better to have this section here or earlier,
-we'll see}
+\label{section:related-work}

\aurelien{TODO: where to place TornadoAggregate and related refs?}

@@ -607,12 +593,9 @@ appropriate choice of data-dependent topology can effectively compensate for
non-IID data.

\section{Conclusion}
+\label{section:conclusion}

-We have shown the significant impact of the topology with non-IID data partitions in decentralized federated learning. We have proposed D-Cliques, a sparse topology that recovers the convergence speed and non-IID compensating behaviour of a fully-connected topology. D-Cliques are based on assembling cliques of diverse nodes such that their joint local distribution is representative of the global distribution, essentially locally recovering IID-ness. Cliques are joined in a sparse inter-clique topology such that they quickly converge to the same model. Within cliques, Clique Averaging can be used to remove the non-IID bias in gradient computation by averaging gradients only with other nodes of clique. Clique Averaging can in turn be used to implement unbiased momentum to recover the convergence speed usually possible with IID mini-batches.
-
-We have shown that the clustering of nodes in D-Cliques seems to benefit convergence speed, especially on CIFAR10 with a deep convolutional network, as random topologies with the same number of edges, diversity of nodes, and even unbiased gradients, converge slower. We have also show that full connectivity within cliques is critical, as removing even a single edge per clique significantly slows down converge.
-
-Finally, we have evaluated different inter-clique topologies with 1000 nodes and while they all provide significant reduction in the number of edges compared to fully connecting all nodes, a smallworld approach that scales in $O(n + log(n))$ in the number of nodes seems to be the most advantageous compromise between scalability and convergence speed.
+We have proposed D-Cliques, a sparse topology that recovers the convergence speed and non-IID compensating behaviour of a fully-connected topology. D-Cliques are based on assembling cliques of diverse nodes such that their joint local distribution is representative of the global distribution, essentially locally recovering IID-ness. Cliques are joined in a sparse inter-clique topology such that they quickly converge to the same model. Within cliques, Clique Averaging can be used to remove the non-IID bias in gradient computation by averaging gradients only with the other nodes of the clique. Clique Averaging can in turn be used to implement unbiased momentum, recovering the convergence speed usually possible only with IID mini-batches. We have shown the clustering of D-Cliques and full connectivity within cliques to be critical in obtaining these results. Finally, we have evaluated different inter-clique topologies with 1000 nodes. While they all provide a significant reduction in the number of edges compared to fully connecting all nodes, a smallworld approach that scales in $O(n + log(n))$ with the number of nodes seems to be the most advantageous compromise between scalability and convergence speed. The D-Cliques topology approach therefore seems promising to reduce bandwidth usage on FL servers and to implement fully decentralized alternatives in a wider range of applications.
%\section{Future Work} %\begin{itemize}