diff --git a/figures/d-cliques-cifar10-1000-vs-1-node-test-accuracy.png b/figures/d-cliques-cifar10-1000-vs-1-node-test-accuracy.png index 9bd7f7be0d0325d76e3f5f80b0af6a0bd14764ca..ff5071394e91e991787ab16e86221ca491d36836 100644 Binary files a/figures/d-cliques-cifar10-1000-vs-1-node-test-accuracy.png and b/figures/d-cliques-cifar10-1000-vs-1-node-test-accuracy.png differ diff --git a/figures/d-cliques-cifar10-clique-clustering-fcc.png b/figures/d-cliques-cifar10-clique-clustering-fcc.png index ffda6beb23b7fce85a439207124cd05b874e4de1..c7782064c632433cd5116223cd334a3727aa7713 100644 Binary files a/figures/d-cliques-cifar10-clique-clustering-fcc.png and b/figures/d-cliques-cifar10-clique-clustering-fcc.png differ diff --git a/figures/d-cliques-cifar10-linear-comparison-to-non-clustered-topologies.png b/figures/d-cliques-cifar10-linear-comparison-to-non-clustered-topologies.png index 6ff1a68cdd294c266ec8c88d94cd583368abcf1f..a85f283d5f3b2c2b77efa225416bb2490aa952fd 100644 Binary files a/figures/d-cliques-cifar10-linear-comparison-to-non-clustered-topologies.png and b/figures/d-cliques-cifar10-linear-comparison-to-non-clustered-topologies.png differ diff --git a/figures/d-cliques-cifar10-momentum-non-iid-clique-avg-effect.png b/figures/d-cliques-cifar10-momentum-non-iid-clique-avg-effect.png index a3f86a0788b4667838b46861e716aa2d02eabf69..93cc87ed2235c95478cfb26d3d1dfaec30dd2f6c 100644 Binary files a/figures/d-cliques-cifar10-momentum-non-iid-clique-avg-effect.png and b/figures/d-cliques-cifar10-momentum-non-iid-clique-avg-effect.png differ diff --git a/figures/d-cliques-cifar10-momentum-non-iid-effect.png b/figures/d-cliques-cifar10-momentum-non-iid-effect.png index ac11195ababb2f507516674108e7c9b18a66b439..71ba70479ace7ecc375e3a6f1bd661476b36c908 100644 Binary files a/figures/d-cliques-cifar10-momentum-non-iid-effect.png and b/figures/d-cliques-cifar10-momentum-non-iid-effect.png differ diff --git a/figures/d-cliques-mnist-1000-nodes-comparison.png b/figures/d-cliques-mnist-1000-nodes-comparison.png index 0e23710239c95ec6d4197493a1f688f611d3bd67..cf9efb7af2b26400fb9751c4f76d46fa4ae793ec 100644 Binary files a/figures/d-cliques-mnist-1000-nodes-comparison.png and b/figures/d-cliques-mnist-1000-nodes-comparison.png differ diff --git a/figures/d-cliques-mnist-clique-clustering-fcc.png b/figures/d-cliques-mnist-clique-clustering-fcc.png index c43d7fbbc24c7ff2ad840c91f3aa18caee1dee34..980ed24a1a87cd2b2aa1d27ec1ef9759fa8bc2a9 100644 Binary files a/figures/d-cliques-mnist-clique-clustering-fcc.png and b/figures/d-cliques-mnist-clique-clustering-fcc.png differ diff --git a/figures/d-cliques-mnist-linear-comparison-to-non-clustered-topologies.png b/figures/d-cliques-mnist-linear-comparison-to-non-clustered-topologies.png index b9187576c03690dad7b23a4d0834b19150c1a6e2..404e2c9fbd95184cd5bdcff76b166c0a5ebb79b6 100644 Binary files a/figures/d-cliques-mnist-linear-comparison-to-non-clustered-topologies.png and b/figures/d-cliques-mnist-linear-comparison-to-non-clustered-topologies.png differ diff --git a/main.tex b/main.tex index c4306e153f4963076298727a4526bf4fc202e05d..d6dc33e5a4a6c4205bd6b477357a85dce0a2cb38 100644 --- a/main.tex +++ b/main.tex @@ -14,7 +14,7 @@ \usepackage{soul} \usepackage{hyperref} \usepackage{algorithm} -\usepackage{algpseudocode} +\usepackage[noend]{algpseudocode} \usepackage{dsfont} \usepackage{caption} \usepackage{subcaption} @@ -30,7 +30,7 @@ \begin{document} % -\title{D-Cliques: An Efficient Topology to Compensate for Non-IID Data in Decentralized Learning} +\title{D-Cliques: 
Topology Can Compensate for Non-IIDness in Decentralized Federated Learning}
%
\titlerunning{D-Cliques}
% If the paper title is too long for the running head, you can set
@@ -205,6 +205,7 @@ on node $i$, and $\mathds{E}_{s_i \sim D_i} F_i(x_i;s_i)$ denotes the expected value of $F_i$ on a random sample $s_i$ drawn from $D_i$.

 \subsection{Non-IID Data}
+\label{section:non-iid-assumptions}

 Removing the assumption of \textit{independent and identically distributed} (IID) data opens a wide range of potential practical difficulties. While non-IID simply means that a local dataset is a biased sample of the global distribution $D$, the difficulty of the learning problem depends on additional factors that compound with that bias. For example, an imbalance in the number of examples for each class represented in the global distribution compounds with the position of the nodes that have the examples of the rarest class. Additionally, if two local datasets have different numbers of examples, the examples in the smaller dataset will be visited more often than those in a larger dataset, potentially skewing the optimisation process to perform better on the examples seen more often.
@@ -229,14 +230,17 @@ We use the Decentralized-Parallel Stochastic Gradient Descent, aka D-PSGD~\cite{

 D-PSGD can be used with a variety of models, including deep learning networks. In the rest of this paper, we use it with a linear (regression) model on MNIST, and with a deep convolutional network on CIFAR10.

+%To remove the impact of particular architectural choices on our results, we use a linear classifier (CITE). This model provides up to 92.5\% accuracy when fully converged on MNIST (CITE), about 7\% less than state-of-the-art deep learning networks (CITE).
+
 %\subsection{Clustering}
 %
 %From the perspective of one node, \textit{clustering} intuitively represents how many connections exist between its immediate neighbours. A high level of clustering means that neighbours have many edges between each other. The highest level is a \textit{clique}, where all nodes in the neighbourhood are connected to one another. Formally, the level of clustering, between $0$ and $1$, is the ratio of $\frac{\textit{nb edges between neighbours}}{\textit{nb possible edges}}$~\cite{watts2000small}.
 %

-\section{D-Cliques: Locally Recover IID within Cliques}
+\section{D-Cliques: Creating Locally Representative Cliques}
+\label{section:d-cliques}

-To have a preliminary intuition of the impact of non-IID data on convergence speed, examine the local neighbourhood of a single node in a grid similar to that used to obtain results in Figure~\ref{fig:grid-IID-vs-non-IID}, as illustrated in Figure~\ref{fig:grid-iid-vs-non-iid-neighbourhood}. The color of a node, represented as a circle, corresponds to one of the 10 available classes in the dataset. In this IID setting (Figure~\ref{fig:grid-iid-neighbourhood}), each node has examples of all ten classes in equal proportions. In the other (rather extreme) non-IID case (Figure~\ref{fig:grid-non-iid-neighbourhood}), each node has examples of only a single class and nodes are distributed randomly in the grid, with neighbourhood such as this one, sometimes having nodes with examples of the same class adjacent to each other.
+To have a preliminary intuition of the impact of non-IID data on convergence speed, examine the local neighbourhood of a single node in a grid similar to that used to obtain results in Figure~\ref{fig:grid-IID-vs-non-IID}. 
As illustrated in Figure~\ref{fig:grid-iid-vs-non-iid-neighbourhood}, the color of a node, represented as a circle, corresponds to one of the 10 available classes in the dataset. In this IID setting (Figure~\ref{fig:grid-iid-neighbourhood}), each node has examples of all ten classes in equal proportions. In the other (rather extreme) non-IID case (Figure~\ref{fig:grid-non-iid-neighbourhood}), each node has examples of only a single class and nodes are distributed randomly in the grid: in this particular example, the neighbourhood contains two adjacent nodes with examples of the same class.

\begin{figure}
	\centering
@@ -254,13 +258,50 @@ To have a preliminary intuition of the impact of non-IID data on convergence spe
	\label{fig:grid-iid-vs-non-iid-neighbourhood}
\end{figure}

-For the sake of the argument, assume all nodes are initialized with the same model weights, which is not critical for quick convergence in an IID setting but makes the comparison easier. A single training step, from the point of view of the middle node of Figure~\ref{fig:grid-IID-vs-non-IID}, is equivalent to sampling a mini-batch five times larger from the union of the local distributions of the five illustrated nodes.
+For the sake of the argument, assume all nodes are initialized with the same model weights, which is not critical for quick convergence but makes the comparison easier. A single training step, from the point of view of the middle node of Figure~\ref{fig:grid-IID-vs-non-IID}, is equivalent to sampling a mini-batch five times larger from the union of the local distributions of the five illustrated nodes.
+
+In the IID case, since gradients are computed from examples of all classes, the resulting average gradient will point in a direction that lowers the loss for all classes. This is the case because the components of the gradient that would only improve the loss on a subset of the classes to the detriment of others are cancelled by similar but opposite components from other classes. Therefore only the components that improve the loss for all classes remain.

-In the IID case, since gradients are computed from examples of all classes, the resulting average gradient will point in a direction that lowers the loss for all classes. This is the case because the components of the gradient that would only improve the loss on a subset of the classes to the detriment of others are cancelled by similar but opposite components from other classes. Therefore only the components that improve the loss for all classes remain. There is some variance remaining from the difference between examples but in practice it has a sufficiently small impact on convergence speed that there are still benefits from parallelizing the computations.
+However, in the (rather extreme) non-IID case illustrated, there are not enough nodes in the neighbourhood to remove the bias of the classes represented. Even if all nodes start from the same model weights, they will diverge from one another according to the classes represented in their neighbourhood, more than they would in the IID case. Moreover, as the distributed averaging algorithm takes several steps to converge, this variance is never fully resolved and some of it persists between steps.\footnote{It is possible, but impractical, to compensate for this effect by averaging multiple times before the next gradient computation. 
In effect, this trades connectivity (number of edges) for latency to give the same convergence speed, in number of gradients computed, as a fully connected graph.} This additional variance biases subsequent gradient computations as the gradients are computed further away from the global average, in addition to being computed from different examples. As shown in Figures~\ref{fig:ring-IID-vs-non-IID} and~\ref{fig:grid-IID-vs-non-IID}, this significantly slows down convergence speed to the point of making parallel optimization impractical.

-However, in the (rather extreme) non-IID case illustrated, there are not enough nodes in the neighbourhood to remove the bias of the classes represented. Even if all nodes start from the same model weights, they will diverge from one another according to the classes represented in their neighbourhood, more than they would have had in the IID case. As the distributed averaging algorithm takes several steps to converge, this variance is never fully resolved and the variance remains between steps.\footnote{It is possible, but impractical, to compensate for this effect by averaging multiple times before the next gradient computation. In effect, this trades connectivity (number of edges) for latency to give the same convergence speed, in number of gradients computed, as a fully connected graph.} This additional variance biases subsequent gradient computations as the gradients are computed further away from the global average, in addition to being computed from different examples. As shown in Figure~\ref{fig:ring-IID-vs-non-IID} and \ref{fig:grid-IID-vs-non-IID}, this significantly slows down convergence speed to the point of making parallel optimization impractical.

+\begin{algorithm}[h]
+   \caption{D-Clique Construction}
+   \label{Algorithm:D-Clique-Construction}
+   \begin{algorithmic}[1]
+      \State \textbf{Require} set of classes globally present $L$,
+      \State~~ set of all nodes $N = \{ 1, 2, \dots, n \}$,
+      \State~~ fn $\textit{classes}(S)$ that returns the classes present in a subset of nodes $S$,
+      \State~~ fn $\textit{intraconnect}(DC)$ that returns edges intraconnecting cliques of $DC$,
+      \State~~ fn $\textit{interconnect}(DC)$ that returns edges interconnecting cliques of $DC$ (Sec.~\ref{section:interclique-topologies})
+      \State~~ fn $\textit{weights}(E)$ that assigns weights to edges in $E$
+      \State $R \leftarrow \{ n~\text{for}~n \in N \}$ \Comment{Remaining nodes}
+      \State $DC \leftarrow \emptyset$ \Comment{D-Cliques}
+      \State $\textit{C} \leftarrow \emptyset$ \Comment{Current Clique}
+      \While{$R \neq \emptyset$}
+        \State $n \leftarrow \text{pick}~1~\text{from}~\{ m \in R | \textit{classes}(\{m\}) \not\subseteq \textit{classes}(\textit{C}) \}$ \Comment{Pick a node whose class is missing from $C$}
+        \State $R \leftarrow R \setminus \{ n \}; C \leftarrow C \cup \{ n \}$
+        \If{$\textit{classes}(C) = L$}
+           \State $DC \leftarrow DC \cup \{ C \}; C \leftarrow \emptyset$
+        \EndIf
+      \EndWhile
+      \State \Return $weights(\textit{intraconnect}(DC) \cup \textit{interconnect}(DC))$
+   \end{algorithmic}
+\end{algorithm}

+Under our non-IID assumptions (Section~\ref{section:non-iid-assumptions}), a balanced representation of classes, similar to that of the IID case, can be recovered by modifying the topology such that each node has direct neighbours of all classes. Moreover, as we shall show in the next sections, there are benefits in clustering neighbours into a \textit{clique}, such that the neighbours of a node are also directly connected to one another. 
To ensure all cliques still converge to a single model, a number of inter-clique connections are introduced, established directly between nodes that are part of cliques. Because the joint local distribution $D_{\textit{clique}} = \sum_{i \in \textit{clique}} D_i$ of a clique is representative of the global distribution, similar to the IID case, a sparse topology can be used between cliques, significantly reducing the total number of edges required to obtain quick convergence. And because the number of connections required per node is low and even, this approach is well suited to decentralized federated learning.
+
+The construction of the resulting \textit{decentralized cliques} (d-cliques) topology can be performed with Algorithm~\ref{Algorithm:D-Clique-Construction}. Essentially, each clique $C$ is constructed one node at a time, by picking nodes whose classes are not yet represented in $C$. Once all cliques are constructed, intra-clique and inter-clique edges are added.
+
+Finally, weights are assigned to edges to ensure quick convergence. For this study we use Metropolis-Hastings (CITE) which, while not offering optimal convergence speed in the general case, provides good convergence by taking into account the degree of immediate neighbours:
+
+\begin{equation}
+  W_{ij} = \begin{cases}
+    \frac{1}{\max(\text{degree}(i), \text{degree}(j)) + 1} & \text{if}~i \neq j~\text{and}~j~\text{is a neighbour of}~i \\
+    1 - \sum_{j \neq i} W_{ij} & \text{if}~i = j \\
+    0 & \text{otherwise}
+  \end{cases}
+\end{equation}

-\subsection{Creating Representative Cliques}
+In this paper, we focus on showing the convergence benefits of such a topology for decentralized federated learning. Algorithm~\ref{Algorithm:D-Clique-Construction} therefore centrally generates the topology, which is then tested in a simulator. We expect this algorithm to be straightforward to adapt for decentralized execution: the set of classes globally present, $L$, could be computed with PushSum (CITE), and the selection of neighbours done with PeerSampling (CITE).

\begin{figure}[htbp]
     \centering
@@ -276,46 +317,40 @@ However, in the (rather extreme) non-IID case illustrated, there are not enough
     \begin{subfigure}[b]{0.55\textwidth}
     \centering
     \includegraphics[width=\textwidth]{figures/d-cliques-mnist-vs-fully-connected.png}
-    \caption{\label{fig:d-cliques-example-convergence-speed} Convergence Speed on MNIST}
+    \caption{\label{fig:d-cliques-example-convergence-speed} Convergence Speed on MNIST. Y-axis starts at 80.}
     \end{subfigure}
\caption{\label{fig:d-cliques-example} D-Cliques}
\end{figure}

-The degree of \textit{skew} of local distributions $D_i$, i.e. how much the local distribution deviates from the global distribution on each node, influences the minimal size of cliques.
+Using Algorithm~\ref{Algorithm:D-Clique-Construction} on a network of 100 nodes generates the topology illustrated in Figure~\ref{fig:d-cliques-figure}, with the convergence speed shown in Figure~\ref{fig:d-cliques-example-convergence-speed}. The convergence speed is quite close to that of a fully-connected topology, and significantly better than that of the ring and grid of Figure~\ref{fig:iid-vs-non-iid-problem}. At a scale of 100 nodes, it uses only $\approx10\%$ of the edges of a fully-connected topology, a reduction of $\approx90\%$. Nonetheless, there is still significant variance in accuracy between nodes, which we address in the next section by removing the bias introduced by inter-clique edges. 
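% For reference, a minimal Python sketch of Algorithm~\ref{Algorithm:D-Clique-Construction}
% and of the Metropolis-Hastings weight assignment above, as one might implement them in the
% simulator. It assumes each node holds examples of a single class and that classes are
% equally represented across nodes (so a suitable node always exists); all helper names are
% ours and are not part of the paper's tooling.
%
% import random
% from itertools import combinations
%
% def build_cliques(node_class, classes):
%     """node_class: dict node -> class. Greedily fill cliques so that
%     each clique ends up with one node of every class."""
%     remaining, cliques, current = set(node_class), [], set()
%     while remaining:
%         # Pick a node whose class is not yet represented in the clique
%         missing = classes - {node_class[m] for m in current}
%         n = random.choice([m for m in remaining if node_class[m] in missing])
%         remaining.discard(n)
%         current.add(n)
%         if {node_class[m] for m in current} == classes:
%             cliques.append(current)
%             current = set()
%     return cliques
%
% def intraconnect(cliques):
%     # One edge per pair of nodes within each clique
%     return {frozenset(e) for c in cliques for e in combinations(c, 2)}
%
% def mh_weights(nodes, edges):
%     # Metropolis-Hastings mixing weights, computed from node degrees
%     deg = {i: sum(1 for e in edges if i in e) for i in nodes}
%     W = {}
%     for i, j in (tuple(e) for e in edges):
%         W[i, j] = W[j, i] = 1.0 / (max(deg[i], deg[j]) + 1)
%     for i in nodes:
%         W[i, i] = 1.0 - sum(W[i, j] for j in nodes if (i, j) in W)
%     return W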
-The global distribution of classes, for classification tasks, can be computed from the distribution of class examples on the nodes, with Distributed Averaging (CITE). Given the global distribution of classes, neighbours within cliques can be chosen based on a PeerSampling (CITE) service. Both services can be implemented such that they converge in a logarithmic number of steps compared to the number of nodes. It is therefore possible to obtain this information in a scalable way.
-
- In the rest of this paper, we assume these services are available and show that the approach provides a useful convergence speed after the cliques have been formed.
-
- TODO: Algo de construction des cliques (incluant les cas où les classes ne sont pas également représentées)
+
+%The degree of \textit{skew} of local distributions $D_i$, i.e. how much the local distribution deviates from the global distribution on each node, influences the minimal size of cliques.
+%
+%The global distribution of classes, for classification tasks, can be computed from the distribution of class examples on the nodes, with Distributed Averaging (CITE). Given the global distribution of classes, neighbours within cliques can be chosen based on a PeerSampling (CITE) service. Both services can be implemented such that they converge in a logarithmic number of steps compared to the number of nodes. It is therefore possible to obtain this information in a scalable way.
+%
+% In the rest of this paper, we assume these services are available and show that the approach provides a useful convergence speed after the cliques have been formed.

\section{Removing Gradient Bias from Inter-Clique Edges}
+\label{section:clique-averaging}

-Inter-clique connections create sources of bias. The distributed averaging algorithm, used by D-PSGD, relies on a good choice of weights for quick convergence, of which Metropolis-Hasting (CITE) provide a reasonable and inexpensive solution by considering only the immediate neighbours of every node. However, by averaging models after a gradient step, D-PSGD effectively gives a different weight to the gradient of neighbours.
+Inter-clique connections create sources of bias. By averaging models after a gradient step, D-PSGD effectively gives a different weight to the gradient of each neighbour.

\begin{figure}[htbp]
    \centering
-    \includegraphics[width=0.7\textwidth]{figures/connected-cliques-bias}
+    \includegraphics[width=0.5\textwidth]{figures/connected-cliques-bias}
    \caption{\label{fig:connected-cliques-bias} Sources of Bias in Connected Cliques: Non-uniform weights in neighbours of A (A has a higher weight); Non-uniform class representation in neighbours of B (extra green node).}
\end{figure}

-Figure~\ref{fig:connected-cliques-bias} illustrates the problem with the simplest case of two cliques connected by one inter-clique edge connecting the green node of the left clique with the purple node of the right clique. A simple Metropolis-Hasting weight assignment such as the following:
-\begin{equation}
-  W_{ij} = \begin{cases}
-   max(\text{degree}(i), \text{degree}(j)) + 1 & \text{if}~i \neq j \\
-   1 - \sum_{j \neq i} W_{ij} & \text{otherwise}
-  \end{cases}
-\end{equation}
+Figure~\ref{fig:connected-cliques-bias} illustrates the problem with the simplest case of two cliques connected by one inter-clique edge connecting the green node of the left clique with the purple node of the right clique. 
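Concretely, assuming cliques of 10 nodes as in our experiments, the weights follow directly from the node degrees (a worked instance of the weight equation of Section~\ref{section:d-cliques}; node A has degree 9, while the green node carrying the inter-clique edge has degree 10):
\[
W_{A,\textit{green}} = \frac{1}{\max(9,10)+1} = \frac{10}{110}, \qquad
W_{A,j} = \frac{1}{\max(9,9)+1} = \frac{11}{110}~\text{for A's 8 other neighbours},
\]
\[
W_{A,A} = 1 - 8 \cdot \frac{11}{110} - \frac{10}{110} = \frac{12}{110}.
\]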
Node A thus has a weight of $\frac{12}{110}$ while all of A's neighbours have a weight of $\frac{11}{110}$, except the green node connected to B, which has a weight of $\frac{10}{110}$. This weight assignment therefore biases the gradient towards A's class and away from the green class. The same analysis holds for all other nodes without inter-clique edges. For node B, all neighbours, as well as B itself, have weights of $\frac{1}{11}$. However, the green class is represented twice while all other classes are represented only once. This biases the gradient toward the green class. The combined effect of these two sources of bias is to increase the variance between models after a D-PSGD step of training.

-We solve this problem by decoupling the gradient averaging from the weight averaging by sending each in separate rounds of messages. Only the gradients of neighbours within the same clique are used to compute the average gradient, which provides an equal representation to all classes in the computation of the average gradient. But the model weights of all neighbours, including those across inter-clique edges, are used for computing the distributed average of models, which ensures that all models eventually converge to the same value. The clique-unbiased version of D-PSGD is listed in Algorithm~\ref{Algorithm:Clique-Unbiased-D-PSGD}.
-
\begin{algorithm}[h]
-   \caption{D-Clique (Clique-Unbiased D-PSGD), Node $i$}
+   \caption{Clique-Unbiased D-PSGD, Node $i$}
   \label{Algorithm:Clique-Unbiased-D-PSGD}
   \begin{algorithmic}[1]
      \State \textbf{Require} initial model parameters $x_i^{(0)}$, learning rate $\gamma$, mixing weights $W$, number of steps $K$, loss function $F$
@@ -328,34 +363,28 @@ We solve this problem by decoupling the gradient averaging from the weight avera
   \end{algorithmic}
\end{algorithm}

+We solve this problem with a clique-unbiased version of D-PSGD, listed in Algorithm~\ref{Algorithm:Clique-Unbiased-D-PSGD}: gradient averaging is decoupled from weight averaging by sending each in separate rounds of messages. Only the gradients of neighbours within the same clique are used to compute the average gradient, which gives an equal representation to all classes. But the models of all neighbours, including those across inter-clique edges, are used for computing the distributed average of models, as in the original version.

% To regenerate figure, from results/mnist:
% python ../../../learn-topology/tools/plot_convergence.py fully-connected/all/2021-03-10-09:25:19-CET no-init-no-clique-avg/fully-connected-cliques/all/2021-03-12-11:12:49-CET no-init/fully-connected-cliques/all/2021-03-12-11:12:01-CET --add-min-max --yaxis test-accuracy --labels '100 nodes non-IID fully-connected' '100 nodes non-IID d-cliques w/o clique avg.' '100 nodes non-IID w/ clique avg.' --legend 'lower right' --ymin 89 --ymax 92.5 --font-size 13 --save-figure ../../figures/d-clique-mnist-clique-avg.png
\begin{figure}[htbp]
    \centering
    \includegraphics[width=0.55\textwidth]{figures/d-clique-mnist-clique-avg}
-\caption{\label{fig:d-clique-mnist-clique-avg} Effect of Clique Averaging on MNIST}
+\caption{\label{fig:d-clique-mnist-clique-avg} Effect of Clique Averaging on MNIST. Y-axis starts at 89.}
\end{figure}

+As illustrated in Figure~\ref{fig:d-clique-mnist-clique-avg}, this significantly reduces variance between nodes and accelerates convergence: the node with the lowest accuracy now performs as well as the average node does without clique averaging. 
The convergence speed is now essentially identical to that obtained when fully connecting all nodes. These benefits are obtained at a higher messaging cost, double that of D-PSGD without clique averaging, and at an increased latency for a single training step, since two rounds of messages are required. Nonetheless, compared to fully connecting all nodes, the total number of messages is reduced by $\approx 80\%$. MNIST and a linear model are relatively simple, so the next section shows the approach also works with a harder dataset and a higher-capacity model.

-\subsection{Implementing Unbiased Momentum with Clique Averaging}
-
-Momentum (CITE), which increases the magnitude of the components of the gradient that are shared between several consecutive steps, is critical for making convolutional networks converge quickly. However it relies on mini-batches to be IID, otherwise, it greatly increases variance between nodes and is actually detrimental to convergence speed.

\section{Implementing Momentum with Clique Averaging}

-Momentum can easily be used with D-Cliques, simply by calculating it from the clique-unbiased average gradient $g_i^{(k)}$ of Algorithm~\ref{Algorithm:Clique-Unbiased-D-PSGD}:
-\begin{equation}
-v_i^{(k)} \leftarrow m v_i^{(k-1)} + g_i^{(k)}
-\end{equation}
-It then suffices to modify the original gradient step to use momentum:
-\begin{equation}
-x_i^{(k-\frac{1}{2})} \leftarrow x_i^{(k-1)} - \gamma v_i^{(k)}
-\end{equation}

+Training higher-capacity models, such as a deep convolutional network, on harder datasets, such as CIFAR10, is usually done with additional optimization techniques that accelerate convergence in centralized settings. But these techniques sometimes rely on an IID assumption on local distributions, which does not hold in more general cases. We show here how Clique Averaging (Section~\ref{section:clique-averaging}) easily enables the implementation of such optimization techniques in the more general non-IID setting with D-Cliques.

-In addition, it is important that all nodes are initialized with the same model value at the beginning. Otherwise, the random initialization of models introduces another source of variance that persists over many steps. In combination with D-Clique (Algorithm~\ref{Algorithm:Clique-Unbiased-D-PSGD}), this provides the convergence results of Figure~\ref{fig:d-cliques-cifar10-convolutional}. To assess how far this would be from an "optimal" solution, in which the delay introduced by multiple hops between nodes is completely removed, we also show the convergence speed of a single node that would compute its average gradient from all the samples obtained by all nodes in a single round. The results show that minus the variance introduced by the multiple hops between nodes, which slows the convergence of the distributed averaging of models, the convergence speed on average is close to the optimal, when the distributed average is computed exactly every step.

+In particular, we implement momentum (CITE), which increases the magnitude of the components of the gradient that are shared between several consecutive steps. Momentum is critical for making deep convolutional networks, such as LeNet, converge quickly. However, a direct application of momentum in a non-IID setting can actually be detrimental. As illustrated in Figure~\ref{fig:d-cliques-cifar10-momentum-non-iid-effect}, the convergence of LeNet on CIFAR10 with momentum, on 100 nodes using the d-cliques topology, is so poor that the network effectively fails to converge. 
To put things in context, we compare the convergence speed to that of a single centralized IID node performing the same number of updates per epoch, and therefore using a batch size 100 times larger: this is essentially equivalent to completely removing the impact of the topology, non-IIDness, and decentralized averaging on the convergence speed. As shown, not using momentum gives a better convergence speed, but it is still far from that obtained with a single centralized IID node, so momentum remains necessary.

\begin{figure}[htbp]
     \centering
          % To regenerate figure, from results/cifar10
-     % python ../../../learn-topology/tools/plot_convergence.py 1-node-iid/all/2021-03-10-13:52:58-CET no-init-no-clique-avg/fully-connected-cliques/all/2021-03-13-18:34:35-CET --legend 'center right' --add-min-max --labels '1-node IID w/ momentum' '100 nodes non-IID d-cliques w/ momentum' --font-size 14 --yaxis test-accuracy --save-figure ../../figures/d-cliques-cifar10-momentum-non-iid-effect.png
+     % python ../../../learn-topology/tools/plot_convergence.py 1-node-iid/all/2021-03-10-13:52:58-CET no-init-no-clique-avg/fully-connected-cliques/all/2021-03-13-18:34:35-CET no-init-no-clique-avg-no-momentum/fully-connected-cliques/all/2021-03-26-13:47:35-CET/ --legend 'upper right' --add-min-max --labels '1-node IID w/ momentum' '100 nodes non-IID d-cliques w/ momentum' '100 nodes non-IID d-cliques w/o momentum' --font-size 14 --yaxis test-accuracy --save-figure ../../figures/d-cliques-cifar10-momentum-non-iid-effect.png --ymax 100
     \begin{subfigure}[b]{0.45\textwidth}
         \centering
         \includegraphics[width=\textwidth]{figures/d-cliques-cifar10-momentum-non-iid-effect}
     \end{subfigure}
     \hfill
          % To regenerate figure, from results/cifar10
-     % python ../../../learn-topology/tools/plot_convergence.py 1-node-iid/all/2021-03-10-13:52:58-CET no-init/fully-connected-cliques/all/2021-03-13-18:32:55-CET --legend 'lower right' --add-min-max --labels '1-node IID w/ momentum' '100 nodes non-IID d-clique w/ momentum' --font-size 14 --yaxis test-accuracy --save-figure ../../figures/d-cliques-cifar10-momentum-non-iid-clique-avg-effect.png
+     % python ../../../learn-topology/tools/plot_convergence.py 1-node-iid/all/2021-03-10-13:52:58-CET no-init/fully-connected-cliques/all/2021-03-13-18:32:55-CET --legend 'upper right' --add-min-max --labels '1-node IID w/ momentum' '100 nodes non-IID d-clique w/ momentum' --font-size 14 --yaxis test-accuracy --save-figure ../../figures/d-cliques-cifar10-momentum-non-iid-clique-avg-effect.png --ymax 100
     \begin{subfigure}[b]{0.45\textwidth}
         \centering
         \includegraphics[width=\textwidth]{figures/d-cliques-cifar10-momentum-non-iid-clique-avg-effect}
         \caption{\label{fig:d-cliques-cifar10-momentum-non-iid-clique-avg-effect} With Clique Averaging}
     \end{subfigure}
-
-\caption{\label{fig:cifar10-momentum} Non-IID Effect of Momentum}
-\end{figure}
-
-
-\section{Scaling with Different Inter-Clique Topologies}
-
-The \textit{"redundancy"} of the data, i.e. how much each additional example in the training set contributes to the final accuracy, influences the minimum number of connections required between cliques to reach a given convergence speed. It needs to be evaluated empirically on a learning task. In effect, redundancy is the best parallelization factor as the more redundant the dataset is, the less nodes need to communicate. 
For the following arguments, $n$ is the number of nodes and $c$ is the size of a clique. - -For highly redundant datasets, it may be sufficient to arrange cliques in a ring. This is not specific to D-Cliques, it is also the case with IID nodes but it is nonetheless useful to be kept in mind for D-Cliques also. In this case, the number of edges will be $O(nc + \frac{n}{c})$ and therefore linear in the number of nodes $n$. - -For cases with limited redundancy, nodes can be arranged such that they are at most 2 hops away from any other nodes in the network to quickly propagate updates in the network. In effect, this is equivalent to fully connecting cliques (instead of nodes). In this case, the number of edges will be $O(nc + \frac{n^2}{c^2})$ and therefore still exponential in the number of nodes but with a strong reduction in the number of edges when $c$ is large compared to $n$ (ex: $c \geq \frac{n}{100}$). - -In between, there might be enough redundancy in the dataset to arrange cliques in a fractal/hierarchical pattern such that the maximum number of hops between nodes grows logarithmically with $n$. TODO: Complexity argument. - -If we relax the constraint of regularity, a trivial solution is a star topology, as used in most Federated Learning implementations (CITE) at the expense of a high requirement on reliability and available bandwidth on the central node. We instead propose a regular topology, built around \textit{cliques} of dissimilar nodes, locally representative of the global distribution and connected by few links, as illustrated in Figure~\ref{fig:d-cliques-example}. D-Cliques enable similar convergence speed as a fully connected topology, using a number of edges that grows sub-exponentially ($O(nc + \frac{n^2}{c^2})$ where $n$ is the number of nodes and $c$ is the size of a clique\footnote{$O((\frac{n}{c})c^2 + (\frac{n}{c})^2)$, i.e. number of cliques times the number of edges within cliques (squared in the size of cliques) in addition to inter-cliques edges (square of the number of cliques).}.), instead of exponentially in the number of nodes ($O(n^2)$), with a corresponding reduction in bandwidth usage and required number of messages per round of training. In practice, for the cases with networks of size 100 we have tested, that corresponds to a reduction in the number of edges of 90\%. 
(TODO: Do analysis if the pattern is fractal with three levels at 1000 nodes: cliques, 10 cliques connected pairwise in a "region", and each "region" connected pairwise with other regions) - -\begin{figure}[htbp] - \centering -% To regenerate the figure, from directory results/mnist -% python ../../../learn-topology/tools/plot_convergence.py 1-node-iid/all/2021-03-10-09:20:03-CET fully-connected/all/2021-03-10-09:25:19-CET clique-ring/all/2021-03-10-18:14:35-CET fully-connected-cliques/all/2021-03-10-10:19:44-CET --add-min-max --yaxis test-accuracy --labels '1-node IID bsz=12800' '100-nodes non-IID fully-connected bsz=128' '100-nodes non-IID D-Cliques (Ring) bsz=128' '100-nodes non-IID D-Cliques (Fully-Connected) bsz=128' --legend 'lower right' --ymin 85 --ymax 92.5 --save-figure ../../figures/d-cliques-mnist-vs-1-node-test-accuracy.png - \centering - \includegraphics[width=0.7\textwidth]{figures/d-cliques-mnist-vs-1-node-test-accuracy} - \caption{\label{fig:d-cliques-mnist-linear-w-clique-averaging-w-initial-averaging} MNIST: D-Cliques Convergence Speed (100 nodes, Constant Updates per Epoch)} -\end{figure} - - % To regenerate the figure, from directory results/mnist - % python ../../../learn-topology/tools/plot_convergence.py 1-node-iid/all/2021-03-10-09:20:03-CET ../scaling/1000/mnist/fully-connected-cliques/all/2021-03-14-17:56:26-CET ../scaling/1000/mnist/smallworld-logn-cliques/all/2021-03-23-21:45:39-CET ../scaling/1000/mnist/fractal-cliques/all/2021-03-14-17:41:59-CET ../scaling/1000/mnist/clique-ring/all/2021-03-13-18:22:36-CET --add-min-max --yaxis test-accuracy --legend 'lower right' --ymin 84 --ymax 92.5 --labels '1 node IID' 'd-cliques (fully-connected cliques)' 'd-cliques (smallworld)' 'd-cliques (fractal)' 'd-cliques (ring)' --save-figure ../../figures/d-cliques-mnist-1000-nodes-comparison.png -\begin{figure}[htbp] - \centering - \includegraphics[width=0.7\textwidth]{figures/d-cliques-mnist-1000-nodes-comparison} - \caption{\label{fig:d-cliques-mnist-1000-nodes-comparison} MNIST: D-Clique Convergence Speed (1000 nodes, Constant Updates per Epoch)} -\end{figure} - - - \begin{figure}[htbp] - \centering - % To regenerate the figure, from directory results/cifar10 -% python ../../../learn-topology/tools/plot_convergence.py 1-node-iid/all/2021-03-10-13:52:58-CET clique-ring/all/2021-03-10-11:58:43-CET fully-connected-cliques/all/2021-03-10-13:58:57-CET --add-min-max --yaxis training-loss --labels '1-node IID bsz=2000' '100-nodes non-IID D-Cliques (Ring) bsz=20' '100-nodes non-IID D-Cliques (Fully-Connected) bsz=20' --legend 'lower right' --save-figure ../../figures/d-cliques-cifar10-vs-1-node-training-loss.png - \begin{subfigure}[b]{0.48\textwidth} - \centering - \includegraphics[width=\textwidth]{figures/d-cliques-cifar10-vs-1-node-training-loss} -\caption{\label{fig:d-cliques-cifar10-training-loss} Training Loss} - \end{subfigure} - \hfill - % To regenerate the figure, from directory results/cifar10 -% python ../../../learn-topology/tools/plot_convergence.py 1-node-iid/all/2021-03-10-13:52:58-CET clique-ring/all/2021-03-10-11:58:43-CET fully-connected-cliques/all/2021-03-10-13:58:57-CET --add-min-max --yaxis test-accuracy --labels '1-node IID bsz=2000' '100-nodes non-IID D-Cliques (Ring) bsz=20' '100-nodes non-IID D-Cliques (Fully-Connected) bsz=20' --legend 'lower right' --save-figure ../../figures/d-cliques-cifar10-vs-1-node-test-accuracy.png - \begin{subfigure}[b]{0.48\textwidth} - \centering - 
\includegraphics[width=\textwidth]{figures/d-cliques-cifar10-vs-1-node-test-accuracy}
-\caption{\label{fig:d-cliques-cifar10-test-accuracy} Test Accuracy}
-    \end{subfigure}
-\caption{\label{fig:d-cliques-cifar10-convolutional} D-Cliques Convergence Speed with Convolutional Network on CIFAR10 (100 nodes, Constant Updates per Epoch).}
-\end{figure}
-
-
-\begin{figure}[htbp]
-    \centering
-    % To regenerate the figure, from directory results/cifar10
-% python ../../../learn-topology/tools/plot_convergence.py 1-node-iid/all/2021-03-10-13:52:58-CET ../scaling/1000/cifar10/fully-connected-cliques/all/2021-03-14-17:41:20-CET ../scaling/1000/cifar10/smallworld-logn-cliques/all/2021-03-23-22:13:57-CET ../scaling/1000/cifar10/fractal-cliques/all/2021-03-14-17:42:46-CET ../scaling/1000/cifar10/clique-ring/all/2021-03-14-09:55:24-CET --add-min-max --yaxis training-loss --labels '1-node IID' 'd-cliques (fully-connected cliques)' 'd-cliques (smallworld)' 'd-cliques (fractal)' 'd-cliques (ring)' --legend 'upper right' --ymax 3 --save-figure ../../figures/d-cliques-cifar10-1000-vs-1-node-training-loss.png
-    \begin{subfigure}[b]{0.48\textwidth}
-        \centering
-        \includegraphics[width=\textwidth]{figures/d-cliques-cifar10-1000-vs-1-node-training-loss}
-\caption{\label{fig:d-cliques-cifar10-1000-vs-1-node-training-loss} Training Loss}
-    \end{subfigure}
-    \hfill
-    % To regenerate the figure, from directory results/cifar10
-% python ../../../learn-topology/tools/plot_convergence.py 1-node-iid/all/2021-03-10-13:52:58-CET ../scaling/1000/cifar10/fully-connected-cliques/all/2021-03-14-17:41:20-CET ../scaling/1000/cifar10/smallworld-logn-cliques/all/2021-03-23-22:13:57-CET ../scaling/1000/cifar10/fractal-cliques/all/2021-03-14-17:42:46-CET ../scaling/1000/cifar10/clique-ring/all/2021-03-14-09:55:24-CET --add-min-max --yaxis test-accuracy --labels '1-node IID' 'd-cliques (fully-connected cliques)' 'd-cliques (smallworld)' 'd-cliques (fractal)' 'd-cliques (ring)' --legend 'lower right' --save-figure ../../figures/d-cliques-cifar10-1000-vs-1-node-test-accuracy.png
-    \begin{subfigure}[b]{0.48\textwidth}
-        \centering
-        \includegraphics[width=\textwidth]{figures/d-cliques-cifar10-1000-vs-1-node-test-accuracy}
-\caption{\label{fig:d-cliques-cifar10-1000-vs-1-node-test-accuracy} Test Accuracy}
-    \end{subfigure}
-\caption{\label{fig:d-cliques-cifar10-convolutional} D-Cliques Convergence Speed with Convolutional Network on CIFAR10 (1000 nodes, Constant Updates per Epoch).}
+\caption{\label{fig:cifar10-momentum} Non-IID Effect of Momentum on CIFAR10 with LeNet}
\end{figure}

+Using D-Cliques (Section~\ref{section:d-cliques}) and Clique Averaging (Section~\ref{section:clique-averaging}), unbiased momentum can be calculated from the clique-unbiased average gradient $g_i^{(k)}$ of Algorithm~\ref{Algorithm:Clique-Unbiased-D-PSGD}:
+\begin{equation}
+v_i^{(k)} \leftarrow m v_i^{(k-1)} + g_i^{(k)}
+\end{equation}
+It then suffices to modify the original gradient step to use momentum:
+\begin{equation}
+x_i^{(k-\frac{1}{2})} \leftarrow x_i^{(k-1)} - \gamma v_i^{(k)}
+\end{equation}
+Using momentum closes the gap, with a slightly lower convergence speed in the first 20 epochs, as illustrated in Figure~\ref{fig:d-cliques-cifar10-momentum-non-iid-clique-avg-effect}. We expect that a similar approach could also make other optimization techniques (CITE) usable in non-IID settings. 
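% For reference, a minimal simulator-style Python sketch of one synchronous step of
% Clique-Unbiased D-PSGD (Algorithm~\ref{Algorithm:Clique-Unbiased-D-PSGD}) combined with
% the momentum updates above. x and v map nodes to numpy arrays; grad(j, w) returns node
% j's stochastic gradient at w; cliques_of, neighbours and W come from the topology
% construction. Names are ours; details elided by the algorithm listing are assumptions.
%
% import numpy as np
%
% def step(x, v, W, cliques_of, neighbours, grad, gamma=0.1, m=0.9):
%     # Round 1 of messages: clique-unbiased average gradient g_i
%     g = {i: np.mean([grad(j, x[j]) for j in cliques_of[i] | {i}], axis=0)
%          for i in x}
%     for i in x:                                  # momentum on the unbiased gradient
%         v[i] = m * v[i] + g[i]
%     half = {i: x[i] - gamma * v[i] for i in x}   # local SGD half-step
%     # Round 2 of messages: average models with ALL neighbours, including
%     # those across inter-clique edges, so models keep converging together
%     return {i: sum(W[i, j] * half[j] for j in neighbours[i] | {i})
%             for i in x}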
-\section{Evaluation}
+ \section{Comparison to Similar Non-Clustered Topologies}

- \subsection{How do D-Cliques compare to similar non-clustered topologies?}

+We have previously shown that D-Cliques can effectively provide a convergence speed similar to that of a fully-connected topology, and even of a single IID node. We now show, in this section and the next, that the particular structure of D-Cliques is necessary. In particular, we show that similar results are not necessarily obtained from a similar number of edges chosen at random. We therefore compare d-cliques, with and without clique averaging, to a random topology chosen such that each node has exactly 10 edges, which is similar to, and even slightly higher than, the average of 9.9 edges per node of the D-Clique topology of Fig.~\ref{fig:d-cliques-figure} on 100 nodes. To better understand the effect of clustering, we also compare to a similar random topology where edges are chosen such that each node has neighbours of all possible classes, but without them forming a clique. We finally compare with an analogue of Clique Averaging, where all nodes de-bias their gradient with those of their neighbours; but since nodes do not form cliques, no two nodes actually compute the same average gradient.

-To remove the impact of particular architectural choices on our results, we use a linear classifier (CITE). This model provides up to 92.5\% accuracy when fully converged on MNIST (CITE), about 7\% less than state-of-the-art deep learning networks (CITE).

+Results for MNIST and CIFAR10 are shown in Figure~\ref{fig:d-cliques-comparison-to-non-clustered-topologies}. For MNIST, a random topology has higher variance and lower convergence speed than D-Cliques, with or without Clique Averaging. However, a random topology with enforced diversity performs as well as, and even slightly better than, D-Cliques without Clique Averaging. Surprisingly, a random topology with unbiased gradient performs marginally worse than without, which does not appear significant. Nonetheless, the D-Cliques topology with Clique Averaging performs better than all random topologies, so clustering appears to have a small but significant effect in this case.

\begin{figure}[htbp]
     \centering
     \begin{subfigure}[b]{0.48\textwidth}
% To regenerate the figure, from directory results/mnist
% python ../../../learn-topology/tools/plot_convergence.py fully-connected-cliques/all/2021-03-10-10:19:44-CET no-init-no-clique-avg/fully-connected-cliques/all/2021-03-12-11:12:49-CET random-10/all/2021-03-17-20:28:12-CET random-10-diverse/all/2021-03-17-20:28:35-CET --labels 'd-clique (fcc)' 'd-clique (fcc) no clique avg.' 
'10 random edges' '10 random edges (all classes represented)' --add-min-max --legend 'lower right' --ymin 88 --ymax 92.5 --yaxis test-accuracy --save-figure ../../figures/d-cliques-mnist-linear-comparison-to-non-clustered-topologies.png --font-size 13 \centering \includegraphics[width=\textwidth]{figures/d-cliques-mnist-linear-comparison-to-non-clustered-topologies} - \caption{\label{fig:d-cliques-mnist-linear-comparison-to-non-clustered-topologies} Linear Model} - \end{subfigure} - \hfill - \begin{subfigure}[htb]{0.48\textwidth} -% To regenerate the figure, from directory results/mnist -% python ../../../learn-topology/tools/plot_convergence.py fully-connected-cliques/all/2021-03-10-10:19:44-CET no-init-no-clique-avg/fully-connected-cliques/all/2021-03-12-11:12:49-CET random-10/all/2021-03-17-20:28:12-CET random-10-diverse/all/2021-03-17-20:28:35-CET random-10-diverse-unbiased-grad/all/2021-03-17-20:29:04-CET --labels 'd-clique (fcc)' 'd-clique (fcc) no clique avg. no uniform init.' '10 random edges' '10 random edges (all classes represented)' '10 random edges (all classes repr.) with unbiased grad.' --add-min-max --legend 'upper right' --ymax 0.7 --yaxis scattering --save-figure ../../figures/d-cliques-mnist-linear-comparison-to-non-clustered-topologies-scattering.png - \centering - \includegraphics[width=\textwidth]{figures/d-cliques-mnist-linear-comparison-to-non-clustered-topologies-scattering} - \caption{\label{fig:d-cliques-mnist-linear-comparison-to-non-clustered-topologies-scattering} Linear Model (Scattering)} - \end{subfigure} - \\ - \begin{subfigure}[htb]{0.48\textwidth} -% To regenerate the figure, from directory results/mnist/gn-lenet -% python ../../../../learn-topology/tools/plot_convergence.py no-init/all/2021-03-22-21:39:54-CET no-init-no-clique-avg/all/2021-03-22-21:40:16-CET random-10/all/2021-03-22-21:41:06-CET random-10-diverse/all/2021-03-22-21:41:46-CET random-10-diverse-unbiased-grad/all/2021-03-22-21:42:04-CET --legend 'lower right' --add-min-max --labels 'd-clique (fcc) clique avg.' 'd-clique (fcc) no clique avg.' '10 random edges' '10 random edges (all classes repr.)' '10 random edges (all classes repr.) with unbiased grad.' --ymin 80 --yaxis test-accuracy --save-figure ../../../figures/d-cliques-mnist-lenet-comparison-to-non-clustered-topologies.png - \includegraphics[width=\textwidth]{figures/d-cliques-mnist-lenet-comparison-to-non-clustered-topologies} - \caption{\label{fig:d-cliques-mnist-lenet-comparison-to-non-clustered-topologies} LeNet Model} - \end{subfigure} - \hfill - \begin{subfigure}[htb]{0.48\textwidth} -% To regenerate the figure, from directory results/mnist/gn-lenet -% python ../../../../learn-topology/tools/plot_convergence.py no-init/all/2021-03-22-21:39:54-CET no-init-no-clique-avg/all/2021-03-22-21:40:16-CET random-10/all/2021-03-22-21:41:06-CET random-10-diverse/all/2021-03-22-21:41:46-CET random-10-diverse-unbiased-grad/all/2021-03-22-21:42:04-CET --legend 'upper right' --add-min-max --labels 'd-clique (fcc) clique avg.' 'd-clique (fcc) no clique avg.' '10 random edges' '10 random edges (all classes repr.)' '10 random edges (all classes repr.) with unbiased grad.' 
--ymax 0.7 --yaxis scattering --save-figure ../../../figures/d-cliques-mnist-lenet-comparison-to-non-clustered-topologies-scattering.png - \includegraphics[width=\textwidth]{figures/d-cliques-mnist-lenet-comparison-to-non-clustered-topologies-scattering} - \caption{\label{fig:d-cliques-mnist-lenet-comparison-to-non-clustered-topologies-scattering} LeNet Model (Scattering)} - \end{subfigure} - - - \caption{\label{fig:d-cliques-mnist-comparison-to-non-clustered-topologies} MNIST: Comparison to non-Clustered Topologies} -\end{figure} - - \begin{figure}[htbp] - \centering - + \caption{MNIST with Linear Model} + \end{subfigure} + \hfill % To regenerate the figure, from directory results/cifar10 -% python ../../../learn-topology/tools/plot_convergence.py fully-connected-cliques/all/2021-03-10-13:58:57-CET no-init-no-clique-avg/fully-connected-cliques/all/2021-03-13-18:34:35-CET random-10/all/2021-03-17-20:30:03-CET random-10-diverse/all/2021-03-17-20:30:41-CET random-10-diverse-unbiased-gradient/all/2021-03-17-20:31:14-CET random-10-diverse-unbiased-gradient-uniform-init/all/2021-03-17-20:31:41-CET --labels 'd-clique (fcc) clique avg., uniform init.' 'd-clique (fcc) no clique avg. no uniform init.' '10 random edges' '10 random edges (all classes repr.)' '10 random (all classes repr.) with unbiased grad.' '10 random (all classes repr.) with unbiased grad., uniform init.' --add-min-max --legend 'upper left' --yaxis test-accuracy --save-figure ../../figures/d-cliques-cifar10-linear-comparison-to-non-clustered-topologies.png --ymax 100 +% python ../../../learn-topology/tools/plot_convergence.py no-init/fully-connected-cliques/all/2021-03-13-18:32:55-CET no-init-no-clique-avg/fully-connected-cliques/all/2021-03-13-18:34:35-CET random-10/all/2021-03-17-20:30:03-CET random-10-diverse/all/2021-03-17-20:30:41-CET random-10-diverse-unbiased-gradient/all/2021-03-17-20:31:14-CET --labels 'd-clique (fcc) clique avg.' 'd-clique (fcc) no clique avg.' '10 random edges' '10 random edges (all classes repr.)' '10 random (all classes repr.) with unbiased grad.' --add-min-max --legend 'upper left' --yaxis test-accuracy --save-figure ../../figures/d-cliques-cifar10-linear-comparison-to-non-clustered-topologies.png --ymax 119 --font-size 13 \begin{subfigure}[b]{0.48\textwidth} \centering \includegraphics[width=\textwidth]{figures/d-cliques-cifar10-linear-comparison-to-non-clustered-topologies} - \caption{LeNet Model: Convergence Speed} - \end{subfigure} - \hfill - % To regenerate the figure, from directory results/cifar10 -% python ../../../learn-topology/tools/plot_convergence.py fully-connected-cliques/all/2021-03-10-13:58:57-CET no-init-no-clique-avg/fully-connected-cliques/all/2021-03-13-18:34:35-CET random-10/all/2021-03-17-20:30:03-CET random-10-diverse/all/2021-03-17-20:30:41-CET random-10-diverse-unbiased-gradient/all/2021-03-17-20:31:14-CET random-10-diverse-unbiased-gradient-uniform-init/all/2021-03-17-20:31:41-CET --labels 'd-clique (fcc) clique avg., uniform init.' 'd-clique (fcc) no clique avg. no uniform init.' '10 random edges' '10 random edges (all classes repr.)' '10 random (all classes repr.) with unbiased grad.' '10 random (all classes repr.) with unbiased grad., uniform init.' 
--add-min-max --legend 'upper right' --yaxis scattering --save-figure ../../figures/d-cliques-cifar10-linear-comparison-to-non-clustered-topologies-scattering.png --ymax 0.7
-    \begin{subfigure}[b]{0.48\textwidth}
-        \centering
-        \includegraphics[width=\textwidth]{figures/d-cliques-cifar10-linear-comparison-to-non-clustered-topologies-scattering}
-        \caption{\label{fig:d-cliques-cifar10-linear-comparison-to-non-clustered-topologies-scattering} LeNet Model: Scattering}
-    \end{subfigure}
-
-    \caption{\label{fig:d-cliques-cifar10-linear-comparison-to-non-clustered-topologies} CIFAR10: Comparison to non-Clustered Topologies}
-    \end{figure}
+        \caption{CIFAR10 with LeNet}
+    \end{subfigure}
+    \caption{\label{fig:d-cliques-comparison-to-non-clustered-topologies} Comparison to Non-Clustered Topologies}
+\end{figure}

+For CIFAR10, the result is more dramatic, as Clique Averaging is critical for convergence (with momentum). All random topologies fail to converge, except when combining both node diversity and unbiased gradient, but in any case D-Cliques with Clique Averaging converges significantly faster. This suggests that clustering helps reduce variance between nodes and therefore improves convergence speed. We also tried LeNet on MNIST to see whether the difference between MNIST and CIFAR10 could be attributed to the capacity difference between the linear and convolutional networks, whose optimization may benefit from clustering (see Appendix). The difference is less dramatic than for CIFAR10, so the dataset itself must also have an impact; its exact nature remains an open question.

-\begin{itemize}
-    \item Clustering does not seem to make a difference in MNIST, even when using a higher-capacity model (LeNet) instead of a linear model. (Fig.\ref{fig:d-cliques-mnist-comparison-to-non-clustered-topologies})
-    \item Except for the random 10 topology, convergence speed seems to be correlated with scattering in CIFAR-10 with LeNet model (Fig.\ref{fig:d-cliques-cifar10-linear-comparison-to-non-clustered-topologies}). There is also more difference between topologies both in convergence speed and scattering than for MNIST (Fig.~\ref{fig:d-cliques-mnist-comparison-to-non-clustered-topologies}). Scattering computed similar to Consensus Control for Decentralized Deep Learning~\cite{kong2021consensus}.
-\end{itemize}

\section{Importance of Intra-Clique Full Connectivity}

-\subsection{Is it important to maintain full connectivity within cliques?}

+Having established that clustering, i.e. the creation of cliques, has a significant effect, we evaluate the necessity of intra-clique full connectivity. Figure~\ref{fig:d-cliques-intra-connectivity} shows the convergence speed of D-Cliques with respectively 1 and 5 edges randomly removed per clique, out of the 45 intra-clique edges of a clique of 10 nodes (2 and 10 out of 90 if counting both directions separately), as well as with and without Clique Averaging (without which the average gradient within cliques is biased). In all cases, both for MNIST and CIFAR10, removing edges has a significant effect on convergence speed. In the case of CIFAR10, it also negates the benefits of D-Cliques. Full connectivity within cliques is therefore necessary. 
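% For reference, a sketch of how such a perturbation can be generated (names are ours):
% remove r random intra-clique edges per clique before assigning weights.
%
% import random
% from itertools import combinations
%
% def remove_intraclique_edges(cliques, r):
%     edges = set()
%     for c in cliques:
%         intra = [frozenset(e) for e in combinations(c, 2)]  # 45 edges if |c| = 10
%         edges |= set(intra) - set(random.sample(intra, r))  # r = 1 or 5 above
%     return edges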
\begin{figure}[htbp] + \centering + +\begin{subfigure}[htbp]{0.48\textwidth} \centering % To regenerate the figure, from directory results/mnist -% python ../../../learn-topology/tools/plot_convergence.py no-init/fully-connected-cliques/all/2021-03-12-11:12:01-CET rm-1-edge/all/2021-03-18-17:28:27-CET rm-5-edges/all/2021-03-18-17:29:10-CET rm-1-edge-unbiased-grad/all/2021-03-18-17:28:47-CET rm-5-edges-unbiased-grad/all/2021-03-18-17:29:36-CET --add-min-max --ymin 85 --ymax 92.5 --legend 'lower right' --yaxis test-accuracy --labels 'fcc with clique grad.' 'fcc -1 edge/clique, no clique avg.' 'fcc -5 edges/clique, no clique avg.' 'fcc -1 edge/clique, clique avg.' 'fcc -5 edges/clique, clique avg.' --save-figure ../../figures/d-cliques-mnist-clique-clustering-fcc.png - \includegraphics[width=0.65\textwidth]{figures/d-cliques-mnist-clique-clustering-fcc} -\caption{\label{fig:d-cliques-mnist-clique-clustering} MNIST: Effect of Relaxed Intra-Clique Connectivity.} -\end{figure} - -\begin{figure}[htbp] +% python ../../../learn-topology/tools/plot_convergence.py no-init/fully-connected-cliques/all/2021-03-12-11:12:01-CET rm-1-edge/all/2021-03-18-17:28:27-CET rm-5-edges/all/2021-03-18-17:29:10-CET rm-1-edge-unbiased-grad/all/2021-03-18-17:28:47-CET rm-5-edges-unbiased-grad/all/2021-03-18-17:29:36-CET --add-min-max --ymin 85 --ymax 92.5 --legend 'lower right' --yaxis test-accuracy --labels 'fcc with clique grad.' 'fcc -1 edge/clique, no clique avg.' 'fcc -5 edges/clique, no clique avg.' 'fcc -1 edge/clique, clique avg.' 'fcc -5 edges/clique, clique avg.' --save-figure ../../figures/d-cliques-mnist-clique-clustering-fcc.png --font-size 13 + \includegraphics[width=\textwidth]{figures/d-cliques-mnist-clique-clustering-fcc} +\caption{\label{fig:d-cliques-mnist-clique-clustering} MNIST} +\end{subfigure} +\hfill +\begin{subfigure}[htbp]{0.48\textwidth} \centering % To regenerate the figure, from directory results/cifar10 -% python ../../../learn-topology/tools/plot_convergence.py no-init/fully-connected-cliques/all/2021-03-13-18:32:55-CET rm-1-edge/all/2021-03-18-17:29:58-CET rm-5-edges/all/2021-03-18-17:30:38-CET rm-1-edge-unbiased-grad/all/2021-03-18-17:30:17-CET rm-5-edges-unbiased-grad/all/2021-03-18-17:31:04-CET --add-min-max --ymax 80 --legend 'upper left' --yaxis test-accuracy --labels 'fcc, clique grad.' 'fcc -1 edge/clique, no clique grad.' 'fcc -5 edges/clique, no clique grad.' 'fcc -1 edge/clique, clique grad.' 'fcc -5 edges/clique, clique grad.' --save-figure ../../figures/d-cliques-cifar10-clique-clustering-fcc.png - \includegraphics[width=0.65\textwidth]{figures/d-cliques-cifar10-clique-clustering-fcc} - %\caption{\label{fig:d-cliques-cifar10-clique-clustering-fcc} Fully-Connected D-Cliques} -\caption{\label{fig:d-cliques-cifar10-clique-clustering} CIFAR10: Effect of Relaxed Intra-Clique Connectivity.} +% python ../../../learn-topology/tools/plot_convergence.py no-init/fully-connected-cliques/all/2021-03-13-18:32:55-CET rm-1-edge/all/2021-03-18-17:29:58-CET rm-5-edges/all/2021-03-18-17:30:38-CET rm-1-edge-unbiased-grad/all/2021-03-18-17:30:17-CET rm-5-edges-unbiased-grad/all/2021-03-18-17:31:04-CET --add-min-max --ymax 80 --legend 'upper left' --yaxis test-accuracy --labels 'fcc, clique grad.' 'fcc -1 edge/clique, no clique grad.' 'fcc -5 edges/clique, no clique grad.' 'fcc -1 edge/clique, clique grad.' 'fcc -5 edges/clique, clique grad.' 
--save-figure ../../figures/d-cliques-cifar10-clique-clustering-fcc.png --font-size 13
+ \includegraphics[width=\textwidth]{figures/d-cliques-cifar10-clique-clustering-fcc}
+\caption{\label{fig:d-cliques-cifar10-clique-clustering} CIFAR10}
+\end{subfigure}
+
+\caption{\label{fig:d-cliques-intra-connectivity} Importance of Intra-Clique Full-Connectivity}
\end{figure}
-
+\section{Scaling with Different Inter-Clique Topologies}
+\label{section:interclique-topologies}
+We finally evaluate the effect of the inter-clique topology on convergence speed, on a larger network of 1000 nodes. We divide the batch size by 10 so that the number of updates per epoch remains constant compared to the previous results for 100 nodes. We compare the scalability and resulting convergence speed of different schemes based on D-Cliques, all therefore using $O(nc)$ edges to create cliques as a foundation, where $n$ is the number of nodes and $c$ is the size of a clique.
-\clearpage
+First, the scheme that uses the fewest extra edges (almost\footnote{A path uses one edge less but converges significantly more slowly, and is therefore never used in practice.}) is a \textit{ring}. A ring adds $\frac{n}{c}$ inter-clique edges and therefore scales linearly in $O(n)$.
-\subsection{Scaling behaviour as the number of nodes increases?}
-
- \begin{figure}[htbp]
- \centering
- % To regenerate the figure, from directory results/scaling
-% python ../../../learn-topology/tools/plot_convergence.py 10/mnist/fully-connected-cliques/all/2021-03-12-09:13:27-CET ../mnist/fully-connected-cliques/all/2021-03-10-10:19:44-CET 1000/mnist/fully-connected-cliques/all/2021-03-14-17:56:26-CET --labels '10 nodes bsz=1280' '100 nodes bsz=128' '1000 nodes bsz=13' --legend 'lower right' --yaxis test-accuracy --save-figure ../../figures/d-cliques-mnist-scaling-fully-connected-cst-updates.png --ymin 80 --add-min-max
+Second, and surprisingly to us, another scheme, which we call ``\textit{fractal}'', also scales linearly while providing a logarithmic bound on the average shortest number of hops between nodes. In this scheme, as the number of nodes grows, cliques are assembled into larger groups of $c$ cliques that are connected internally with one edge per pair of cliques, but with only one edge between pairs of larger groups. The scheme is recursive: $c$ groups themselves form a larger group at the next level up. If edges are evenly distributed among nodes, this scheme results in a bounded number of edges per node, i.e. $O(nc)$ edges in total, and therefore also scales linearly in the number of nodes.
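+To make the ring and fractal constructions concrete, the following minimal Python sketch builds their inter-clique edges. The representation (cliques as lists of node ids, undirected edges as sets of two ids) and the arbitrary choice of endpoint nodes are assumptions of ours; as noted above, the actual schemes assume edges are evenly distributed among nodes.
+\begin{verbatim}
+def ring(cliques):
+    # One inter-clique edge per consecutive pair of cliques on the
+    # ring: n/c edges in total.
+    m = len(cliques)
+    return {frozenset((cliques[i][0], cliques[(i + 1) % m][0]))
+            for i in range(m)}
+
+def fractal(cliques, c):
+    # Recursively assemble units (initially single cliques) into
+    # groups of c: one edge per pair of units within a group, then
+    # each group becomes a single unit at the next level up.
+    edges = set()
+    units = [list(clique) for clique in cliques]
+    while len(units) > 1:
+        next_units = []
+        for s in range(0, len(units), c):
+            group = units[s:s + c]
+            for i in range(len(group)):
+                for j in range(i + 1, len(group)):
+                    edges.add(frozenset((group[i][0], group[j][0])))
+            next_units.append([v for unit in group for v in unit])
+        units = next_units
+    return edges
+\end{verbatim}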
- \begin{subfigure}[b]{0.7\textwidth}
- \centering
- \includegraphics[width=\textwidth]{figures/d-cliques-mnist-scaling-fully-connected-cst-updates}
- \caption{Fully-Connected (Cliques), $O(\frac{n^2}{c^2} + nc)$ edges}
- \end{subfigure}
-
- % To regenerate the figure, from directory results/scaling
-% python ../../../learn-topology/tools/plot_convergence.py 10/mnist/clique-ring/all/2021-03-13-18:22:01-CET ../mnist/fully-connected-cliques/all/2021-03-10-10:19:44-CET 1000/mnist/fractal-cliques/all/2021-03-14-17:41:59-CET --labels '10 nodes bsz=1280' '100 nodes bsz=128' '1000 nodes bsz=13' --legend 'lower right' --yaxis test-accuracy --save-figure ../../figures/d-cliques-mnist-scaling-fractal-cliques-cst-updates.png --ymin 80 --add-min-max
- \begin{subfigure}[b]{0.7\textwidth}
- \centering
- \includegraphics[width=\textwidth]{figures/d-cliques-mnist-scaling-fractal-cliques-cst-updates}
- \caption{Fractal, $O(nc)$ edges}
- \end{subfigure}
+Third, cliques may also be connected in a smallworld-like~\cite{watts2000small} topology, reminiscent of distributed hash table designs such as Chord (CITE). In this scheme, cliques are first arranged in a ring as in the first scheme. Each clique then adds one edge, symmetrically both clockwise and counter-clockwise on the ring, to the $k$ closest cliques in sets of cliques that grow exponentially bigger the further away they are on the ring, as detailed in Algorithm~\ref{Algorithm:Smallworld}. This ensures good clustering with cliques that are close on the ring, while keeping the average shortest path small (including to nodes further away on the ring). This scheme adds $2k\log(\frac{n}{c})$ inter-clique edges per clique, i.e. $O(\frac{n}{c}\log(\frac{n}{c}))$ in total, and therefore grows almost linearly with the number of nodes.
-
- % To regenerate the figure, from directory results/scaling
-% python ../../../learn-topology/tools/plot_convergence.py 10/mnist/clique-ring/all/2021-03-13-18:22:01-CET ../mnist/clique-ring/all/2021-03-10-18:14:35-CET 1000/mnist/clique-ring/all/2021-03-13-18:22:36-CET --labels '10 nodes bsz=1280' '100 nodes bsz=128' '1000 nodes bsz=13' --legend 'lower right' --yaxis test-accuracy --save-figure ../../figures/d-cliques-mnist-scaling-clique-ring-cst-updates.png --ymin 80 --add-min-max
- \begin{subfigure}[b]{0.7\textwidth}
- \centering
- \includegraphics[width=\textwidth]{figures/d-cliques-mnist-scaling-clique-ring-cst-updates}
- \caption{Ring, $O(n)$ edges}
- \end{subfigure}
-
- \caption{\label{fig:d-cliques-mnist-scaling-fully-connected} MNIST: D-Clique Scaling Behaviour, where $n$ is the number of nodes, and $c$ the size of a clique (Constant Updates per Epoch).}
- \end{figure}
-
- \begin{figure}[htbp]
- \centering
-
- % To regenerate the figure, from directory results/scaling
-% python ../../../learn-topology/tools/plot_convergence.py ../cifar10/1-node-iid/all/2021-03-10-13:52:58-CET 10/cifar10/fully-connected-cliques/all/2021-03-13-19:06:02-CET ../cifar10/fully-connected-cliques/all/2021-03-10-13:58:57-CET 1000/cifar10/fully-connected-cliques/all/2021-03-14-17:41:20-CET --labels '1 node IID bsz=2000' '10 nodes non-IID bsz=200' '100 nodes non-IID bsz=20' '1000 nodes non-IID bsz=2' --legend 'lower right' --yaxis test-accuracy --save-figure ../../figures/d-cliques-cifar10-scaling-fully-connected-cst-updates.png --add-min-max
+\begin{algorithm}[h]
+ \caption{$\textit{smallworld}(DC)$: adds $O(\#DC \log(\#DC))$ edges}
+ \label{Algorithm:Smallworld}
+ \begin{algorithmic}[1]
+ \State \textbf{Require} Set of cliques $DC$ (set of set of nodes), size of neighbourhood
$ns$ (default 2), function $\textit{least\_edges}(S, E)$ that returns one of the nodes in $S$ with the least number of edges in $E$
+ \State $E \leftarrow \emptyset$ \Comment{Set of Edges}
+ \State $L \leftarrow [ C~\text{for}~C \in DC ]$ \Comment{Arrange cliques in a list}
+ \For{$i \in \{1,\dots,\#DC\}$} \Comment{For every clique}
+ \State \Comment{For sets of cliques exponentially further away from $i$}
+ \For{$\textit{offset} \in \{ 2^x~\text{for}~x~\in \{ 0, \dots, \lceil \log_2(\#DC) \rceil \} \}$}
+ \State \Comment{Pick the $ns$ closest}
+ \For{$k \in \{0,\dots,ns-1\}$}
+ \State \Comment{Add inter-clique connections in both directions}
+ \State $n \leftarrow \textit{least\_edges}(L_i, E)$
+ \State $m \leftarrow \textit{least\_edges}(L_{(i+\textit{offset}+k) \% \#DC}, E)$ \Comment{clockwise in ring}
+ \State $E \leftarrow E \cup \{ (n,m), (m,n) \}$
+ \State $n \leftarrow \textit{least\_edges}(L_i, E)$
+ \State $m \leftarrow \textit{least\_edges}(L_{(i-\textit{offset}-k) \% \#DC}, E)$ \Comment{counter-clockwise in ring}
+ \State $E \leftarrow E \cup \{ (n,m), (m,n) \}$
+ \EndFor
+ \EndFor
+ \EndFor
+ \State \Return $E$
+ \end{algorithmic}
+\end{algorithm}
- \begin{subfigure}[b]{0.7\textwidth}
+Finally, we can also fully connect cliques together, which bounds the shortest path between any pair of nodes to at most $3$ hops. This adds $\frac{n}{c}(\frac{n}{c} - 1)$ edges (counting both directions), which scales quadratically in the number of nodes, in $O(\frac{n^2}{c^2})$; this can be significant at larger scales when $n$ is large compared to $c$.
+
+Figure~\ref{fig:d-cliques-1000-nodes} shows convergence speeds for all schemes, both on MNIST and CIFAR10, compared to a single IID node performing the same number of updates per epoch (which shows the convergence speed achievable if the topology had no impact). A ring converges but is much slower. Our ``fractal'' scheme helps significantly, while still scaling linearly in the number of nodes. But the sweet spot seems to be the smallworld topology: its convergence speed is almost the same as with fully-connected cliques, yet it uses 22\% fewer edges at this scale (14.5 edges on average instead of 18.9), with potentially larger benefits at larger scales. Nonetheless, even fully connecting cliques offers significant benefits with 1000 nodes: it represents a 98\% reduction in the number of edges compared to fully connecting individual nodes (18.9 edges on average instead of 999) and a 96\% reduction in the number of messages (37.8 messages per round per node on average instead of 999).
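+To unpack these averages: with $n = 1000$ and $c = 10$, each node has $c - 1 = 9$ intra-clique edges, and fully connecting the $\frac{n}{c} = 100$ cliques adds $\frac{100 \times 99}{2} = 4950$ inter-clique edges, i.e. $9900$ edge endpoints spread over $1000$ nodes, or $9.9$ per node on average, for a total of $9 + 9.9 = 18.9$ edges per node; compared to the $999$ edges per node of a fully-connected network, this is a reduction of $1 - \frac{18.9}{999} \approx 98\%$.
+
+For readers who prefer executable code, the following Python sketch mirrors Algorithm~\ref{Algorithm:Smallworld}; the representation of cliques as lists of node ids and the helper names are assumptions of ours, not the actual implementation.
+\begin{verbatim}
+import math
+from collections import defaultdict
+
+def smallworld(cliques, ns=2):
+    # cliques: list of lists of node ids (the set DC of Algorithm 1).
+    m = len(cliques)           # number of cliques (#DC)
+    edges = set()              # edges as pairs, one per direction
+    degree = defaultdict(int)  # edge count per node, for least_edges
+
+    def least_edges(clique):
+        # node of the clique with the fewest edges so far
+        return min(clique, key=lambda v: degree[v])
+
+    def connect(a, b):
+        if (a, b) not in edges:
+            edges.update({(a, b), (b, a)})
+            degree[a] += 1
+            degree[b] += 1
+
+    for i in range(m):  # for every clique
+        # sets of cliques exponentially further away on the ring
+        for x in range(math.ceil(math.log2(m)) + 1):
+            offset = 2 ** x
+            for k in range(ns):  # pick the ns closest, both directions
+                connect(least_edges(cliques[i]),
+                        least_edges(cliques[(i + offset + k) % m]))
+                connect(least_edges(cliques[i]),
+                        least_edges(cliques[(i - offset - k) % m]))
+    return edges
+\end{verbatim}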
+ +\begin{figure}[htbp] + \centering + % To regenerate the figure, from directory results/mnist + % python ../../../learn-topology/tools/plot_convergence.py 1-node-iid/all/2021-03-10-09:20:03-CET ../scaling/1000/mnist/fully-connected-cliques/all/2021-03-14-17:56:26-CET ../scaling/1000/mnist/smallworld-logn-cliques/all/2021-03-23-21:45:39-CET ../scaling/1000/mnist/fractal-cliques/all/2021-03-14-17:41:59-CET ../scaling/1000/mnist/clique-ring/all/2021-03-13-18:22:36-CET --add-min-max --yaxis test-accuracy --legend 'lower right' --ymin 84 --ymax 92.5 --labels '1 node IID' 'd-cliques (fully-connected cliques)' 'd-cliques (smallworld)' 'd-cliques (fractal)' 'd-cliques (ring)' --save-figure ../../figures/d-cliques-mnist-1000-nodes-comparison.png --font-size 13 + \begin{subfigure}[b]{0.48\textwidth} \centering - \includegraphics[width=\textwidth]{figures/d-cliques-cifar10-scaling-fully-connected-cst-updates} - \caption{Fully-Connected (Cliques), $O(\frac{n^2}{c^2} + nc)$ edges} + \includegraphics[width=\textwidth]{figures/d-cliques-mnist-1000-nodes-comparison} + \caption{\label{fig:d-cliques-mnist-1000-nodes-comparison} MNIST with Linear} \end{subfigure} - - % To regenerate the figure, from directory results/scaling -% python ../../../learn-topology/tools/plot_convergence.py ../cifar10/1-node-iid/all/2021-03-10-13:52:58-CET 10/cifar10/fully-connected-cliques/all/2021-03-13-19:06:02-CET ../cifar10/fully-connected-cliques/all/2021-03-10-13:58:57-CET 1000/cifar10/fractal-cliques/all/2021-03-14-17:42:46-CET --labels '1 node IID bsz=2000' '10 nodes non-IID bsz=200' '100 nodes non-IID bsz=20' '1000 nodes non-IID bsz=2' --legend 'lower right' --yaxis test-accuracy --save-figure ../../figures/d-cliques-cifar10-scaling-fractal-cliques-cst-updates.png --add-min-max - \begin{subfigure}[b]{0.7\textwidth} - \centering - \includegraphics[width=\textwidth]{figures/d-cliques-cifar10-scaling-fractal-cliques-cst-updates} - \caption{Fractal, $O(nc)$ edges} - \end{subfigure} - - - % To regenerate the figure, from directory results/scaling -% python ../../../learn-topology/tools/plot_convergence.py ../cifar10/1-node-iid/all/2021-03-10-13:52:58-CET 10/cifar10/fully-connected-cliques/all/2021-03-13-19:06:02-CET ../cifar10/clique-ring/all/2021-03-10-11:58:43-CET 1000/cifar10/clique-ring/all/2021-03-14-09:55:24-CET --labels '1 node IID bsz=2000' '10 nodes non-IID bsz=200' '100 nodes non-IID bsz=20' '1000 nodes non-IID bsz=2' --legend 'lower right' --yaxis test-accuracy --save-figure ../../figures/d-cliques-cifar10-scaling-clique-ring-cst-updates.png --add-min-max - \begin{subfigure}[b]{0.7\textwidth} + \hfill + % To regenerate the figure, from directory results/cifar10 +% python ../../../learn-topology/tools/plot_convergence.py 1-node-iid/all/2021-03-10-13:52:58-CET ../scaling/1000/cifar10/fully-connected-cliques/all/2021-03-14-17:41:20-CET ../scaling/1000/cifar10/smallworld-logn-cliques/all/2021-03-23-22:13:57-CET ../scaling/1000/cifar10/fractal-cliques/all/2021-03-14-17:42:46-CET ../scaling/1000/cifar10/clique-ring/all/2021-03-14-09:55:24-CET --add-min-max --yaxis test-accuracy --labels '1-node IID' 'd-cliques (fully-connected cliques)' 'd-cliques (smallworld)' 'd-cliques (fractal)' 'd-cliques (ring)' --legend 'lower right' --save-figure ../../figures/d-cliques-cifar10-1000-vs-1-node-test-accuracy.png --font-size 13 + \begin{subfigure}[b]{0.48\textwidth} \centering - \includegraphics[width=\textwidth]{figures/d-cliques-cifar10-scaling-clique-ring-cst-updates} - \caption{Ring, $O(n)$ edges} - \end{subfigure} - - 
\caption{\label{fig:d-cliques-cifar10-scaling-fully-connected} CIFAR10: D-Clique Scaling Behaviour, where $n$ is the number of nodes, and $c$ the size of a clique (Constant Updates per Epoch).}
- \end{figure}
+ \includegraphics[width=\textwidth]{figures/d-cliques-cifar10-1000-vs-1-node-test-accuracy}
+\caption{\label{fig:d-cliques-cifar10-1000-vs-1-node-test-accuracy} CIFAR10 with LeNet}
+ \end{subfigure}
+\caption{\label{fig:d-cliques-1000-nodes} D-Cliques Convergence Speed with 1000 nodes, non-IID, Constant Updates per Epoch, with Different Inter-Clique Topologies.}
+\end{figure}

\section{Related Work}

@@ -764,5 +685,137 @@ non-IID data.
 \end{subfigure}
 \caption{\label{fig:d-cliques-cifar10-init-clique-avg-effect} CIFAR10: Effects of Clique Averaging and Uniform Initialization on Convergence Speed. (100 nodes, non-IID, D-Cliques, bsz=20)}
 \end{figure}
+
+\begin{figure}[htbp]
+ \centering
+% To regenerate the figure, from directory results/mnist
+% python ../../../learn-topology/tools/plot_convergence.py 1-node-iid/all/2021-03-10-09:20:03-CET fully-connected/all/2021-03-10-09:25:19-CET clique-ring/all/2021-03-10-18:14:35-CET fully-connected-cliques/all/2021-03-10-10:19:44-CET --add-min-max --yaxis test-accuracy --labels '1-node IID bsz=12800' '100-nodes non-IID fully-connected bsz=128' '100-nodes non-IID D-Cliques (Ring) bsz=128' '100-nodes non-IID D-Cliques (Fully-Connected) bsz=128' --legend 'lower right' --ymin 85 --ymax 92.5 --save-figure ../../figures/d-cliques-mnist-vs-1-node-test-accuracy.png
+ \centering
+ \includegraphics[width=0.7\textwidth]{figures/d-cliques-mnist-vs-1-node-test-accuracy}
+ \caption{\label{fig:d-cliques-mnist-linear-w-clique-averaging-w-initial-averaging} MNIST: D-Cliques Convergence Speed (100 nodes, Constant Updates per Epoch)}
+\end{figure}
+
+ \begin{figure}[htbp]
+ \centering
+ % To regenerate the figure, from directory results/cifar10
+% python ../../../learn-topology/tools/plot_convergence.py 1-node-iid/all/2021-03-10-13:52:58-CET clique-ring/all/2021-03-10-11:58:43-CET fully-connected-cliques/all/2021-03-10-13:58:57-CET --add-min-max --yaxis training-loss --labels '1-node IID bsz=2000' '100-nodes non-IID D-Cliques (Ring) bsz=20' '100-nodes non-IID D-Cliques (Fully-Connected) bsz=20' --legend 'lower right' --save-figure ../../figures/d-cliques-cifar10-vs-1-node-training-loss.png
+ \begin{subfigure}[b]{0.48\textwidth}
+ \centering
+ \includegraphics[width=\textwidth]{figures/d-cliques-cifar10-vs-1-node-training-loss}
+\caption{\label{fig:d-cliques-cifar10-training-loss} Training Loss}
+ \end{subfigure}
+ \hfill
+ % To regenerate the figure, from directory results/cifar10
+% python ../../../learn-topology/tools/plot_convergence.py 1-node-iid/all/2021-03-10-13:52:58-CET clique-ring/all/2021-03-10-11:58:43-CET fully-connected-cliques/all/2021-03-10-13:58:57-CET --add-min-max --yaxis test-accuracy --labels '1-node IID bsz=2000' '100-nodes non-IID D-Cliques (Ring) bsz=20' '100-nodes non-IID D-Cliques (Fully-Connected) bsz=20' --legend 'lower right' --save-figure ../../figures/d-cliques-cifar10-vs-1-node-test-accuracy.png
+ \begin{subfigure}[b]{0.48\textwidth}
+ \centering
+ \includegraphics[width=\textwidth]{figures/d-cliques-cifar10-vs-1-node-test-accuracy}
+\caption{\label{fig:d-cliques-cifar10-test-accuracy} Test Accuracy}
+ \end{subfigure}
+\caption{\label{fig:d-cliques-cifar10-convolutional} D-Cliques Convergence Speed with Convolutional Network on CIFAR10 (100 nodes, Constant Updates per Epoch).}
+\end{figure}
+
+\subsection{Scaling behaviour
as the number of nodes increases?} + + \begin{figure}[htbp] + \centering + % To regenerate the figure, from directory results/scaling +% python ../../../learn-topology/tools/plot_convergence.py 10/mnist/fully-connected-cliques/all/2021-03-12-09:13:27-CET ../mnist/fully-connected-cliques/all/2021-03-10-10:19:44-CET 1000/mnist/fully-connected-cliques/all/2021-03-14-17:56:26-CET --labels '10 nodes bsz=1280' '100 nodes bsz=128' '1000 nodes bsz=13' --legend 'lower right' --yaxis test-accuracy --save-figure ../../figures/d-cliques-mnist-scaling-fully-connected-cst-updates.png --ymin 80 --add-min-max + + \begin{subfigure}[b]{0.7\textwidth} + \centering + \includegraphics[width=\textwidth]{figures/d-cliques-mnist-scaling-fully-connected-cst-updates} + \caption{Fully-Connected (Cliques), $O(\frac{n^2}{c^2} + nc)$ edges} + \end{subfigure} + + % To regenerate the figure, from directory results/scaling +% python ../../../learn-topology/tools/plot_convergence.py 10/mnist/clique-ring/all/2021-03-13-18:22:01-CET ../mnist/fully-connected-cliques/all/2021-03-10-10:19:44-CET 1000/mnist/fractal-cliques/all/2021-03-14-17:41:59-CET --labels '10 nodes bsz=1280' '100 nodes bsz=128' '1000 nodes bsz=13' --legend 'lower right' --yaxis test-accuracy --save-figure ../../figures/d-cliques-mnist-scaling-fractal-cliques-cst-updates.png --ymin 80 --add-min-max + \begin{subfigure}[b]{0.7\textwidth} + \centering + \includegraphics[width=\textwidth]{figures/d-cliques-mnist-scaling-fractal-cliques-cst-updates} + \caption{Fractal, $O(nc)$ edges} + \end{subfigure} + + + % To regenerate the figure, from directory results/scaling +% python ../../../learn-topology/tools/plot_convergence.py 10/mnist/clique-ring/all/2021-03-13-18:22:01-CET ../mnist/clique-ring/all/2021-03-10-18:14:35-CET 1000/mnist/clique-ring/all/2021-03-13-18:22:36-CET --labels '10 nodes bsz=1280' '100 nodes bsz=128' '1000 nodes bsz=13' --legend 'lower right' --yaxis test-accuracy --save-figure ../../figures/d-cliques-mnist-scaling-clique-ring-cst-updates.png --ymin 80 --add-min-max + \begin{subfigure}[b]{0.7\textwidth} + \centering + \includegraphics[width=\textwidth]{figures/d-cliques-mnist-scaling-clique-ring-cst-updates} + \caption{Ring, $O(n)$ edges} + \end{subfigure} + + \caption{\label{fig:d-cliques-mnist-scaling-fully-connected} MNIST: D-Clique Scaling Behaviour, where $n$ is the number of nodes, and $c$ the size of a clique (Constant Updates per Epoch).} + \end{figure} + + \begin{figure}[htbp] + \centering + + % To regenerate the figure, from directory results/scaling +% python ../../../learn-topology/tools/plot_convergence.py ../cifar10/1-node-iid/all/2021-03-10-13:52:58-CET 10/cifar10/fully-connected-cliques/all/2021-03-13-19:06:02-CET ../cifar10/fully-connected-cliques/all/2021-03-10-13:58:57-CET 1000/cifar10/fully-connected-cliques/all/2021-03-14-17:41:20-CET --labels '1 node IID bsz=2000' '10 nodes non-IID bsz=200' '100 nodes non-IID bsz=20' '1000 nodes non-IID bsz=2' --legend 'lower right' --yaxis test-accuracy --save-figure ../../figures/d-cliques-cifar10-scaling-fully-connected-cst-updates.png --add-min-max + + \begin{subfigure}[b]{0.7\textwidth} + \centering + \includegraphics[width=\textwidth]{figures/d-cliques-cifar10-scaling-fully-connected-cst-updates} + \caption{Fully-Connected (Cliques), $O(\frac{n^2}{c^2} + nc)$ edges} + \end{subfigure} + + % To regenerate the figure, from directory results/scaling +% python ../../../learn-topology/tools/plot_convergence.py ../cifar10/1-node-iid/all/2021-03-10-13:52:58-CET 
10/cifar10/fully-connected-cliques/all/2021-03-13-19:06:02-CET ../cifar10/fully-connected-cliques/all/2021-03-10-13:58:57-CET 1000/cifar10/fractal-cliques/all/2021-03-14-17:42:46-CET --labels '1 node IID bsz=2000' '10 nodes non-IID bsz=200' '100 nodes non-IID bsz=20' '1000 nodes non-IID bsz=2' --legend 'lower right' --yaxis test-accuracy --save-figure ../../figures/d-cliques-cifar10-scaling-fractal-cliques-cst-updates.png --add-min-max + \begin{subfigure}[b]{0.7\textwidth} + \centering + \includegraphics[width=\textwidth]{figures/d-cliques-cifar10-scaling-fractal-cliques-cst-updates} + \caption{Fractal, $O(nc)$ edges} + \end{subfigure} + + + % To regenerate the figure, from directory results/scaling +% python ../../../learn-topology/tools/plot_convergence.py ../cifar10/1-node-iid/all/2021-03-10-13:52:58-CET 10/cifar10/fully-connected-cliques/all/2021-03-13-19:06:02-CET ../cifar10/clique-ring/all/2021-03-10-11:58:43-CET 1000/cifar10/clique-ring/all/2021-03-14-09:55:24-CET --labels '1 node IID bsz=2000' '10 nodes non-IID bsz=200' '100 nodes non-IID bsz=20' '1000 nodes non-IID bsz=2' --legend 'lower right' --yaxis test-accuracy --save-figure ../../figures/d-cliques-cifar10-scaling-clique-ring-cst-updates.png --add-min-max + \begin{subfigure}[b]{0.7\textwidth} + \centering + \includegraphics[width=\textwidth]{figures/d-cliques-cifar10-scaling-clique-ring-cst-updates} + \caption{Ring, $O(n)$ edges} + \end{subfigure} + + \caption{\label{fig:d-cliques-cifar10-scaling-fully-connected} CIFAR10: D-Clique Scaling Behaviour, where $n$ is the number of nodes, and $c$ the size of a clique (Constant Updates per Epoch).} + \end{figure} + + \begin{figure} +\centering + \begin{subfigure}[htb]{0.48\textwidth} +% To regenerate the figure, from directory results/mnist/gn-lenet +% python ../../../../learn-topology/tools/plot_convergence.py no-init/all/2021-03-22-21:39:54-CET no-init-no-clique-avg/all/2021-03-22-21:40:16-CET random-10/all/2021-03-22-21:41:06-CET random-10-diverse/all/2021-03-22-21:41:46-CET random-10-diverse-unbiased-grad/all/2021-03-22-21:42:04-CET --legend 'lower right' --add-min-max --labels 'd-clique (fcc) clique avg.' 'd-clique (fcc) no clique avg.' '10 random edges' '10 random edges (all classes repr.)' '10 random edges (all classes repr.) with unbiased grad.' --ymin 80 --yaxis test-accuracy --save-figure ../../../figures/d-cliques-mnist-lenet-comparison-to-non-clustered-topologies.png + \includegraphics[width=\textwidth]{figures/d-cliques-mnist-lenet-comparison-to-non-clustered-topologies} + \caption{\label{fig:d-cliques-mnist-lenet-comparison-to-non-clustered-topologies} LeNet Model} + \end{subfigure} + \hfill + \begin{subfigure}[htb]{0.48\textwidth} +% To regenerate the figure, from directory results/mnist/gn-lenet +% python ../../../../learn-topology/tools/plot_convergence.py no-init/all/2021-03-22-21:39:54-CET no-init-no-clique-avg/all/2021-03-22-21:40:16-CET random-10/all/2021-03-22-21:41:06-CET random-10-diverse/all/2021-03-22-21:41:46-CET random-10-diverse-unbiased-grad/all/2021-03-22-21:42:04-CET --legend 'upper right' --add-min-max --labels 'd-clique (fcc) clique avg.' 'd-clique (fcc) no clique avg.' '10 random edges' '10 random edges (all classes repr.)' '10 random edges (all classes repr.) with unbiased grad.' 
--ymax 0.7 --yaxis scattering --save-figure ../../../figures/d-cliques-mnist-lenet-comparison-to-non-clustered-topologies-scattering.png
+ \includegraphics[width=\textwidth]{figures/d-cliques-mnist-lenet-comparison-to-non-clustered-topologies-scattering}
+ \caption{\label{fig:d-cliques-mnist-lenet-comparison-to-non-clustered-topologies-scattering} LeNet Model (Scattering)}
+ \end{subfigure}
+
+ \caption{\label{fig:d-cliques-mnist-comparison-to-non-clustered-topologies} MNIST: Comparison to Non-Clustered Topologies}
+\end{figure}
+
+ \begin{figure}
+ \centering
+ % To regenerate the figure, from directory results/cifar10
+% python ../../../learn-topology/tools/plot_convergence.py fully-connected-cliques/all/2021-03-10-13:58:57-CET no-init-no-clique-avg/fully-connected-cliques/all/2021-03-13-18:34:35-CET random-10/all/2021-03-17-20:30:03-CET random-10-diverse/all/2021-03-17-20:30:41-CET random-10-diverse-unbiased-gradient/all/2021-03-17-20:31:14-CET random-10-diverse-unbiased-gradient-uniform-init/all/2021-03-17-20:31:41-CET --labels 'd-clique (fcc) clique avg., uniform init.' 'd-clique (fcc) no clique avg. no uniform init.' '10 random edges' '10 random edges (all classes repr.)' '10 random (all classes repr.) with unbiased grad.' '10 random (all classes repr.) with unbiased grad., uniform init.' --add-min-max --legend 'upper right' --yaxis scattering --save-figure ../../figures/d-cliques-cifar10-linear-comparison-to-non-clustered-topologies-scattering.png --ymax 0.7
+ \begin{subfigure}[b]{0.48\textwidth}
+ \centering
+ \includegraphics[width=\textwidth]{figures/d-cliques-cifar10-linear-comparison-to-non-clustered-topologies-scattering}
+ \caption{\label{fig:d-cliques-cifar10-linear-comparison-to-non-clustered-topologies-scattering} LeNet Model: Scattering}
+ \end{subfigure}
+
+\caption{\label{fig:d-cliques-cifar10-linear-comparison-to-non-clustered-topologies} CIFAR10: Comparison to Non-Clustered Topologies}
+\end{figure}
+
+
+\begin{itemize}
+ \item Clustering does not seem to make a difference on MNIST, even when using a higher-capacity model (LeNet) instead of a linear model (Fig.~\ref{fig:d-cliques-mnist-comparison-to-non-clustered-topologies}).
+ \item Except for the random 10 topology, convergence speed seems to be correlated with scattering on CIFAR-10 with the LeNet model (Fig.~\ref{fig:d-cliques-cifar10-linear-comparison-to-non-clustered-topologies}). There is also more variation between topologies, both in convergence speed and scattering, than for MNIST (Fig.~\ref{fig:d-cliques-mnist-comparison-to-non-clustered-topologies}). Scattering is computed similarly to Consensus Control for Decentralized Deep Learning~\cite{kong2021consensus}.
+\end{itemize}
\end{document}