\maketitle% typeset the header of the contribution
%
\begin{abstract}
The convergence speed of machine learning models trained with Federated Learning is significantly affected by non-independent and identically distributed (non-IID) data partitions, even more so in a fully decentralized setting without a central server. In this paper, we show that the impact can be significantly reduced by carefully designing the communication topology. We present D-Cliques, a novel topology that reduces gradient bias by grouping nodes in cliques such that their local joint distribution is representative of the global class distribution. We show how D-Cliques can be used to successfully implement momentum, which is critical to efficiently train deep convolutional networks but otherwise detrimental in a non-IID setting. We then present an extensive experimental study on MNIST and CIFAR10 and demonstrate that D-Cliques provides a convergence speed similar to that of a fully-connected topology with a significant reduction in the number of required edges and messages. In a 1000-node topology, D-Cliques requires 98\% fewer edges and 96\% fewer total messages to achieve a similar accuracy. Our study suggests that maintaining full connectivity within cliques is necessary for fast convergence, but we show that we can reduce inter-clique connectivity by relying on a small-world topology with a logarithmic number of inter-clique edges without damaging the convergence speed. D-Cliques with a small-world topology leads to a further 22\% reduction in the number of edges in a 1000-node network, with even bigger gains expected at larger scales.
\end{abstract}
...
\end{algorithmic}
\end{algorithm}
D-PSGD can be used with a variety of models, including deep learning networks. In the rest of this paper, we use it with a linear (regression) model on MNIST, and with a deep convolutional network on CIFAR10.
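To make the steps of D-PSGD concrete, the following sketch shows one round from the point of view of a single node $i$: a stochastic gradient step on a local mini-batch followed by a weighted average of the models in the neighbourhood. The model, mixing weights and helper names are illustrative placeholders rather than the exact implementation used in our experiments.
\begin{verbatim}
import numpy as np

def gradient(w, X, y):
    # Mini-batch gradient of a least-squares linear model (illustrative stand-in).
    return X.T @ (X @ w - y) / len(y)

def dpsgd_round(models, datasets, neighbours, W, i, lr=0.1, batch_size=128):
    # One D-PSGD round from the point of view of node i (sketch only).
    # models: node id -> parameter vector; datasets: node id -> (X, y);
    # neighbours[i] includes i itself; W[i][j] is the mixing weight of j's model.
    rng = np.random.default_rng()
    X, y = datasets[i]
    batch = rng.choice(len(y), size=min(batch_size, len(y)), replace=False)
    # 1. Local stochastic gradient step on the node's own mini-batch.
    x_half = models[i] - lr * gradient(models[i], X[batch], y[batch])
    # 2. Distributed averaging of the neighbourhood's models with mixing weights W.
    #    (In the actual algorithm, all nodes take their step in parallel, so the
    #    neighbours' entries would also hold post-step models.)
    return sum(W[i][j] * (x_half if j == i else models[j]) for j in neighbours[i])
\end{verbatim}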
%\subsection{Clustering}
%
%From the perspective of one node, \textit{clustering} intuitively represents how many connections exist between its immediate neighbours. A high level of clustering means that neighbours have many edges between each other. The highest level is a \textit{clique}, where all nodes in the neighbourhood are connected to one another. Formally, the level of clustering, between $0$ and $1$, is the ratio of $\frac{\textit{nb edges between neighbours}}{\textit{nb possible edges}}$~\cite{watts2000small}.
%
\section{D-Cliques: Locally Recover IID within Cliques}
\subsection{Motivation: Bias in Gradient Averaging with Non-IID Data}
To have a preliminary intuition of the impact of non-IID data on convergence speed, examine the local neighbourhood of a single node in a grid similar to that used to obtain the results in Figure~\ref{fig:grid-IID-vs-non-IID}, as illustrated in Figure~\ref{fig:grid-iid-vs-non-iid-neighbourhood}. The color of a node, represented as a circle, corresponds to one of the 10 available classes in the dataset. In the IID setting (Figure~\ref{fig:grid-iid-neighbourhood}), each node has examples of all ten classes in equal proportions. In the other (rather extreme) non-IID case (Figure~\ref{fig:grid-non-iid-neighbourhood}), each node has examples of only a single class and nodes are distributed randomly in the grid; in neighbourhoods such as this one, nodes with examples of the same class sometimes end up adjacent to each other.
...
\label{fig:grid-iid-vs-non-iid-neighbourhood}
\end{figure}
For the sake of the argument, assume all nodes are initialized with the same model weights, which is not critical for quick convergence in an IID setting but makes the comparison easier. A single training step, from the point of view of the middle node of Figure~\ref{fig:grid-iid-vs-non-iid-neighbourhood}, is equivalent to sampling a mini-batch five times larger from the union of the local distributions of the five illustrated nodes.
In the IID case, since gradients are computed from examples of all classes, the resulting average gradient will point in a direction that lowers the loss for all classes. This is the case because the components of the gradient that would only improve the loss on a subset of the classes to the detriment of others are cancelled by similar but opposite components from other classes. Therefore only the components that improve the loss for all classes remain. There is some variance remaining from the difference between examples but in practice it has a sufficiently small impact on convergence speed that there are still benefits from parallelizing the computations.
However, in the (rather extreme) non-IID case illustrated, there are not enough nodes in the neighbourhood to remove the bias introduced by the locally represented classes. Even if all nodes start from the same model weights, they will diverge from one another according to the classes represented in their neighbourhood, more than they would in the IID case. As the distributed averaging algorithm takes several steps to converge, this variance is never fully resolved and persists across steps.\footnote{It is possible, but impractical, to compensate for this effect by averaging multiple times before the next gradient computation. In effect, this trades connectivity (number of edges) for latency to give the same convergence speed, in number of gradients computed, as a fully connected graph.} This additional variance biases subsequent gradient computations, as the gradients are computed further away from the global average, in addition to being computed from different examples. As shown in Figures~\ref{fig:ring-IID-vs-non-IID} and \ref{fig:grid-IID-vs-non-IID}, this significantly slows down convergence speed, to the point of making parallel optimization impractical.
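More formally, assume for simplicity that node $i$ averages the stochastic gradients $g_j$ of its neighbourhood $N_i$ (including itself) with uniform weights, all computed at a common model $x$ (as in the first step after identical initialization), and denote by $D_j$ the local distribution of node $j$ and by $F(x;s)$ the loss of model $x$ on example $s$ (notation introduced only for this argument). Then
\begin{equation}
\mathbb{E}\left[\frac{1}{|N_i|}\sum_{j \in N_i} g_j\right] = \nabla_x\, \mathbb{E}_{s \sim \frac{1}{|N_i|}\sum_{j \in N_i} D_j}\left[F(x;s)\right],
\end{equation}
i.e. the neighbourhood effectively optimizes the loss under the mixture of its local distributions. In the IID case this mixture is close to the global distribution, while in the single-class case it only covers the classes present in the neighbourhood, which is precisely the bias that D-Cliques removes by making the mixture within each clique representative of the global distribution.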
\subsection{D-Cliques}
If we relax the constraint of regularity, a trivial solution is a star topology, as used in most Federated Learning implementations (CITE), at the expense of high requirements on the reliability and available bandwidth of the central node. We instead propose a regular topology, built around \textit{cliques} of dissimilar nodes that are locally representative of the global distribution and connected by few links, as illustrated in Figure~\ref{fig:d-cliques-example}. D-Cliques enable a convergence speed similar to that of a fully connected topology, using a number of edges that grows sub-quadratically ($O(nc +\frac{n^2}{c^2})$, where $n$ is the number of nodes and $c$ is the size of a clique\footnote{$O((\frac{n}{c})c^2+(\frac{n}{c})^2)$, i.e. the number of cliques times the number of edges within cliques (quadratic in the size of cliques), in addition to the inter-clique edges (quadratic in the number of cliques).}) instead of quadratically in the number of nodes ($O(n^2)$), with a corresponding reduction in bandwidth usage and in the number of messages required per round of training. In practice, for the networks of size 100 we have tested, this corresponds to a 90\% reduction in the number of edges. (TODO: Do the analysis if the pattern is fractal with three levels at 1000 nodes: cliques, 10 cliques connected pairwise in a ``region'', and each ``region'' connected pairwise with other regions.)
Because the data distribution within each clique is representative of the global distribution, we can recover optimization techniques that rely on an IID assumption, even though the distributed setting as a whole is not IID. As one example, we show how momentum (CITE) can be used with D-Cliques to greatly improve the convergence speed of convolutional networks, as in a centralized IID setting, even though the technique is otherwise \textit{detrimental} in a more general non-IID setting.
Our approach rests on three main ideas:
\begin{itemize}
\item create cliques such that the clique-level distribution is representative of the global distribution;
\item connect cliques with few edges, based on the level of ``redundancy'' in the datasets;
\item decouple gradient averaging from weight averaging.
\end{itemize}
\caption{\label{fig:d-cliques-example-convergence-speed} Convergence Speed on MNIST}
\end{subfigure}
\caption{\label{fig:d-cliques-example} D-Cliques: Connected Cliques of Dissimilar Nodes, Locally Representative of the Global Distribution}
\end{figure}
The degree of \textit{skew} of local distributions $D_i$, i.e. how much the local distribution deviates from the global distribution on each node, influences the minimal size of cliques.
For classification tasks, the global distribution of classes can be computed from the distribution of class examples on the nodes with Distributed Averaging (CITE). Given the global distribution of classes, neighbours within cliques can be chosen based on a PeerSampling (CITE) service. Both services can be implemented such that they converge in a number of steps that is logarithmic in the number of nodes. It is therefore possible to obtain this information in a scalable way.
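As an illustration, the following is a minimal sketch of the synchronous distributed averaging step such a service could run on the (normalized) local class histograms; the names and signatures are placeholders rather than the implementation we use.
\begin{verbatim}
def gossip_average(values, neighbours, W, rounds=50):
    # Synchronous distributed averaging sketch: after enough rounds, every node's
    # value approaches the global average (here, of local class histograms).
    # values: node id -> numpy array; neighbours[i] includes i itself;
    # W[i][j] is the mixing weight node i gives to node j's value.
    v = dict(values)
    for _ in range(rounds):
        v = {i: sum(W[i][j] * v[j] for j in neighbours[i]) for i in v}
    return v
\end{verbatim}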
...
TODO: Algorithm for constructing the cliques (including the cases where classes are not equally represented).
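As a placeholder, the following greedy heuristic sketches one possible construction, under the assumption that each node shares its local class histogram: nodes are added one by one to the clique whose class distribution they bring closest to the global distribution. This is only an illustration, not the final algorithm.
\begin{verbatim}
import numpy as np

def build_cliques(class_histograms, global_dist, max_clique_size=10):
    # Greedy D-Cliques construction sketch (placeholder, not the final algorithm).
    # class_histograms: node id -> per-class example counts (numpy array);
    # global_dist: global class distribution (sums to 1).
    def distance_to_global(counts):
        total = counts.sum()
        return np.abs(counts / total - global_dist).sum() if total > 0 else float('inf')

    remaining = set(class_histograms)
    cliques = []
    while remaining:
        clique, counts = [], np.zeros_like(global_dist, dtype=float)
        while remaining and len(clique) < max_clique_size:
            # Add the node whose data brings the clique closest to the global distribution.
            best = min(remaining,
                       key=lambda n: distance_to_global(counts + class_histograms[n]))
            clique.append(best)
            counts = counts + class_histograms[best]
            remaining.remove(best)
        cliques.append(clique)
    return cliques
\end{verbatim}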
\section{Removing Gradient Bias from Inter-Clique Edges}
Inter-clique connections create sources of bias. The distributed averaging algorithm used by D-PSGD relies on a good choice of weights for quick convergence, for which Metropolis-Hastings (CITE) provides a reasonable and inexpensive solution by considering only the immediate neighbours of every node. However, by averaging models after a gradient step, D-PSGD effectively gives a different weight to the gradient of neighbours.
\caption{\label{fig:connected-cliques-bias} Sources of Bias in Connected Cliques: Non-uniform weights in neighbours of A (A has a higher weight); Non-uniform class representation in neighbours of B (extra green node).}
\end{figure}
Figure~\ref{fig:connected-cliques-bias} illustrates the problem with the simplest case of two cliques connected by one inter-clique edge, here connecting the green node of the left clique with the purple node of the right clique. A simple Metropolis-Hastings weight assignment such as the following:
\begin{equation}
...
\caption{\label{fig:d-clique-mnist-clique-avg} Effect of Clique Averaging on MNIST}
\end{figure}
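To summarize the decoupling concretely, the sketch below shows one way to implement a round of clique-unbiased D-PSGD: gradients are averaged over clique members only, whose joint distribution is representative of the global one, while models are still averaged over all neighbours, including inter-clique ones. Names and signatures are illustrative placeholders rather than the exact implementation used in our experiments.
\begin{verbatim}
def clique_unbiased_round(models, grads, clique, neighbours, W, i, lr=0.1):
    # One round of clique-unbiased D-PSGD for node i (illustrative sketch).
    # grads: node id -> local mini-batch gradient computed this round;
    # clique and neighbours both include i; W[i][j] is the model-averaging weight.
    # 1. Clique Averaging: gradients are averaged over the clique only, whose
    #    joint distribution is representative of the global one.
    g_i = sum(grads[j] for j in clique) / len(clique)
    # 2. Local step with the clique-unbiased gradient.
    x_half = models[i] - lr * g_i
    # 3. Model averaging over all neighbours, including inter-clique edges.
    return sum(W[i][j] * (x_half if j == i else models[j]) for j in neighbours)
\end{verbatim}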
\section{Implementing Momentum}
Momentum (CITE), which increases the magnitude of the components of the gradient that are shared between several consecutive steps, is critical for making convolutional networks converge quickly. However, it relies on mini-batches being IID; otherwise, it greatly increases the variance between nodes and is actually detrimental to convergence speed.
Momentum can easily be used with D-Cliques, simply by calculating it from the clique-unbiased average gradient $g_i^{(k)}$ of Algorithm~\ref{Algorithm:Clique-Unbiased-D-PSGD}:
\begin{equation}
v_i^{(k)}\leftarrow m v_i^{(k-1)} + g_i^{(k)}
\end{equation}
It then suffices to modify the original gradient step to use momentum:
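Assuming the gradient step of D-PSGD takes the usual form $x_i^{(k-\frac{1}{2})} \leftarrow x_i^{(k-1)} - \gamma g_i^{(k)}$ with learning rate $\gamma$ (our notation here; the exact form appears in Algorithm~\ref{Algorithm:Clique-Unbiased-D-PSGD}), the modified step simply substitutes the momentum term for the gradient:
\begin{equation}
x_i^{(k-\frac{1}{2})} \leftarrow x_i^{(k-1)} - \gamma v_i^{(k)}
\end{equation}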
In addition, it is important that all nodes are initialized with the same model value at the beginning. Otherwise, the random initialization of models introduces another source of variance that persists over many steps. In combination with D-Cliques (Algorithm~\ref{Algorithm:Clique-Unbiased-D-PSGD}), this provides the convergence results of Figure~\ref{fig:d-cliques-cifar10-convolutional}. To assess how far these results are from an ``optimal'' solution, in which the delay introduced by multiple hops between nodes is completely removed, we also show the convergence speed of a single node that computes its average gradient from all the samples obtained by all nodes in a single round. The results show that, apart from the variance introduced by the multiple hops between nodes, which slows down the distributed averaging of models, the average convergence speed is close to optimal, as obtained when the distributed average is computed exactly at every step.
\section{Scaling with Different Inter-Clique Topologies}
The \textit{"redundancy"} of the data, i.e. how much each additional example in the training set contributes to the final accuracy, influences the minimum number of connections required between cliques to reach a given convergence speed. It needs to be evaluated empirically on a learning task. In effect, redundancy is the best parallelization factor as the more redundant the dataset is, the less nodes need to communicate. For the following arguments, $n$ is the number of nodes and $c$ is the size of a clique.
For highly redundant datasets, it may be sufficient to arrange cliques in a ring. This is not specific to D-Cliques, as it is also the case with IID nodes, but it is nonetheless useful to keep in mind for D-Cliques. In this case, the number of edges will be $O(nc +\frac{n}{c})$ and therefore linear in the number of nodes $n$.
For cases with limited redundancy, nodes can be arranged such that they are at most 2 hops away from any other node in the network, to quickly propagate updates. In effect, this is equivalent to fully connecting cliques (instead of nodes). In this case, the number of edges will be $O(nc +\frac{n^2}{c^2})$, still quadratic in the number of nodes but with a strong reduction in the number of edges when $c$ is large relative to $n$ (e.g. $c \geq\frac{n}{100}$).
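As a sanity check on these orders of magnitude, the following snippet computes the exact edge counts for a fully-connected graph and for fully-connected cliques; for $n=1000$ and $c=10$ it yields 9450 edges versus 499500, i.e. roughly 98\% fewer edges and 18.9 edges per node on average, matching the numbers quoted in the abstract.
\begin{verbatim}
def edge_counts(n, c):
    # Edges of a fully-connected graph vs. D-Cliques with fully-connected cliques.
    fully_connected = n * (n - 1) // 2
    intra_clique = (n // c) * c * (c - 1) // 2     # edges inside the cliques
    inter_clique = (n // c) * ((n // c) - 1) // 2  # one edge per pair of cliques
    total = intra_clique + inter_clique
    return fully_connected, total, 2 * total / n   # last value: average degree

print(edge_counts(1000, 10))  # (499500, 9450, 18.9)
\end{verbatim}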
In between, there might be enough redundancy in the dataset to arrange cliques in a fractal/hierarchical pattern such that the maximum number of hops between nodes grows logarithmically with $n$. TODO: Complexity argument.
\begin{figure}[htbp]
\centering
% To regenerate the figure, from directory results/mnist
\caption{\label{fig:d-cliques-mnist-init-clique-avg-effect} MNIST: Effects of Clique Averaging and Uniform Initialization on Convergence Speed. (100 nodes, non-IID, D-Cliques, bsz=128)}
\end{figure}
\section{Evaluation}
\subsection{How do D-Cliques compare to similar non-clustered topologies?}
To remove the impact of particular architectural choices on our results, we use a linear classifier on MNIST (CITE). This model provides up to 92.5\% accuracy when fully converged (CITE), about 7\% less than state-of-the-art deep learning networks (CITE).
\begin{figure}[htbp]
\centering
\begin{subfigure}[htb]{0.48\textwidth}
...
\caption{\label{fig:d-cliques-mnist-comparison-to-non-clustered-topologies} MNIST: Comparison to non-Clustered Topologies}
\end{figure}
\begin{figure}[htbp]
\centering
% To regenerate the figure, from directory results/cifar10
% python ../../../learn-topology/tools/plot_convergence.py fully-connected-cliques/all/2021-03-10-13:58:57-CET no-init-no-clique-avg/fully-connected-cliques/all/2021-03-13-18:34:35-CET random-10/all/2021-03-17-20:30:03-CET random-10-diverse/all/2021-03-17-20:30:41-CET random-10-diverse-unbiased-gradient/all/2021-03-17-20:31:14-CET random-10-diverse-unbiased-gradient-uniform-init/all/2021-03-17-20:31:41-CET --labels 'd-clique (fcc) clique avg., uniform init.' 'd-clique (fcc) no clique avg. no uniform init.' '10 random edges' '10 random edges (all classes repr.)' '10 random (all classes repr.) with unbiased grad.' '10 random (all classes repr.) with unbiased grad., uniform init.' --add-min-max --legend 'upper left' --yaxis test-accuracy --save-figure ../../figures/d-cliques-cifar10-linear-comparison-to-non-clustered-topologies.png --ymax 100
\caption{\label{fig:d-cliques-cifar10-linear-comparison-to-non-clustered-topologies} CIFAR10: Comparison to non-Clustered Topologies}
\end{figure}
\begin{itemize}
\item Clustering does not seem to make a difference on MNIST, even when using a higher-capacity model (LeNet) instead of a linear model (Fig.~\ref{fig:d-cliques-mnist-comparison-to-non-clustered-topologies}).
\item Except for the random 10 topology, convergence speed seems to be correlated with scattering on CIFAR10 with the LeNet model (Fig.~\ref{fig:d-cliques-cifar10-linear-comparison-to-non-clustered-topologies}). There is also more difference between topologies, both in convergence speed and scattering, than for MNIST (Fig.~\ref{fig:d-cliques-mnist-comparison-to-non-clustered-topologies}). Scattering is computed as in Consensus Control for Decentralized Deep Learning~\cite{kong2021consensus}.
\end{itemize}
\subsection{Is it important to maintain full connectivity within cliques?}
\caption{\label{fig:d-cliques-mnist-clique-clustering} MNIST: Effect of Relaxed Intra-Clique Connectivity.}
\end{figure}
\clearpage
\begin{figure}[htbp]
\centering
% To regenerate the figure, from directory results/cifar10
\caption{\label{fig:d-cliques-cifar10-test-accuracy} Test Accuracy}
\end{subfigure}
\caption{\label{fig:d-cliques-cifar10-convolutional} D-Cliques Convergence Speed with Convolutional Network on CIFAR10 (100 nodes, Constant Updates per Epoch).}
\end{figure}
\begin{figure}[htbp]
\centering
% To regenerate the figure, from directory results/cifar10
\caption{\label{fig:d-cliques-cifar10-1000-vs-1-node-test-accuracy} Test Accuracy}
\end{subfigure}
\caption{\label{fig:d-cliques-cifar10-convolutional-1000-nodes} D-Cliques Convergence Speed with Convolutional Network on CIFAR10 (1000 nodes, Constant Updates per Epoch).}
\end{figure}
\clearpage
\subsection{How does convergence speed scale as the number of nodes increases?}
\begin{figure}[htbp]
\centering
% To regenerate the figure, from directory results/scaling
\caption{\label{fig:d-cliques-mnist-scaling-fully-connected} MNIST: D-Clique Scaling Behaviour, where $n$ is the number of nodes, and $c$ the size of a clique (Constant Updates per Epoch).}
\end{figure}
\begin{figure}[htbp]
\centering
...
\caption{\label{fig:d-cliques-cifar10-scaling-fully-connected} CIFAR10: D-Clique Scaling Behaviour, where $n$ is the number of nodes, and $c$ the size of a clique (Constant Updates per Epoch).}
\end{figure}
\begin{figure}[htbp]
\centering
% To regenerate the figure, from directory results/cifar10
\caption{\label{fig:d-cliques-cifar10-clique-clustering} CIFAR10: Effect of Relaxed Intra-Clique Connectivity.}
\end{figure}
\section{Supplementary Experiments}
\begin{itemize}
\item Test an inter-clique topology with $O(n \log n)$ edges: https://dl.acm.org/doi/10.1145/335305.335325
\item Remove the mentions of uniform initialization
\end{itemize}
\section{Related Work}
...
%
\bibliographystyle{splncs04}
\bibliography{main}
\newpage
\appendix
\section{Other Experiments}
% REMOVED: Constant Batch-size
% % To regenerate the figure, from directory results/scaling