From 1f3fe543c8e21d8e2f8ea40a54f8ad7f58cf1169 Mon Sep 17 00:00:00 2001
From: Erick Lavoie <erick.lavoie@epfl.ch>
Date: Mon, 29 Mar 2021 10:37:03 +0200
Subject: [PATCH] Fixed typos and mistakes, made anonymous

---
 main.tex | 61 ++++++++++++++++++++++++++++----------------------------
 1 file changed, 31 insertions(+), 30 deletions(-)

diff --git a/main.tex b/main.tex
index d6dc33e..bbb3d54 100644
--- a/main.tex
+++ b/main.tex
@@ -1,7 +1,3 @@
-% This is samplepaper.tex, a sample chapter demonstrating the
-% LLNCS macro package for Springer Computer Science proceedings;
-% Version 2.20 of 2017/10/04
-%
 \documentclass[runningheads]{llncs}
 %
 \usepackage[utf8]{inputenc}
@@ -36,19 +32,19 @@
 % If the paper title is too long for the running head, you can set
 % an abbreviated paper title here
 %
-\author{Aur\'elien Bellet\inst{1}\thanks{Authors in alphabetical order of last names, see Section 'Credits' for respective contributions.} \and
-Anne-Marie Kermarrec\inst{2} \and
-Erick Lavoie\inst{2}}
-%
-\authorrunning{A. Bellet, A-M. Kermarrec, E. Lavoie}
-% First names are abbreviated in the running head.
-% If there are more than two authors, 'et al.' is used.
-%
-\institute{Inria, Lille, France\\
-\email{aurelien.bellet@inria.fr} \and
-EPFL, Lausanne, Switzerland \\
-\email{\{anne-marie.kermarrec,erick.lavoie\}@epfl.ch}\\
-}
+%\author{Aur\'elien Bellet\inst{1}\thanks{Authors in alphabetical order of last names, see Section 'Credits' for respective contributions.} \and
+%Anne-Marie Kermarrec\inst{2} \and
+%Erick Lavoie\inst{2}}
+%%
+%\authorrunning{A. Bellet, A-M. Kermarrec, E. Lavoie}
+%% First names are abbreviated in the running head.
+%% If there are more than two authors, 'et al.' is used.
+%%
+%\institute{Inria, Lille, France\\
+%\email{aurelien.bellet@inria.fr} \and
+%EPFL, Lausanne, Switzerland \\
+%\email{\{anne-marie.kermarrec,erick.lavoie\}@epfl.ch}\\
+%}
 %
 \maketitle              % typeset the header of the contribution
 %
@@ -262,7 +258,7 @@ For the sake of the argument, assume all nodes are initialized with the same mod
 
 In the IID case, since gradients are computed from examples of all classes, the resulting average gradient will point in a direction that lowers the loss for all classes. This is the case because the components of the gradient that would only improve the loss on a subset of the classes to the detriment of others are cancelled by similar but opposite components from other classes. Therefore only the components that improve the loss for all classes remain. 
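+
+To make this intuition concrete, the following toy sketch (synthetic vectors rather than actual model gradients; all values are illustrative) shows that class-specific components which cancel across classes disappear from the average, leaving only the shared component:
+\begin{verbatim}
+# Toy illustration: class-specific gradient components that sum to zero
+# across classes vanish in the average; only the shared component remains.
+import numpy as np
+
+rng = np.random.default_rng(0)
+shared = rng.normal(size=10)                    # component useful to all classes
+class_specific = rng.normal(size=(10, 10))      # one biased direction per class
+class_specific -= class_specific.mean(axis=0)   # make them cancel on average
+
+per_class_grads = shared + class_specific       # gradient "seen" by each class
+avg_grad = per_class_grads.mean(axis=0)
+print(np.allclose(avg_grad, shared))            # True: only the shared part is left
+\end{verbatim}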
 
-However, in the (rather extreme) non-IID case illustrated, there are not enough nodes in the neighbourhood to remove the bias of the classes represented. Even if all nodes start from the same model weights, they will diverge from one another according to the classes represented in their neighbourhood, more than they would have had in the IID case. Moreover, as the distributed averaging algorithm takes several steps to converge, this variance is never fully resolved and the variance remains between steps.\footnote{It is possible, but impractical, to compensate for this effect by averaging multiple times before the next gradient computation. In effect, this trades connectivity (number of edges) for latency to give the same convergence speed, in number of gradients computed, as a fully connected graph.} This additional variance biases subsequent gradient computations as the gradients are computed further away from the global average, in addition to being computed from different examples. As shown in Figure~\ref{fig:ring-IID-vs-non-IID} and \ref{fig:grid-IID-vs-non-IID}, this significantly slows down convergence speed to the point of making parallel optimization impractical.
+However, in the (rather extreme) non-IID case illustrated, there are not enough nodes in the neighbourhood to remove the bias of the classes represented. Even if all nodes start from the same model weights, they will diverge from one another according to the classes represented in their neighbourhood, more than they would in the IID case. Moreover, as the distributed averaging algorithm takes several steps to converge, this variance is never fully resolved and persists across steps.\footnote{It is possible, but impractical, to compensate for this effect by averaging multiple times before the next gradient computation. In effect, this trades connectivity (number of edges) for latency to give the same convergence speed, in number of gradients computed, as a fully connected graph.} This additional variance biases subsequent gradient computations as the gradients are computed further away from the global average. As shown in Figures~\ref{fig:ring-IID-vs-non-IID} and~\ref{fig:grid-IID-vs-non-IID}, this significantly slows down convergence speed to the point of making parallel optimization impractical.
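+
+As a toy numerical illustration of the footnote's trade-off (uniform averaging weights on a 10-node ring; the values are arbitrary and not from our experiments), repeated averaging rounds shrink the disagreement between nodes without adding any edge, at the cost of additional latency:
+\begin{verbatim}
+# More averaging rounds (powers of W) reduce disagreement between nodes
+# on a sparse ring, trading latency for connectivity.
+import numpy as np
+import networkx as nx
+
+G = nx.cycle_graph(10)
+W = (np.eye(10) + nx.to_numpy_array(G)) / 3     # uniform weights, rows sum to 1
+x = np.arange(10.0)                             # initial per-node model (scalar)
+for k in (1, 5, 20):
+    spread = np.ptp(np.linalg.matrix_power(W, k) @ x)
+    print(k, round(spread, 3))                  # spread shrinks as k grows
+\end{verbatim}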
 
 \begin{algorithm}[h]
    \caption{D-Clique Construction}
@@ -290,9 +286,9 @@ However, in the (rather extreme) non-IID case illustrated, there are not enough
 
 Under our non-IID assumptions (Section~\ref{section:non-iid-assumptions}), a balanced representation of classes, similar to that of the IID case, can be recovered by modifying the topology such that each node has direct neighbours of all classes. Moreover, as we shall show in the next sections, there are benefits in ensuring the clustering of neighbours into a \textit{clique}, such that, within a clique, neighbours of a node are also directly connected. To ensure all cliques still converge to a single model, a number of inter-clique connections are introduced, established directly between nodes that are part of cliques. Because the joint local distribution $D_{\textit{clique}} = \sum_{i \in \textit{clique}} D_i$ is representative of the global distribution, similar to the IID case, a sparse topology can be used between cliques, significantly reducing the total number of edges required to obtain quick convergence. And because the number of connections required per node is low and even, this approach is well suited to decentralized federated learning. 
 
-The construction of the resulting \textit{decentralized cliques} (d-cliques) topology can be performed with Algorithm~\ref{Algorithm:D-Clique-Construction}. Essentially, each clique $C$ are constructed one at a time by selecting nodes with differing classes. Once all cliques are constructed, intra-clique and inter-clique edges are added. 
+The construction of the resulting \textit{decentralized cliques} (d-cliques) topology can be performed with Algorithm~\ref{Algorithm:D-Clique-Construction}. Essentially, each clique $C$ is constructed in turn by selecting nodes of differing classes. Once all cliques are constructed, intra-clique and inter-clique edges are added. 
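+
+As an illustration, a possible simplified instantiation of this construction is sketched below, assuming one class per node, ten classes, and one inter-clique edge between every pair of cliques; Algorithm~\ref{Algorithm:D-Clique-Construction} remains the reference:
+\begin{verbatim}
+# Simplified sketch (one class per node, one inter-clique edge per pair of
+# cliques); the construction algorithm in the paper is the reference.
+import networkx as nx
+
+def d_cliques(classes, clique_size=10):
+    by_class = {}
+    for node, c in enumerate(classes):          # group nodes by class
+        by_class.setdefault(c, []).append(node)
+    cliques = [[by_class[c].pop() for c in sorted(by_class)]
+               for _ in range(len(classes) // clique_size)]
+    G = nx.Graph()
+    for clique in cliques:                      # intra-clique: fully connect
+        G.add_edges_from(nx.complete_graph(clique).edges)
+    for i in range(len(cliques)):               # inter-clique: one edge per pair
+        for j in range(i + 1, len(cliques)):
+            G.add_edge(cliques[i][j % clique_size], cliques[j][i % clique_size])
+    return G, cliques
+
+G, cliques = d_cliques([n % 10 for n in range(100)])
+print(2 * G.number_of_edges() / 100)            # average degree: 9.9
+\end{verbatim}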
 
-Finally, weights are assigned to edges to ensure quick convergence, for this study we use Metropolis-Hasting (CITE), which while not offering optimal convergence speed in the general case, provides good convergence by taking into account the degree of immediate neighbours:
+Finally, weights are assigned to edges to ensure quick convergence. For this study we use Metropolis-Hastings (CITE) which, while not offering optimal convergence speed in the general case, provides good convergence by taking into account the degree of immediate neighbours:
 
 \begin{equation}
   W_{ij} = \begin{cases}
@@ -301,7 +297,7 @@ Finally, weights are assigned to edges to ensure quick convergence, for this stu
   \end{cases}
 \end{equation}
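+
+A minimal sketch of this weight assignment, assuming a networkx graph representation and self-weights taking the remaining mass so that rows and columns sum to one:
+\begin{verbatim}
+# Metropolis-Hastings weights for a networkx graph G (sketch).
+import numpy as np
+import networkx as nx
+
+def metropolis_hastings_weights(G):
+    nodes = sorted(G.nodes)
+    idx = {v: k for k, v in enumerate(nodes)}
+    W = np.zeros((len(nodes), len(nodes)))
+    for i, j in G.edges:
+        w = 1.0 / (max(G.degree[i], G.degree[j]) + 1)
+        W[idx[i], idx[j]] = W[idx[j], idx[i]] = w
+    for v in nodes:
+        W[idx[v], idx[v]] = 1.0 - W[idx[v]].sum()  # self-weight: remaining mass
+    return W
+\end{verbatim}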
 
-In this paper, we focus on showing the convergence benefits of such a topology for decentralized federated learning. Algorithm~\ref{Algorithm:D-Clique-Construction} therefore centrally generates the topology, which is then tested in a simulator. We expect this algorithm should be straightforward to adapt for a decentralized execution: the computation of the classes globally present, $L$, could be computed PushSum (CITE), and the section of neighbours done with PeerSampling (CITE).
+In this paper, we focus on showing the convergence benefits of such a topology for decentralized federated learning. Algorithm~\ref{Algorithm:D-Clique-Construction} therefore centrally generates the topology, which is then tested in a simulator. We expect this algorithm to be straightforward to adapt for a decentralized execution: the computation of the classes globally present, $L$, could be performed with PushSum (CITE), and the selection of neighbours done with PeerSampling (CITE).
 
 \begin{figure}[htbp]
     \centering 
@@ -347,7 +343,7 @@ Inter-clique connections create sources of bias. By averaging models after a gra
 \end{figure}
 
 Figure~\ref{fig:connected-cliques-bias} illustrates the problem in the simplest case: two cliques connected by a single inter-clique edge between the green node of the left clique and the purple node of the right clique.
-Node A will have a weight of $\frac{12}{110}$ while all of A's neighbours will have a weight of $\frac{11}{110}$, except the green node connected to B, that will have a weight of $\frac{10}{110}$. This weight assignment therefore biases the gradient towards A's class and aways from the green class. The same analysis holds for all other nodes without inter-clique edges. For node B, all neighbours and B will have weights of $\frac{1}{11}$. However, the green class is represented twice while all other classes are represented only once. This biases the gradient toward the green class. The combined effect of these two sources of bias is to increase the variance between models after a D-PSGD step of training.
+Node A will have a weight of $\frac{12}{110}$ while all of A's neighbours will have a weight of $\frac{11}{110}$, except the green node connected to B, which will have a weight of $\frac{10}{110}$. This weight assignment therefore biases the gradient towards A's class and away from the green class. The same analysis holds for all other nodes without inter-clique edges. For node B, B and all its neighbours will have weights of $\frac{1}{11}$. However, the green class is represented twice while all other classes are represented only once. This biases the gradient toward the green class. The combined effect of these two sources of bias is to increase the variance between models after a D-PSGD step of training.
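+
+These values can be checked numerically; the following sketch recomputes node A's Metropolis-Hastings weights on two 10-node cliques joined by a single inter-clique edge (the node numbering is arbitrary):
+\begin{verbatim}
+# Check of the 12/110, 11/110 and 10/110 weights above: two 10-node cliques
+# joined by one edge, seen from a node A without an inter-clique edge.
+import networkx as nx
+
+G = nx.disjoint_union(nx.complete_graph(10), nx.complete_graph(10))
+G.add_edge(9, 10)                     # green node (9) <-> node B (10)
+A = 0                                 # left-clique node with no inter-clique edge
+w = {j: 1 / (max(G.degree[A], G.degree[j]) + 1) for j in G.neighbors(A)}
+w[A] = 1 - sum(w.values())            # self-weight takes the remaining mass
+print({j: round(v * 110, 1) for j, v in sorted(w.items())})
+# {0: 12.0, 1: 11.0, ..., 8: 11.0, 9: 10.0}
+\end{verbatim}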
 
 \begin{algorithm}[h]
    \caption{Clique-Unbiased D-PSGD, Node $i$}
@@ -363,7 +359,7 @@ Node A will have a weight of $\frac{12}{110}$ while all of A's neighbours will h
    \end{algorithmic}
 \end{algorithm}
 
-We solve this problem with a clique-unbiased version of D-PSGD, listed in Algorithm~\ref{Algorithm:Clique-Unbiased-D-PSGD}: gradient averaging is decoupled from weight averaging by sending each in separate rounds of messages. Only the gradients of neighbours within the same clique are used to compute the average gradient, which provides an equal representation to all classes in the computation of the average gradient. But the models of all neighbours, including those across inter-clique edges, are used for computing the distributed average of models as in the original version.
+We solve this problem with a clique-unbiased version of D-PSGD, listed in Algorithm~\ref{Algorithm:Clique-Unbiased-D-PSGD}: gradient averaging is decoupled from weight averaging by sending each in separate rounds of messages. Only the gradients of neighbours within the same clique are used to compute the average gradient, which provides an equal representation to all classes in the computation of the average gradient. But all models of neighbours, including those across inter-clique edges, are used for computing the distributed average of models as in the original version.
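+
+A minimal synchronous-simulation sketch of this step (our reading of Algorithm~\ref{Algorithm:Clique-Unbiased-D-PSGD}; the data structures and names are illustrative, not the reference implementation):
+\begin{verbatim}
+# One synchronous round of Clique-Unbiased D-PSGD (illustrative sketch).
+# models[i], grads[i]: parameters and local mini-batch gradient of node i;
+# W: weight matrix; clique_of[i], neighbours_of[i]: sets of node indices.
+import numpy as np
+
+def clique_unbiased_round(models, grads, W, clique_of, neighbours_of, lr):
+    n = len(models)
+    # Round 1 of messages: average gradients uniformly over the clique only.
+    half = [models[i] - lr * np.mean([grads[j] for j in clique_of[i] | {i}], axis=0)
+            for i in range(n)]
+    # Round 2 of messages: weighted model averaging over *all* neighbours,
+    # including those across inter-clique edges, as in the original D-PSGD.
+    return [sum(W[i, j] * half[j] for j in neighbours_of[i] | {i})
+            for i in range(n)]
+\end{verbatim}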
 
 % To regenerate figure, from results/mnist:
 % python ../../../learn-topology/tools/plot_convergence.py fully-connected/all/2021-03-10-09:25:19-CET no-init-no-clique-avg/fully-connected-cliques/all/2021-03-12-11:12:49-CET  no-init/fully-connected-cliques/all/2021-03-12-11:12:01-CET --add-min-max --yaxis test-accuracy --labels '100 nodes non-IID fully-connected' '100 nodes non-IID d-cliques w/o clique avg.' '100 nodes non-IID w/ clique avg.' --legend 'lower right' --ymin 89 --ymax 92.5 --font-size 13 --save-figure ../../figures/d-clique-mnist-clique-avg.png
@@ -373,13 +369,13 @@ We solve this problem with a clique-unbiased version of D-PSGD, listed in Algori
 \caption{\label{fig:d-clique-mnist-clique-avg} Effect of Clique Averaging on MNIST. Y-axis starts at 89.}
 \end{figure}
 
-As illustrated in Figure~\ref{fig:d-clique-mnist-clique-avg}, this significantly reduces variance between nodes and accelerates convergence speed: the node with lowest accuracy performs as well as the average nodes when not using clique averaging. The convergence speed is now essentially identical to that obtained when fully connecting all nodes. These benefits are obtained at a higher messaging cost, double to that without clique averaging, and increases latency of a single training step by requiring two rounds of messages. Nonetheless, compared to fully connecting all nodes, the total number of messages is reduced by $\approx 80\%$. MNIST and a Linear model are relatively simple, so the next section shows to work with a harder dataset and a higher capacity model.
+As illustrated in Figure~\ref{fig:d-clique-mnist-clique-avg}, this significantly reduces variance between nodes and accelerates convergence speed: the node with the lowest accuracy performs as well as the average node does when clique averaging is not used. The convergence speed is now essentially identical to that obtained when fully connecting all nodes. These benefits come at double the messaging cost of the version without clique averaging, and at an increased latency of a single training step, since two rounds of messages are required. Nonetheless, compared to fully connecting all nodes, the total number of messages is reduced by $\approx 80\%$. MNIST and a linear model are relatively simple, so the next section shows that the approach also works with a harder dataset and a higher-capacity model.
 
 \section{Implementing Momentum with Clique Averaging}
 
 Training higher capacity models, such as a deep convolutional network, on harder datasets, such as CIFAR10, is usually done with additional optimization techniques to accelerate convergence speed in centralized settings. However, these techniques sometimes rely on an IID assumption on local distributions, which does not hold in more general cases. We show here how Clique Averaging (Section~\ref{section:clique-averaging}) easily enables the implementation of these optimization techniques in the more general non-IID setting with D-Cliques.
 
-In particular, we implement momentum (CITE), which increases the magnitude of the components of the gradient that are shared between several consecutive steps. Momentum is critical for making deep convolutional networks, such as LeNet, converge quickly. However, a simpler application of momentum in a non-IID setting can actually be detrimental. As illustrated in Figure~\ref{fig:d-cliques-cifar10-momentum-non-iid-effect}, the convergence of LeNet on CIFAR10 with momentum with 100 nodes using the d-cliques topology is so bad that the network actually fails to converge.  To put things in context, we compare the convergence speed to a single centralized IID node performing the same number of updates per epoch, therefore using a batch size 100 larger: this is essentially equivalent to completely removing the impact of the topology, non-IIDness, and decentralized averaging on the convergence speed. As shown, not using momentum gives a better convergence speed, but this is still far off from the one that would be obtained with a single centralized IID node, so momentum is actually necessary.
+In particular, we implement momentum (CITE), which increases the magnitude of the components of the gradient that are shared between several consecutive steps. Momentum is critical for making deep convolutional networks, such as LeNet, converge quickly. However, a naive application of momentum in a non-IID setting can actually be detrimental. As illustrated in Figure~\ref{fig:d-cliques-cifar10-momentum-non-iid-effect}, the convergence of LeNet on CIFAR10 with momentum on 100 nodes using the d-cliques topology is so poor that the network actually fails to converge. To put things in context, we compare the convergence speed to a single centralized IID node performing the same number of updates per epoch, therefore using a batch size 100 times larger: this is essentially equivalent to completely removing the impact of the topology, non-IIDness, and decentralized averaging on the convergence speed. As shown, not using momentum gives a better convergence speed, but this is still far from the speed that would be obtained with a single centralized IID node, so momentum is actually necessary.
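+
+A sketch of one way to combine momentum with the clique-averaged gradient (classical heavy-ball momentum; the hyper-parameter values are placeholders):
+\begin{verbatim}
+# Heavy-ball momentum applied to the clique-averaged (unbiased) gradient.
+def momentum_update(model, velocity, clique_avg_grad, lr=0.1, m=0.9):
+    velocity = m * velocity + clique_avg_grad   # momentum on the unbiased gradient
+    return model - lr * velocity, velocity
+\end{verbatim}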
 
 \begin{figure}[htbp]
     \centering 
@@ -414,9 +410,9 @@ Using momentum closes the gap, with a slightly lower convergence speed in the fi
 
  \section{Comparison to Similar Non-Clustered Topologies}
 
-We have previously shown that D-Cliques, can effectively provide similar convergence speed as a fully-connected topology and even a single IID node. We now show, in this section and the next, that the particular structure of D-Cliques is necessary. In particular, we show that similar results may not necessarily be obtained from a similar number of edges chosen at random. We therefore compare d-cliques, with and without clique averaging, to a random topology chosen such that each node has exactly 10 edges, which is similar and even slightly higher than the 9.9 edges on average of the D-Clique topology of Fig.~\ref{fig:d-cliques-figure} on 100 nodes. To better understand the effect of clustering, we also compare to a similar random topology where edges are chosen such that each node has neighbours of all possible classes but without them forming a clique. We finally also compare with an analogous of Clique Averaging, where all nodes de-bias their gradient with that of their neighbours, but since nodes do not form a clique, no node actually compute the same resulting average gradient.
+We have previously shown that D-Cliques can effectively provide a convergence speed similar to that of a fully-connected topology, and even of a single IID node. We now show, in this section and the next, that the particular structure of D-Cliques is necessary. In particular, we show that similar results may not necessarily be obtained from a similar number of edges chosen at random. We therefore compare D-Cliques, with and without Clique Averaging, to a random topology on 100 nodes chosen such that each node has exactly 10 edges, which is similar to, and even slightly higher than, the average of 9.9 edges per node of the previous D-Clique example (Fig.~\ref{fig:d-cliques-figure}). To better understand the effect of clustering, we also compare to a similar random topology where edges are chosen such that each node has neighbours of all possible classes but without them forming a clique. We finally also compare with an analogue of Clique Averaging, where all nodes de-bias their gradient with that of their neighbours; but since nodes do not form cliques, no two nodes actually compute the same average gradient.
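+
+The random baseline can be generated, for instance, as a 10-regular random graph (sketch assuming networkx; the seed is arbitrary):
+\begin{verbatim}
+# 10-regular random topology on 100 nodes used as a baseline (sketch).
+import networkx as nx
+
+random_topology = nx.random_regular_graph(d=10, n=100, seed=1)
+print(2 * random_topology.number_of_edges() / 100)   # 10.0 edges per node
+\end{verbatim}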
 
-Results for MNIST and CIFAR10 are shown in Figure~\ref{fig:d-cliques-comparison-to-non-clustered-topologies}. For MNIST, a random topology has higher variance and lower convergence speed than D-Cliques, with or without Clique Averaging. However, a random topology with enforced diversity performs as well and even slightly better than D-Cliques without Clique Averaging. Suprisingly, a random topology with unbiased gradient performs worse  than without, but only marginally, so this does not seem quite significant. Nonetheless, the D-Cliques topology with Clique Averaging performs better than any other random topology so it seems clustering in this case has a small but significant effect.
+Results for MNIST and CIFAR10 are shown in Figure~\ref{fig:d-cliques-comparison-to-non-clustered-topologies}. For MNIST, a random topology has higher variance and lower convergence speed than D-Cliques, with or without Clique Averaging. However, a random topology with enforced diversity performs as well and even slightly better than D-Cliques without Clique Averaging. Surprisingly, a random topology with unbiased gradient performs marginally worse than without, so this difference does not seem significant. Nonetheless, the D-Cliques topology with Clique Averaging performs better than any other random topology, so it seems that clustering in this case has a small but significant effect.
 
 \begin{figure}[htbp]
      \centering     
@@ -438,7 +434,7 @@ Results for MNIST and CIFAR10 are shown in Figure~\ref{fig:d-cliques-comparison-
  \caption{\label{fig:d-cliques-comparison-to-non-clustered-topologies} Comparison to Non-Clustered Topologies} 
 \end{figure}
 
-For CIFAR10, the result is more dramatic, as Clique Averaging is critical for convergence (with momentum). All random topologies fail to converge, except when combining both node diversity and unbiased gradient, but in any case D-Cliques with Clique Averaging converges significantly faster. This suggests clustering helps reducing variance between nodes and therefore helps with convergence speed. We have tried to use LeNet on MNIST to see if the difference between MNIST and CIFAR10 could be attributed to the capacity difference between the Linear and Convolutional networks, whose optimization may benefit from clustering (see Appendix). The difference is less dramatic than for CIFAR10, so it must be that the dataset also has an impact but the exact nature of it is still an open question.
+For CIFAR10, the result is more dramatic, as Clique Averaging is critical for convergence (with momentum). All random topologies fail to converge, except when combining both node diversity and unbiased gradient, but in any case D-Cliques with Clique Averaging converges significantly faster. This suggests that clustering helps reduce variance between nodes and therefore improves convergence speed. We tried LeNet on MNIST to see whether the difference between MNIST and CIFAR10 could be attributed to the capacity difference between the linear and convolutional networks, whose optimization may benefit from clustering (see Appendix). The difference is less dramatic than for CIFAR10, so the dataset itself must also have an impact; its exact nature is still an open question.
 
 \section{Importance of Intra-Clique Full Connectivity}
 
@@ -561,7 +557,7 @@ between updates (see \cite{consensus_distance} and references therein). These
 algorithms
 typically require additional communication and/or computation.\footnote{We
 also observed that \cite{tang18a} is subject to numerical
-instabilities when run on topologies other than rings and grids. When
+instabilities when run on topologies other than rings. When
 the rows and columns of $W$ do not exactly
 sum to $1$ (due to finite precision), these small differences get amplified by
 the proposed updates and make the algorithm diverge.}\aurelien{emphasize that
@@ -612,6 +608,11 @@ non-IID data.
 
 \section{Conclusion}
 
+We have shown that the topology has a significant impact on convergence under non-IID data partitions in decentralized federated learning. We have proposed D-Cliques, a sparse topology that recovers the convergence speed and non-IID-compensating behaviour of a fully-connected topology. D-Cliques are based on assembling cliques of diverse nodes such that their joint local distribution is representative of the global distribution, essentially locally recovering IID-ness. Cliques are joined in a sparse inter-clique topology such that they quickly converge to the same model. Within cliques, Clique Averaging can be used to remove the non-IID bias in gradient computation by averaging gradients only with the other nodes of the clique. Clique Averaging can in turn be used to implement unbiased momentum to recover the convergence speed usually possible with IID mini-batches.
+
+We have shown that the clustering of nodes in D-Cliques seems to benefit convergence speed, especially on CIFAR10 with a deep convolutional network, as random topologies with the same number of edges, diversity of nodes, and even unbiased gradients converge more slowly. We have also shown that full connectivity within cliques is critical, as removing even a single edge per clique significantly slows down convergence.
+
+Finally, we have evaluated different inter-clique topologies with 1000 nodes. While they all provide a significant reduction in the number of edges compared to fully connecting all nodes, a small-world approach that scales in $O(n + \log(n))$ in the number of nodes seems to offer the most advantageous compromise between scalability and convergence speed.
 
 %\section{Future Work}
 %\begin{itemize}
@@ -623,7 +624,7 @@ non-IID data.
 %\end{itemize}
 
 
-\section{Credits}
+%\section{Credits}
 
 %
 % ---- Bibliography ----
-- 
GitLab