Commit ca75fd63 authored by Erick Lavoie

Reworked first sections

parent 1045e163
@@ -9,14 +9,16 @@
\usepackage[utf8]{inputenc}
\usepackage{amsmath}
\usepackage{amsfonts}
\usepackage{mathtools}
\usepackage{amssymb}
\usepackage{xcolor}
\usepackage{soul}
\usepackage{algorithm}
\usepackage[noend]{algorithmic}
%\usepackage[noend]{algpseudocode}
\usepackage{dsfont}
\usepackage{caption}
\usepackage{subcaption}
\usepackage{todonotes}
% hyperref makes hyperlinks in the resulting PDF.
% If your build breaks (sometimes temporarily if a hyperlink spans a page)
@@ -157,21 +159,7 @@ enough such that all participants need only to communicate with a small number of
topologies like rings or grids do not significantly affect the convergence
speed compared to using denser topologies.
\begin{figure*}[ht]
\centering
% From directory results/mnist
@@ -210,21 +198,39 @@ model with their neighbors. In this paper, we address the following question:
\label{fig:iid-vs-non-iid-problem}
\end{figure*}
In contrast to the IID case, however, our experiments demonstrate that \emph{the impact of topology is extremely significant for non-IID data}. This phenomenon is illustrated
in Figure~\ref{fig:iid-vs-non-iid-problem}: we observe that a ring or
a grid topology clearly jeopardizes the convergence speed when local
distributions do not have a relative frequency of classes similar to the global
distribution, i.e. when they exhibit \textit{local class bias}. We stress the fact
that, unlike in centralized FL
\cite{kairouz2019advances,scaffold,quagmire}, this
happens even when nodes perform a single local update before averaging the
model with their neighbors. In this paper, we address the following question:
\textit{Can we design sparse topologies with convergence
speed similar to the one obtained in a fully connected network, for
a large number of participants with local class bias?}
Specifically, we make the following contributions:
(1) We propose D-Cliques, a sparse topology in which nodes are organized in
interconnected cliques, i.e. locally fully-connected sets of nodes, such that
the joint data distribution of each clique is close to that of the global
(IID) distribution; (2) We propose Greedy Swap, an algorithm for constructing
such cliques efficiently under the class heterogeneity previously studied in the context of Federated Learning~\cite{mcmahan2016communication};
(3) We propose Clique Averaging, a modified version of
the standard D-SGD algorithm which decouples gradient averaging, used for
optimizing local models, from distributed averaging, used to ensure all models
converge, therefore reducing the bias introduced by inter-clique connections;
(4) We show how Clique Averaging can be used to implement unbiased momentum
that would otherwise be detrimental in the non-IID setting; (5) We
demonstrate
through an extensive experimental study that our approach removes the effect
of the local class bias on the MNIST~\cite{mnistWebsite} and
CIFAR10~\cite{krizhevsky2009learning} datasets, for training a linear model and a deep
convolutional network; (6) Finally, we demonstrate the scalability of our
approach by considering up to 1000-node networks, in contrast to most
previous work on fully decentralized learning that considers only a few tens
of nodes
@@ -237,23 +243,24 @@ thereby yielding a 96\% reduction in the total number of required messages
% (14.5 edges per node on average instead of 18.9)
is possible when using a small-world inter-clique topology, with further potential gains at larger scales because of its quasilinear scaling ($O(n \log(n))$) in $n$, the number of nodes.
The rest of this paper is organized as follows \dots \todo{EL: Complete once structure stabilizes}
%We first present the problem
%statement and our methodology (Section~\ref{section:problem}). The D-Cliques
%design is presented in Section~\ref{section:d-cliques}) along with an
%empirical illustration of its benefits. In
%Section~\ref{section:clique-averaging-momentum}, we
%show how to further reduce bias with Clique Averaging and how to use it to
%implement momentum. We present the results of our extensive experimental
%study in Section~\ref{section:non-clustered}. We review some related work in
% Section~\ref{section:related-work}, and conclude with promising directions
% for future work in Section~\ref{section:conclusion}.

\section{Problem Statement}
\label{section:problem}

We consider a set $N = \{1, \dots, n \}$ of $n$ nodes seeking to
collaboratively solve a classification task with a set of classes $L$. Each node has access to a local dataset that
follows its own local distribution $D_i$. The goal is to find a global model
$x$ that performs well on the union of the local distributions by minimizing
the average training loss:
@@ -294,100 +301,21 @@ must be doubly
stochastic ($\sum_{j \in N} W_{ij} = 1$ and $\sum_{j \in N} W_{ji} = 1$) and
symmetric, i.e. $W_{ij} = W_{ji}$~\cite{lian2017d-psgd}.
\begin{algorithm}[t]
   \caption{D-SGD, Node $i$}
   \label{Algorithm:D-PSGD}
   \begin{algorithmic}[1]
      \STATE \textbf{Require:} initial model parameters $x_i^{(0)}$,
      learning rate $\gamma$, mixing weights $W$, mini-batch size $m$,
      number of steps $K$
      \FOR{$k = 1,\ldots, K$}
        \STATE $s_i^{(k)} \gets \text{mini-batch sample of size $m$ drawn
        from~} D_i$
        \STATE $x_i^{(k-\frac{1}{2})} \gets x_i^{(k-1)} - \gamma \nabla F(x_i^{(k-1)}; s_i^{(k)})$
        \STATE $x_i^{(k)} \gets \sum_{j \in N} W_{ji}^{(k)} x_j^{(k-\frac{1}{2})}$
      \ENDFOR
   \end{algorithmic}
\end{algorithm}
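For illustration, one synchronous round of Algorithm~\ref{Algorithm:D-PSGD} over all nodes can be sketched in a few lines of Python. This is only a simulation-style sketch: the helpers \texttt{sample} and \texttt{grad}, as well as the dense representation of the mixing matrix $W$, are assumptions made for readability rather than part of the algorithm itself.
\begin{verbatim}
import numpy as np

def d_sgd_round(x, W, sample, grad, gamma):
    """One synchronous D-SGD round over all nodes (sketch of Algorithm D-PSGD).

    x:      array of shape (n, d), one model per node
    W:      doubly stochastic, symmetric mixing matrix of shape (n, n)
    sample: sample(i) -> mini-batch drawn from node i's local distribution D_i
    grad:   grad(x_i, batch) -> stochastic gradient of F at x_i on the batch
    gamma:  learning rate
    """
    n = x.shape[0]
    # Local SGD step on each node's own mini-batch.
    x_half = np.stack([x[i] - gamma * grad(x[i], sample(i)) for i in range(n)])
    # Average models with neighbors according to the mixing weights W.
    return W @ x_half
\end{verbatim}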
\section{D-Cliques: Creating Locally Representative Cliques}
\label{section:d-cliques}
@@ -433,17 +361,24 @@ impractical.
In D-Cliques, we address the issues of non-IIDness by carefully designing a
network topology composed of \textit{cliques} and \textit{inter-clique
connections}.

First, D-Cliques recover a balanced representation of classes, close to
that of the IID case, by constructing a topology such that each node $i \in N$ is
part of a \textit{clique} $C$ whose joint distribution $D_C = \bigcup_{i \in C} D_i$ is close to the global distribution $D = \bigcup_{i \in N} D_i$. We measure the closeness of $D_C$ to $D$ by its \textit{skew}, i.e. the sum, over all classes $l \in L$, of the absolute differences between the probability that a sample $(x,y)$ has class $l$ under $D_C$ and under $D$:
\begin{equation}
\label{eq:skew}
\begin{split}
\textit{skew}(C) =\
\sum\limits_{\substack{l \in L}} | p(y = l~|(x,y) \in D_C) - \\ p(y = l~|(x,y) \in D) |
\end{split}
\end{equation}
Second, to ensure a global consensus and convergence,
\textit{inter-clique connections}
are introduced by connecting a small number of node pairs that are
part of different cliques. In the following, we introduce up to one inter-clique
connection per node such that each clique has exactly one
edge with all other cliques; see Figure~\ref{fig:d-cliques-figure} for the
corresponding D-Cliques network in the case of $n=100$ nodes and $c=10$
classes. We will explore sparser inter-clique topologies in Section~\ref{section:interclique-topologies}.
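For illustration, the skew of Equation~\ref{eq:skew} can be computed directly from per-node class counts. The following Python sketch assumes that each node, and hence each clique, is summarized by a vector of per-class example counts; the names are illustrative only.
\begin{verbatim}
import numpy as np

def skew(clique_counts, global_counts):
    """Skew of a clique (Eq. skew): sum over classes of the absolute difference
    between the class proportions in the clique and in the global distribution.

    clique_counts: per-class example counts aggregated over the clique's nodes
    global_counts: per-class example counts over all nodes
    """
    p_clique = clique_counts / clique_counts.sum()
    p_global = global_counts / global_counts.sum()
    return np.abs(p_clique - p_global).sum()

# Example: 10 globally balanced classes, a clique holding only classes 0 and 1.
global_counts = np.full(10, 100)
clique_counts = np.array([50, 50, 0, 0, 0, 0, 0, 0, 0, 0])
print(skew(clique_counts, global_counts))  # 2*|0.5-0.1| + 8*|0-0.1| = 1.6
\end{verbatim}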
@@ -461,57 +396,61 @@ topology, namely:
\label{eq:metro}
\end{equation}
We construct D-Cliques by initializing cliques at random, using at most $M$
nodes per clique to limit intra-clique communication costs. We then
repeatedly pick two cliques at random and swap a pair of nodes between them
whenever the swap decreases the sum of their skews, keeping
the size of the cliques constant (see Algorithm~\ref{Algorithm:D-Clique-Construction}).
Since only swaps that decrease the skew are performed, this algorithm can be
seen as a form of randomized greedy algorithm. We note that it only requires
the knowledge of the local class distribution at each node. For the sake of
simplicity, we assume that D-Cliques are constructed from the global
knowledge of these distributions, which can easily be obtained by
decentralized averaging in a pre-processing step.
\begin{algorithm}[h]
   \caption{D-Cliques Construction: Greedy Swap}
   \label{Algorithm:D-Clique-Construction}
\begin{algorithmic}[1]
\STATE \textbf{Require:} Max clique size $M$, Max steps $K$,
\STATE Set of all nodes $N = \{ 1, 2, \dots, n \}$,
\STATE $\textit{skew}(S)$: skew of subset $S \subseteq N$ compared to the global distribution (Eq.~\ref{eq:skew}),
\STATE $\textit{intra}(DC)$: edges within cliques $C \in DC$,
\STATE $\textit{inter}(DC)$: edges between $C_1,C_2 \in DC$ (Sec.~\ref{section:interclique-topologies}),
\STATE $\textit{weights}(E)$: set weights to edges in $E$ (Eq.~\ref{eq:metro}).
\STATE ~~
\STATE $DC \leftarrow []$ \COMMENT{Empty list}
\WHILE {$N \neq \emptyset$}
\STATE $C \leftarrow$ sample $M$ nodes from $N$ at random
\STATE $N \leftarrow N \setminus C$; $DC.append(C)$
\ENDWHILE
\FOR{$k \in \{1, \dots, K\}$}
\STATE $C_1,C_2 \leftarrow$ sample 2 from $DC$ at random
\STATE $\textit{swaps} \leftarrow []$
\FOR{$n_1 \in C_1, n_2 \in C_2$}
           \STATE $s \leftarrow \textit{skew}(C_1) + \textit{skew}(C_2)$
\STATE $s' \leftarrow \textit{skew}(C_1-n_1+n_2) + \textit{skew}(C_2 -n_2+n_1)$
\IF {$s' < s$}
\STATE \textit{swaps}.append($(n_1, n_2)$)
\ENDIF
\ENDFOR
\IF {\#\textit{swaps} $> 0$}
\STATE $(n_1,n_2) \leftarrow$ sample 1 from $\textit{swaps}$ at random
          \STATE $C_1 \leftarrow C_1 - n_1 + n_2; C_2 \leftarrow C_2 - n_2 + n_1$
\ENDIF
\ENDFOR
      \RETURN $(\textit{weights}(\textit{intra}(DC) \cup \textit{inter}(DC)), DC)$
\end{algorithmic}
\end{algorithm}
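For concreteness, the construction can also be sketched in Python as follows, reusing the \texttt{skew} helper sketched earlier and assuming a list of per-node class-count vectors; names and data layout are illustrative rather than prescriptive.
\begin{verbatim}
import random
import numpy as np

def greedy_swap(node_counts, M, K, seed=0):
    """Randomized greedy D-Cliques construction (sketch of Greedy Swap).

    node_counts: list of per-class example count vectors, one per node
    M:           maximum clique size
    K:           number of improvement steps
    Returns a list of cliques, each a list of node indices.
    """
    rng = random.Random(seed)
    global_counts = np.sum(node_counts, axis=0)

    def clique_skew(clique):
        counts = np.sum([node_counts[i] for i in clique], axis=0)
        return skew(counts, global_counts)  # skew() as sketched above

    # Initialize cliques of (at most) M nodes at random.
    nodes = list(range(len(node_counts)))
    rng.shuffle(nodes)
    cliques = [nodes[i:i + M] for i in range(0, len(nodes), M)]

    for _ in range(K):
        c1, c2 = rng.sample(range(len(cliques)), 2)
        s = clique_skew(cliques[c1]) + clique_skew(cliques[c2])
        # Collect all single-node swaps between the two cliques that
        # decrease the sum of their skews.
        swaps = []
        for n1 in cliques[c1]:
            for n2 in cliques[c2]:
                new_c1 = [n for n in cliques[c1] if n != n1] + [n2]
                new_c2 = [n for n in cliques[c2] if n != n2] + [n1]
                if clique_skew(new_c1) + clique_skew(new_c2) < s:
                    swaps.append((n1, n2))
        if swaps:
            n1, n2 = rng.choice(swaps)
            cliques[c1] = [n for n in cliques[c1] if n != n1] + [n2]
            cliques[c2] = [n for n in cliques[c2] if n != n2] + [n1]
    return cliques
\end{verbatim}
In this sketch, as in Algorithm~\ref{Algorithm:D-Clique-Construction}, candidate swaps are recomputed from scratch at each step; see Section~\ref{section:cost-cliques} regarding the cost of construction.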
The key idea of D-Cliques is that because the clique-level distribution $D_C$
is representative of the global distribution $D$,
the local models of nodes across cliques remain rather close. Therefore, a
sparse inter-clique topology can be used, significantly reducing the total
number of edges without slowing down the convergence. Furthermore, the degree
of each node in the network remains low and even, making the D-Cliques
topology very well-suited to decentralized federated learning.
\section{Optimizing with Clique Averaging and Momentum}
\label{section:clique-averaging-momentum}
@@ -560,37 +499,23 @@ providing an equal representation to all classes. In contrast, all neighbors'
models, including those across inter-clique edges, participate in the model
averaging step as in the original version.
\begin{algorithm}[t]
   \caption{D-SGD with Clique Averaging, Node $i$}
   \label{Algorithm:Clique-Unbiased-D-PSGD}
   \begin{algorithmic}[1]
      \STATE \textbf{Require:} initial model parameters $x_i^{(0)}$, learning
      rate $\gamma$, mixing weights $W$, mini-batch size $m$, number of
      steps $K$
      \FOR{$k = 1,\ldots, K$}
        \STATE $s_i^{(k)} \gets \text{mini-batch sample of size $m$ drawn
        from~} D_i$
        \STATE $g_i^{(k)} \gets \frac{1}{|\textit{Clique}(i)|}\sum_{j \in \textit{Clique}(i)} \nabla F(x_j^{(k-1)}; s_j^{(k)})$
        \STATE $x_i^{(k-\frac{1}{2})} \gets x_i^{(k-1)} - \gamma g_i^{(k)}$
        \STATE $x_i^{(k)} \gets \sum_{j \in N} W_{ji}^{(k)} x_j^{(k-\frac{1}{2})}$
      \ENDFOR
   \end{algorithmic}
\end{algorithm}
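For illustration, the difference with plain D-SGD is confined to the gradient used in the local step. A minimal Python sketch of one round follows, reusing the simulation-style conventions of the earlier sketch; the \texttt{clique} mapping from a node to the members of its clique is an assumption of the sketch, not of the algorithm.
\begin{verbatim}
import numpy as np

def clique_avg_round(x, W, clique, sample, grad, gamma):
    """One round of D-SGD with Clique Averaging (sketch).

    Gradients are averaged only within each node's clique, while models are
    still averaged over all neighbors (including inter-clique edges) via W.
    """
    n = x.shape[0]
    # Each node computes its own stochastic gradient.
    g = np.stack([grad(x[i], sample(i)) for i in range(n)])
    # Clique Averaging: unbiased gradient averaged over the node's clique.
    g_clique = np.stack([g[list(clique(i))].mean(axis=0) for i in range(n)])
    x_half = x - gamma * g_clique
    # Model averaging over all neighbors, as in plain D-SGD.
    return W @ x_half
\end{verbatim}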
\subsection{Implementing Momentum with Clique Averaging}
\label{section:momentum}
@@ -607,6 +532,158 @@ even fails to converge. Not using momentum actually gives a faster
convergence, but there is a significant gap compared to the case of a single
IID node with momentum.
Clique Averaging (Section~\ref{section:clique-averaging})
allows us to compute an unbiased momentum from the
unbiased average gradient $g_i^{(k)}$ of Algorithm~\ref{Algorithm:Clique-Unbiased-D-PSGD}:
\begin{equation}
v_i^{(k)} \leftarrow m v_i^{(k-1)} + g_i^{(k)}
\end{equation}
It then suffices to modify the original gradient step to use momentum:
\begin{equation}
x_i^{(k-\frac{1}{2})} \leftarrow x_i^{(k-1)} - \gamma v_i^{(k)}
\end{equation}
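In the same sketch-style notation as before, this amounts to each node keeping a local momentum buffer $v_i$ fed by the clique-averaged gradient; the code below is illustrative only, with \texttt{m} the momentum parameter.
\begin{verbatim}
def momentum_step(x_i, v_i, g_i, gamma, m=0.9):
    """Per-node update: unbiased momentum on the clique-averaged gradient g_i."""
    v_i = m * v_i + g_i            # v_i^(k) = m * v_i^(k-1) + g_i^(k)
    x_half_i = x_i - gamma * v_i   # gradient step now uses the momentum buffer
    return x_half_i, v_i
\end{verbatim}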
\section{Comparative Evaluation and Extensions}
\label{section:non-clustered}
In this section, we first compare D-Cliques to alternative topologies to
confirm the relevance of our main design choices. Then,
we evaluate some extensions of D-Cliques to further reduce the number of
inter-clique connections so as to gracefully scale with the number of
nodes.
\subsection{Methodology}
\subsubsection{Non-IID assumptions.}
\label{section:non-iid-assumptions}
As demonstrated in Figure~\ref{fig:iid-vs-non-iid-problem}, lifting the
assumption of IID data significantly challenges the learning algorithm. In
this paper, we focus on an \textit{extreme case of local class bias}: we
consider that each node only has examples from a single class.
To isolate the effect of local class bias from other potentially compounding
factors, we make the following simplifying assumptions: (1) All classes are
equally represented in the global dataset; (2) All classes are represented on
the same number of nodes; (3) All nodes have the same number of examples.
We believe that these assumptions are reasonable in the context of our study
because: (1)
Global class
imbalance equally
affects the optimization process on a single node and is therefore not
specific to the decentralized setting; (2) Our results do not exploit specific
positions in the topology; (3) Imbalanced dataset sizes across nodes can be
addressed for instance by appropriately weighting the individual loss
functions. Our results can be extended to support additional compounding factors in future work.
\subsubsection{Experimental setup.}
\label{section:experimental-settings}
Our main goal is to provide a fair comparison of the convergence speed across
different topologies and algorithmic variations, in order to
show that our approach
can remove much of the effect of local class bias.
We experiment with two datasets: MNIST~\cite{mnistWebsite} and
CIFAR10~\cite{krizhevsky2009learning}, which both have $c=10$ classes.
For MNIST, we use 45k and 10k examples from the original 60k
training set for training and validation respectively. The remaining 5k
training examples were randomly removed to ensure all 10 classes are balanced
while ensuring that the dataset is evenly divisible across 100 and 1000 nodes.
We use all 10k examples of
the test set to measure prediction accuracy. For CIFAR10, classes are evenly
balanced: we use 45k/50k images of the original training set for training,
5k/50k for validation, and all 10k examples of the test set for measuring
prediction accuracy.
We
use a logistic regression classifier for MNIST, which
provides up to 92.5\% accuracy in the centralized setting.
For CIFAR10, we use a Group-Normalized variant of LeNet~\cite{quagmire}, a
deep convolutional network which achieves an accuracy of $72.3\%$ in the
centralized setting.
These models are thus reasonably accurate (which is sufficient to
study the effect of the topology) while being sufficiently fast to train in a
fully decentralized setting and simple enough to configure and analyze.
Regarding hyper-parameters, we jointly optimize the learning rate and
mini-batch size on the
validation set for 100 nodes, obtaining respectively $0.1$ and $128$ for
MNIST and $0.002$ and $20$ for CIFAR10.
For CIFAR10, we additionally use a momentum of $0.9$.
We evaluate 100- and 1000-node networks by creating multiple models in memory and simulating the exchange of messages between nodes.
To ignore the impact of distributed execution strategies and system
optimization techniques, we report the test accuracy of all nodes (min, max,
average) as a function of the number of times each example of the dataset has
been sampled by a node, i.e. an \textit{epoch}. This is equivalent to the classic case of a single node sampling the full distribution.
To further make results comparable across different numbers of nodes, we
scale the batch size inversely with the number of nodes,
e.g. on MNIST, 128 with 100 nodes vs. 13 with 1000 nodes. This
ensures the same number of model updates and averaging per epoch, which is
important to have a fair comparison.\footnote{Updating and averaging models
after every example can eliminate the impact of local class bias. However, the
resulting communication overhead is impractical.}
Finally, we compare our results against an ideal baseline: either a
fully-connected network topology with the same number of nodes or a single IID
node. In both cases, the topology has no effect on
the optimization. For a certain choice of number of nodes and
mini-batch size, both approaches are equivalent.
\subsection{Main Results}
\begin{figure}[t]
\centering
\begin{subfigure}[b]{0.20\textwidth}
\centering
\includegraphics[width=\textwidth]{../figures/fully-connected-cliques}
\caption{\label{fig:d-cliques-figure} D-Cliques (fully-connected
cliques)}
\end{subfigure}
\hfill
% To regenerate figure, from results/mnist
% python ../../../../Software/non-iid-topology-simulator/tools/plot_convergence.py fully-connected/all/2021-03-10-09:25:19-CET no-init-no-clique-avg/fully-connected-cliques/all/2021-03-12-11:12:49-CET --add-min-max --yaxis test-accuracy --ymin 80 --ymax 92.5 --labels '100 nodes non-IID fully-connected' '100 nodes non-IID d-cliques' --save-figure ../../figures/d-cliques-mnist-vs-fully-connected.png --legend 'lower right' --font-size 16 --linestyles 'solid' 'dashed'
\begin{subfigure}[b]{0.26\textwidth}
\centering
\includegraphics[width=\textwidth]{../figures/d-cliques-mnist-vs-fully-connected.png}
\caption{\label{fig:d-cliques-example-convergence-speed} Convergence Speed
on MNIST}
\end{subfigure}
\caption{\label{fig:d-cliques-example} D-Cliques topology and convergence
speed on MNIST.}
\end{figure}
Figure~\ref{fig:d-cliques-example-convergence-speed} illustrates the
performance of D-Cliques on MNIST with $n=100$ nodes. Observe that the
convergence speed is
very close
to that of a fully-connected topology, and significantly better than with
a ring or a grid (see Figure~\ref{fig:iid-vs-non-iid-problem}). With
100 nodes, it offers a reduction of $\approx90\%$ in the number of edges
compared to a fully-connected topology. Nonetheless, there is still
significant variance in the accuracy across nodes, which is due to the bias
introduced by inter-clique edges. We address this issue in the next section.
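As a sanity check of this reduction, with ten fully-connected cliques of ten nodes and one edge between every pair of cliques (as in Figure~\ref{fig:d-cliques-figure}), the number of undirected edges is
\[
\underbrace{10 \cdot \binom{10}{2}}_{\text{intra-clique}} + \underbrace{\binom{10}{2}}_{\text{inter-clique}} = 450 + 45 = 495,
\]
compared to $\binom{100}{2} = 4950$ edges for the fully-connected topology, i.e. a reduction of $90\%$.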
\subsection{Importance of Clique Averaging and Momentum}
% To regenerate figure, from results/mnist:
% python ../../../../Software/non-iid-topology-simulator/tools/plot_convergence.py fully-connected/all/2021-03-10-09:25:19-CET no-init-no-clique-avg/fully-connected-cliques/all/2021-03-12-11:12:49-CET no-init/fully-connected-cliques/all/2021-03-12-11:12:01-CET --add-min-max --yaxis test-accuracy --labels '100 nodes non-IID fully-connected' '100 nodes non-IID d-cliques w/o clique avg.' '100 nodes d-cliques non-IID w/ clique avg.' --legend 'lower right' --ymin 89 --ymax 92.5 --font-size 13 --save-figure ../../figures/d-clique-mnist-clique-avg.png --linestyles 'solid' 'dashed' 'dotted'
\begin{figure}[t]
\centering
\includegraphics[width=0.35\textwidth]{../figures/d-clique-mnist-clique-avg}
\caption{\label{fig:d-clique-mnist-clique-avg} Effect of Clique Averaging on MNIST. Y-axis starts at 89.}
\end{figure}
As illustrated in Figure~\ref{fig:d-clique-mnist-clique-avg}, Clique Averaging
significantly reduces the variance of models across nodes and accelerates
convergence to reach the same level as the one obtained with a
fully-connected topology. Note that Clique Averaging induces a small
additional cost, as gradients
and models need to be sent in two separate rounds of messages. Nonetheless, compared to fully connecting all nodes, the total number of messages is reduced by $\approx 80\%$.
\begin{figure}[t]
\centering
% To regenerate figure, from results/cifar10
@@ -627,31 +704,11 @@ IID node with momentum.
\caption{\label{fig:cifar10-momentum} Non-IID Effect of Momentum on CIFAR10 with LeNet}
\end{figure}
As shown in
Figure~\ref{fig:d-cliques-cifar10-momentum-non-iid-clique-avg-effect},
the use of Clique Averaging restores the benefits of momentum and closes the gap
with the centralized setting.
\subsection{Comparing D-Cliques to Other Sparse Topologies}
We demonstrate the advantages of D-Cliques over alternative sparse topologies
@@ -872,6 +929,11 @@ show that D-Cliques can nicely scale with the number of nodes.
\caption{\label{fig:d-cliques-cifar10-convolutional} D-Cliques Convergence Speed with 1000 nodes, non-IID, Constant Updates per Epoch, with Different Inter-Clique Topologies.}
\end{figure*}
\subsection{Cost of Constructing Cliques}
\label{section:cost-cliques}
\dots \todo{EL: Add plots showing convergence speed in terms of skew vs iteration number, as well as absolute computation time}
\section{Related Work}
\label{section:related-work}
@@ -991,18 +1053,16 @@ approximately recover the global distribution.
\appendix
\section{Detailed Algorithms}
We present a more detailed and precise explanation of the algorithm used to establish a small-world
inter-clique topology (Algorithm~\ref{Algorithm:Smallworld}).
% \subsection{D-Cliques Construction}
%
% Algorithm~\ref{Algorithm:D-Clique-Construction} shows the overall approach
% for constructing a D-Cliques topology in the non-IID case.\footnote{An IID
% version of D-Cliques, in which each node has an equal number of examples of
% all classes, can be implemented by picking $\#L$ nodes per clique at random.}
% It expects the following inputs: $L$, the set of all classes present in the global distribution $D = \bigcup_{i \in N} D_i$; $N$, the set of all nodes; a function $classes(S)$, which given a subset $S$ of nodes in $N$ returns the set of classes in their joint local distributions ($D_S = \bigcup_{i \in S} D_i$); a function $intraconnect(DC)$, which given $DC$, a set of cliques (set of set of nodes), creates a set of edges ($\{\{i,j\}, \dots \}$) connecting all nodes within each clique to one another; a function $interconnect(DC)$, which given a set of cliques, creates a set of edges ($\{\{i,j\}, \dots \}$) connecting nodes belonging to different cliques; and a function $weigths(E)$, which given a set of edges, returns the weighted matrix $W_{ij}$. Algorithm~\ref{Algorithm:D-Clique-Construction} returns both $W_{ij}$, for use in D-SGD (Algorithm~\ref{Algorithm:D-PSGD} and~\ref{Algorithm:Clique-Unbiased-D-PSGD}), and $DC$, for use with Clique Averaging (Algorithm~\ref{Algorithm:Clique-Unbiased-D-PSGD}).
% \begin{algorithm}[h]
% \caption{D-Cliques Construction}
@@ -1045,34 +1105,33 @@ Algorithm~\ref{Algorithm:Smallworld} instantiates the function
small-world inter-clique topology as described in Section~\ref{section:interclique-topologies}. It adds a
linear number of inter-clique edges by first arranging cliques on a ring. It then adds a logarithmic number of ``finger'' edges to other cliques on the ring, chosen so that a constant number of edges is added per set, for sets of cliques that are exponentially bigger the further away they are on the ring. ``Finger'' edges are added symmetrically on both sides of the ring to the cliques in each set that are closest to a given set.
\begin{algorithm}[h]
   \caption{$\textit{smallworld}(DC)$: adds $O(\# N \log(\# N))$ edges}
   \label{Algorithm:Smallworld}
   \begin{algorithmic}[1]
      \STATE \textbf{Require:} set of cliques $DC$ (set of set of nodes)
      \STATE ~~size of neighborhood $ns$ (default 2)
      \STATE ~~function $\textit{least\_edges}(S, E)$ that returns one of the nodes in $S$ with the least number of edges in $E$
      \STATE $E \leftarrow \emptyset$ \COMMENT{Set of Edges}
      \STATE $L \leftarrow [ C~\text{for}~C \in DC ]$ \COMMENT{Arrange cliques in a list}
      \FOR{$i \in \{1,\dots,\#DC\}$} % \COMMENT{For every clique}
        % \STATE ~\COMMENT{For sets of cliques exponentially further away from $i$}
        \FOR{$\textit{offset} \in \{ 2^x~\text{for}~x~\in \{ 0, \dots, \lceil \log_2(\#DC) \rceil \} \}$}
          % \STATE \COMMENT{Pick the $ns$ closest}
          \FOR{$k \in \{0,\dots,ns-1\}$}
            % \STATE \COMMENT{Add inter-clique connections in both directions}
            \STATE $n \leftarrow \textit{least\_edges}(L_i, E)$
            \STATE $m \leftarrow \textit{least\_edges}(L_{(i+\textit{offset}+k) \% \#DC}, E)$ %\COMMENT{clockwise in ring}
            \STATE $E \leftarrow E \cup \{ \{n,m\} \}$
            \STATE $n \leftarrow \textit{least\_edges}(L_i, E)$
            \STATE $m \leftarrow \textit{least\_edges}(L_{(i-\textit{offset}-k)\% \#DC} , E)$ %\COMMENT{counter-clockwise in ring}
            \STATE $E \leftarrow E \cup \{ \{n,m\} \}$
          \ENDFOR
        \ENDFOR
      \ENDFOR
      \RETURN E
   \end{algorithmic}
\end{algorithm}
Algorithm~\ref{Algorithm:Smallworld} expects a set of cliques $DC$, previously computed by
Algorithm~\ref{Algorithm:D-Clique-Construction}; a size of neighborhood $ns$,
......