diff --git a/main.tex b/main.tex
index 4bb44f5b5f0e543b692767759d3af198be213baa..3ba9c38420899ce30f7a77b466906e58a4a3c200 100644
--- a/main.tex
+++ b/main.tex
@@ -53,8 +53,7 @@ EPFL, Lausanne, Switzerland \\
 \maketitle              % typeset the header of the contribution
 %
 \begin{abstract}
-The abstract should briefly summarize the contents of the paper in
-150--250 words.
+The convergence speed of machine learning models trained with Federated Learning is significantly affected by non-identically and independently distributed (non-IID) data partitions, even more so in a fully decentralized (serverless) setting. We propose the D-Cliques topology, which reduces gradient bias by grouping nodes in cliques such that their local joint distribution is representative of the global distribution. D-Cliques provide a convergence speed similar to that of a fully-connected topology, both in IID and non-IID settings, with a significant reduction in the number of required edges and messages: at a scale of 1000 nodes, 98\% fewer edges and 96\% fewer messages in total. We show how D-Cliques can be used to successfully implement momentum, which is critical to quickly train deep convolutional networks but otherwise detrimental in a non-IID setting. We finally show that, among many possible inter-clique topologies, a small-world topology, in which the number of edges scales logarithmically in the number of nodes, provides a further 22\% reduction in the number of edges at 1000 nodes, compared to fully connecting cliques pairwise with single edges, and suggests even larger gains at larger scales.
 
 \keywords{Decentralized Learning \and Federated Learning \and Topology \and
 Non-IID Data \and Stochastic Gradient Descent}
@@ -146,13 +145,12 @@ applications.
 \aurelien{TODO: complete above paragraph with more details and highlighting
 other contributions as needed}
 
-
-
 To summarize, our contributions are as follows:
 \begin{enumerate}
-\item TODO
-\item 
-\item 
+  \item we show that the topology has a significant impact on convergence speed in the presence of non-IID data in decentralized learning;
+  \item we propose the D-Cliques topology, which removes the impact of non-IID data on convergence speed and matches the convergence speed of a fully-connected topology. At a scale of 1000 nodes, this represents a 98\% reduction in the number of edges ($18.9$ vs $999$ edges per node on average) and a 96\% reduction in the total number of required messages (see the back-of-the-envelope calculation below);
+  \item we show how to leverage D-Cliques to implement momentum, an optimization technique that is critical for quickly training convolutional networks but otherwise significantly \textit{decreases} convergence speed in the presence of non-IID data;
+  \item we show that, among the many possible choices of inter-clique topologies, a small-world topology provides a convergence speed close to fully connecting all cliques pairwise, while requiring only $O(n + \log(n))$ instead of $O(n^2)$ edges, where $n$ is the number of nodes. At a scale of 1000 nodes, this represents a 22\% reduction in the number of edges compared to fully connecting cliques ($14.6$ vs $18.9$ edges per node on average) and suggests that even larger gains are possible at larger scales.
 \end{enumerate}
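+
+These per-node figures can be checked with a quick back-of-the-envelope calculation, assuming cliques of size $c=10$ (one node per class) that are fully connected pairwise with single inter-clique edges: each node has $c-1$ intra-clique edges plus, on average, a $\frac{2}{n}$ share of the $\frac{n}{c}(\frac{n}{c}-1)/2$ inter-clique edges, i.e. at $n=1000$
+\[
+(c - 1) + \frac{2}{n} \cdot \frac{\frac{n}{c}\left(\frac{n}{c}-1\right)}{2} = 9 + 9.9 = 18.9 \mbox{ edges per node,} \quad \mbox{vs } n - 1 = 999 \mbox{ when fully connected,}
+\]
+a reduction of roughly $1 - \frac{18.9}{999} \approx 98\%$.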
 
 The rest of this paper is organized as follows. \aurelien{TO COMPLETE}
@@ -187,61 +185,6 @@ The rest of this paper is organized as follows. \aurelien{TO COMPLETE}
 
 \footnotetext{This is different from the accuracy of the average model across nodes that is sometimes used once training is completed.}
 
-\subsection{Bias in Gradient Averaging with Non-IID Data}
-
-\aurelien{I think this should go into the approach section, to motivate it.
-In the introduction, maybe we can just give the main intuitions in a few
-sentences?}
-
-To have a preliminary intuition of the impact of non-IID data on convergence speed, examine the local neighbourhood of a single node in a grid similar to that used to obtain results in Figure~\ref{fig:grid-IID-vs-non-IID}, as illustrated in Figure~\ref{fig:grid-iid-vs-non-iid-neighbourhood}. The color of a node, represented as a circle, corresponds to one of the 10 available classes in the dataset. In this IID setting (Figure~\ref{fig:grid-iid-neighbourhood}), each node has examples of all ten classes in equal proportions. In  this (rather extreme) non-IID case (Figure~\ref{fig:grid-non-iid-neighbourhood}), each node has examples of only a single class and nodes are distributed randomly in the grid, with neighbourhood such as this one, sometimes having nodes with examples of the same class adjacent to each other.
-
-\begin{figure}
-     \centering
-     \begin{subfigure}[b]{0.33\textwidth}
-         \centering
-         \includegraphics[width=\textwidth]{figures/grid-iid-neighbourhood}
-\caption{\label{fig:grid-iid-neighbourhood} IID}
-     \end{subfigure}
-     \begin{subfigure}[b]{0.33\textwidth}
-         \centering
-         \includegraphics[width=\textwidth]{figures/grid-non-iid-neighbourhood}
-\caption{\label{fig:grid-non-iid-neighbourhood}  Non-IID}
-     \end{subfigure}
-        \caption{Neighbourhood in an IID and non-IID Grid.}
-        \label{fig:grid-iid-vs-non-iid-neighbourhood}
-\end{figure}
-
-For the sake of the argument, assume all nodes are initialized with the same model weights, which is not critical for quick convergence in an IID setting but makes the comparison easier. A single training step, from the point of view of the middle node of the illustrated neighbourhood, is equivalent to sampling a mini-batch five times larger from the union of the local distributions of the five illustrated nodes. 
-
-In the IID case, since gradients are computed from examples of all classes, the resulting average gradient will point in a direction that lowers the loss for all classes. This is the case because the components of the gradient that would only improve the loss on a subset of the classes to the detriment of others are cancelled by similar but opposite components from other classes. Therefore only the components that improve the loss for all classes remain. There is some variance remaining from the difference between examples but in practice it has a sufficiently small impact on convergence speed that there are still benefits from parallelizing the computations.
-
-However, in the (rather extreme) non-IID case illustrated, there are not enough nodes in the neighbourhood to remove the bias of the classes represented. Even if all nodes start from the same model weights, they will diverge from one another according to the classes represented in their neighbourhood, more than they would have had in the IID case. As the distributed averaging algorithm takes several steps to converge, this variance is never fully resolved and the variance remains between steps.\footnote{It is possible, but impractical, to compensate for this effect by averaging multiple times before the next gradient computation. In effect, this trades connectivity (number of edges) for latency to give the same convergence speed, in number of gradients computed, as a fully connected graph.} This additional variance biases subsequent gradient computations as the gradients are computed further away from the global average, in addition to being computed from different examples. As shown in Figure~\ref{fig:ring-IID-vs-non-IID} and \ref{fig:grid-IID-vs-non-IID}, this significantly slows down convergence speed to the point of making parallel optimization impractical.
-
-
-\subsection{D-Cliques}
-
-\aurelien{this should definitely go to approach section}
-
-     \begin{figure}[htbp]
-         \centering
-         \includegraphics[width=0.4\textwidth]{figures/fully-connected-cliques}
-\caption{\label{fig:d-cliques-example} D-Cliques: Connected Cliques of Dissimilar Nodes, Locally Representative of the Global Distribution}
-     \end{figure}
-
-If we relax the constraint of regularity, a trivial solution is a star topology, as used in most Federated Learning implementations (CITE) at the expense of a high requirement on reliability and available bandwidth on the central node. We instead propose a regular topology, built around \textit{cliques} of dissimilar nodes, locally representative of the global distribution and connected by few links, as illustrated in Figure~\ref{fig:d-cliques-example}. D-Cliques enable similar convergence speed as a fully connected topology, using a number of edges that grows sub-exponentially ($O(nc + \frac{n^2}{c^2})$ where $n$ is the number of nodes and $c$ is the size of a clique\footnote{$O((\frac{n}{c})c^2 + (\frac{n}{c})^2)$, i.e. number of cliques times the number of edges within cliques (squared in the size of cliques) in addition to inter-cliques edges (square of the number of cliques).}.), instead of exponentially in the number of nodes ($O(n^2)$), with a corresponding reduction in bandwidth usage and required number of messages per round of training. In practice, for the cases with networks of size 100 we have tested, that corresponds to a reduction in the number of edges of 90\%. (TODO: Do analysis if the pattern is fractal with three levels at 1000 nodes: cliques, 10 cliques connected pairwise in a "region", and each "region" connected pairwise with other regions)
-
-Because the data distribution within each clique is representative of the global distribution, we can recover optimization techniques that rely on an IID assumption, in a distributed setting that is not. As one example, we show how momentum (CITE) can be used with D-Cliques to greatly improve convergence speed of convolutional networks, as in a centralized IID setting, even though the technique is otherwise \textit{detrimental} in a more general non-IID setting. 
-
-As a summary, we make the following contributions:
-\begin{itemize}
-  \item significant impact of topology on non-iid data
-  \item we propose the D-Cliques topology to remove the impact of non-IID data on convergence speed, similar to a fully-connected topology, with a reduced number of edges and required messages
-  \item we show how to leverage D-Cliques to implement momentum in a distributed non-IID setting, which would otherwise be detrimental to the convergence speed of convolutional networks
-  \item scale (>16 noeuds)
-\end{itemize}
-
-The rest of the paper is organized as such. \dots
-
 \section{Problem Statement}
 
 \label{section:problem}
@@ -289,9 +232,50 @@ D-PSGD can be used with a variety of models, including deep learning networks. T
 %
 %From the perspective of one node, \textit{clustering} intuitively represents how many connections exist between its immediate neighbours. A high level of clustering means that neighbours have many edges between each other. The highest level is a \textit{clique}, where all nodes in the neighbourhood are connected to one another. Formally, the level of clustering, between $0$ and $1$, is the ratio of $\frac{\textit{nb edges between neighbours}}{\textit{nb possible edges}}$~\cite{watts2000small}.
 %
+\section{Motivation: Bias in Gradient Averaging with Non-IID Data}
+
+\aurelien{I think this should go into the approach section, to motivate it.
+In the introduction, maybe we can just give the main intuitions in a few
+sentences?}
+
+To gain a preliminary intuition of the impact of non-IID data on convergence speed, consider the local neighbourhood of a single node in a grid similar to the one used to obtain the results of Figure~\ref{fig:grid-IID-vs-non-IID}, as illustrated in Figure~\ref{fig:grid-iid-vs-non-iid-neighbourhood}. Each node is represented as a circle whose color corresponds to one of the 10 classes in the dataset. In the IID setting (Figure~\ref{fig:grid-iid-neighbourhood}), each node holds examples of all ten classes in equal proportions. In the (rather extreme) non-IID case (Figure~\ref{fig:grid-non-iid-neighbourhood}), each node holds examples of only a single class and nodes are placed randomly in the grid; in neighbourhoods such as this one, nodes holding examples of the same class may therefore end up adjacent to each other.
+
+\begin{figure}
+     \centering
+     \begin{subfigure}[b]{0.33\textwidth}
+         \centering
+         \includegraphics[width=\textwidth]{figures/grid-iid-neighbourhood}
+\caption{\label{fig:grid-iid-neighbourhood} IID}
+     \end{subfigure}
+     \begin{subfigure}[b]{0.33\textwidth}
+         \centering
+         \includegraphics[width=\textwidth]{figures/grid-non-iid-neighbourhood}
+\caption{\label{fig:grid-non-iid-neighbourhood}  Non-IID}
+     \end{subfigure}
+        \caption{Neighbourhood in an IID and non-IID Grid.}
+        \label{fig:grid-iid-vs-non-iid-neighbourhood}
+\end{figure}
+
+For the sake of the argument, assume all nodes are initialized with the same model weights, which is not critical for quick convergence in an IID setting but makes the comparison easier. A single training step, from the point of view of the middle node of the illustrated neighbourhood, is equivalent to sampling a mini-batch five times larger from the union of the local distributions of the five illustrated nodes. 
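+
+In symbols (with notation introduced here only for this illustration): if each of the five nodes $i$ holds the same weights $\theta$ and samples a mini-batch $B_i$ of size $m$ from its local distribution, the gradient averaged over the neighbourhood is
+\[
+\frac{1}{5} \sum_{i=1}^{5} \frac{1}{m} \sum_{s \in B_i} \nabla_\theta \ell(s;\theta) = \frac{1}{5m} \sum_{s \in B_1 \cup \dots \cup B_5} \nabla_\theta \ell(s;\theta),
+\]
+i.e. exactly the gradient of a single mini-batch of size $5m$ drawn from the union of the five local distributions, since the local batches are disjoint.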
+
+In the IID case, since gradients are computed from examples of all classes, the resulting average gradient points in a direction that lowers the loss for all classes: the components of the gradient that would improve the loss only on a subset of the classes, to the detriment of others, are cancelled by similar but opposite components coming from the other classes, so only the components that improve the loss for all classes remain. Some variance remains due to the differences between individual examples, but in practice its impact on convergence speed is small enough that parallelizing the computations is still beneficial.
 
+However, in the (rather extreme) non-IID case illustrated, there are not enough nodes in the neighbourhood to remove the bias towards the classes represented. Even if all nodes start from the same model weights, they will diverge from one another according to the classes represented in their neighbourhood, more than they would in the IID case. As the distributed averaging algorithm takes several steps to converge, this variance is never fully resolved and persists across steps.\footnote{It is possible, but impractical, to compensate for this effect by averaging multiple times before the next gradient computation. In effect, this trades connectivity (number of edges) for latency to obtain the same convergence speed, in number of gradients computed, as a fully-connected graph.} This additional variance biases subsequent gradient computations, as the gradients are computed from models that are further away from the global average, in addition to being computed from different examples. As shown in Figures~\ref{fig:ring-IID-vs-non-IID} and~\ref{fig:grid-IID-vs-non-IID}, this significantly slows down convergence, to the point of making parallel optimization impractical.
+
+
+\subsection{D-Cliques}
 
-\section{D-Cliques}
+\aurelien{this should definitely go to approach section}
+
+     \begin{figure}[htbp]
+         \centering
+         \includegraphics[width=0.4\textwidth]{figures/fully-connected-cliques}
+\caption{\label{fig:d-cliques-example} D-Cliques: Connected Cliques of Dissimilar Nodes, Locally Representative of the Global Distribution}
+     \end{figure}
+
+If we relax the constraint of regularity, a trivial solution is a star topology, as used in most Federated Learning implementations (CITE), at the expense of high reliability and bandwidth requirements on the central node. We instead propose a regular topology, built around \textit{cliques} of dissimilar nodes that are locally representative of the global distribution and connected by few links, as illustrated in Figure~\ref{fig:d-cliques-example}. D-Cliques enable a convergence speed similar to that of a fully-connected topology, using a number of edges that grows much more slowly ($O(nc + \frac{n^2}{c^2})$, where $n$ is the number of nodes and $c$ is the size of a clique\footnote{$O((\frac{n}{c})c^2 + (\frac{n}{c})^2)$, i.e. the number of cliques times the number of edges within a clique (quadratic in the clique size), plus the inter-clique edges (quadratic in the number of cliques).}) than quadratically in the number of nodes ($O(n^2)$), with a corresponding reduction in bandwidth usage and in the number of messages required per round of training. In practice, for the networks of 100 nodes we have tested, this corresponds to a 90\% reduction in the number of edges. (TODO: Do analysis if the pattern is fractal with three levels at 1000 nodes: cliques, 10 cliques connected pairwise in a "region", and each "region" connected pairwise with other regions)
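+
+The 90\% figure follows directly from these counts: assuming cliques of size $c=10$ (one node per class) at $n=100$ nodes, the $10$ cliques contribute $10 \cdot \frac{10 \cdot 9}{2} = 450$ intra-clique edges and, fully connected pairwise, $\frac{10 \cdot 9}{2} = 45$ inter-clique edges, i.e. $495$ edges in total against $\frac{100 \cdot 99}{2} = 4950$ for a fully-connected graph.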
+
+Because the data distribution within each clique is representative of the global distribution, we can recover optimization techniques that rely on an IID assumption, even though the distributed setting as a whole is not IID. As one example, we show how momentum (CITE) can be used with D-Cliques to greatly improve the convergence speed of convolutional networks, as it does in a centralized IID setting, even though the technique is otherwise \textit{detrimental} in a more general non-IID setting.
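+
+As a reminder (with notation introduced here only for this illustration), classical momentum maintains a velocity term that accumulates past gradients:
+\[
+v_t = \gamma v_{t-1} + g_t, \qquad \theta_{t+1} = \theta_t - \eta v_t,
+\]
+where $\gamma$ is the momentum factor, $\eta$ the learning rate, and $g_t$ the mini-batch gradient. The intuition behind combining it with D-Cliques is that $g_t$ can be computed from gradients averaged within a clique, and is therefore approximately unbiased with respect to the global distribution, as in the centralized IID setting.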
 
 Three Main ideas:
 \begin{itemize}
@@ -367,7 +351,7 @@ We solve this problem by decoupling the gradient averaging from the weight avera
 % python ../../../learn-topology/tools/plot_convergence.py 1-node-iid/all/2021-03-10-09:20:03-CET fully-connected/all/2021-03-10-09:25:19-CET clique-ring/all/2021-03-10-18:14:35-CET fully-connected-cliques/all/2021-03-10-10:19:44-CET --add-min-max --yaxis test-accuracy --labels '1-node IID bsz=12800' '100-nodes non-IID fully-connected bsz=128' '100-nodes non-IID D-Cliques (Ring) bsz=128' '100-nodes non-IID D-Cliques (Fully-Connected) bsz=128' --legend 'lower right' --ymin 85 --ymax 92.5 --save-figure ../../figures/d-cliques-mnist-vs-1-node-test-accuracy.png
          \centering
          \includegraphics[width=0.7\textwidth]{figures/d-cliques-mnist-vs-1-node-test-accuracy}
-         \caption{\label{fig:d-cliques-mnist-linear-w-clique-averaging-w-initial-averaging} MNIST: D-Cliques Convergence Speed (100 nodes)}
+         \caption{\label{fig:d-cliques-mnist-linear-w-clique-averaging-w-initial-averaging} MNIST: D-Cliques Convergence Speed (100 nodes, Constant Updates per Epoch)}
         \end{figure}
         
  % To regenerate the figure, from directory results/mnist
@@ -375,7 +359,7 @@ We solve this problem by decoupling the gradient averaging from the weight avera
              \begin{figure}[htbp]
      \centering
             \includegraphics[width=0.7\textwidth]{figures/d-cliques-mnist-1000-nodes-comparison}
-             \caption{\label{fig:d-cliques-mnist-1000-nodes-comparison} MNIST: D-Clique Convergence Speed  (1000 nodes)}
+             \caption{\label{fig:d-cliques-mnist-1000-nodes-comparison} MNIST: D-Clique Convergence Speed  (1000 nodes, Constant Updates per Epoch)}
      \end{figure}
      
     \begin{figure}[htbp]
@@ -535,7 +519,7 @@ In addition, it is important that all nodes are initialized with the same model
          \includegraphics[width=\textwidth]{figures/d-cliques-cifar10-vs-1-node-test-accuracy}
 \caption{\label{fig:d-cliques-cifar10-test-accuracy}  Test Accuracy}
      \end{subfigure}
-\caption{\label{fig:d-cliques-cifar10-convolutional} D-Cliques Convergence Speed with Convolutional Network on CIFAR10 (100 nodes).}
+\caption{\label{fig:d-cliques-cifar10-convolutional} D-Cliques Convergence Speed with Convolutional Network on CIFAR10 (100 nodes, Constant Updates per Epoch).}
 \end{figure}
 
 
@@ -556,7 +540,7 @@ In addition, it is important that all nodes are initialized with the same model
          \includegraphics[width=\textwidth]{figures/d-cliques-cifar10-1000-vs-1-node-test-accuracy}
 \caption{\label{fig:d-cliques-cifar10-1000-vs-1-node-test-accuracy}  Test Accuracy}
      \end{subfigure}
-\caption{\label{fig:d-cliques-cifar10-convolutional} D-Cliques Convergence Speed with Convolutional Network on CIFAR10 (1000 nodes).}
+\caption{\label{fig:d-cliques-cifar10-convolutional} D-Cliques Convergence Speed with Convolutional Network on CIFAR10 (1000 nodes, Constant Updates per Epoch).}
 \end{figure}
 
 
@@ -747,14 +731,14 @@ that would otherwise bias the direction of the gradient.
 % with variance reduction) or multiple averaging steps.
 
 
-\section{Future Work}
-\begin{itemize}
-  \item Non-uniform Class Representation
-  \item End-to-End Wall-Clock Training Time, including Clique Formation
-  \item Comparison to Shuffling Data in a Data Center
-  \item Behaviour in the Presence of Churn
-  \item Relaxing Clique Connectivity: Randomly choose a subset of clique neighbours to compute average gradient.
-\end{itemize}
+%\section{Future Work}
+%\begin{itemize}
+%  \item Non-uniform Class Representation
+%  \item End-to-End Wall-Clock Training Time, including Clique Formation
+%  \item Comparison to Shuffling Data in a Data Center
+%  \item Behaviour in the Presence of Churn
+%  \item Relaxing Clique Connectivity: Randomly choose a subset of clique neighbours to compute average gradient.
+%\end{itemize}
 
 \section{Conclusion}