Commit 0c409353 authored by Erick Lavoie

Started refactoring of experiments section

parent 5323f5e1
% !TEX root = main.tex
\appendix
\section{Detailed Algorithms}
We present a more detailed and precise explanation of the algorithm to establish a small-world
inter-clique topology (Algorithm~\ref{Algorithm:Smallworld}).
\subsection{Small-world Inter-clique Topology}
Algorithm~\ref{Algorithm:Smallworld} instantiates the function \textit{interconnect} with a
small-world inter-clique topology as described in Section~\ref{section:interclique-topologies}. It adds a
linear number of inter-clique edges by first arranging cliques on a ring. It then adds a logarithmic number of ``finger'' edges to other cliques on the ring, chosen such that a constant number of edges is added per set, over sets that grow exponentially bigger the further away they are on the ring. ``Finger'' edges are added symmetrically on both sides of the ring, to the cliques in each set that are closest to a given clique.
% \centering
% \includegraphics[width=0.48\textwidth]{figures/d-cliques-mnist-scaling-fully-connected-cst-bsz}
% \caption{FCC: Constant Batch-Size}
% \end{figure}
\section{Additional Experiments with Extreme Node Skew}
\label{app:extreme-local-skew}
In this section, we present additional results for experiments similar to those of Section~\ref{section:evaluation}, but in the presence of
\textit{extreme local class bias}: we consider that each node only has examples from a single class. This extreme partitioning case provides an upper bound on the effect of label distribution skew, suggesting that D-Cliques should perform similarly or better in less extreme cases, as long as a small enough average skew can be obtained in all cliques. In turn, this helps provide insights on why D-Cliques works well, and quantifies the loss in convergence speed
that may result from construction algorithms that generate cliques with higher skew.
\subsection{Non-IID assumptions.}
\label{section:non-iid-assumptions}
To isolate the effect of local class bias from other potentially compounding
factors, we make the following simplifying assumptions: (1) All classes are
equally represented in the global dataset; (2) All classes are represented on
the same number of nodes; (3) All nodes have the same number of examples.
While less realistic than the assumptions used in Section~\ref{section:evaluation},
these assumptions are still reasonable because: (1) Global class imbalance equally
affects the optimization process on a single node and is therefore not
specific to the decentralized setting; (2) Our results do not exploit specific
positions in the topology; (3) Imbalanced dataset sizes across nodes can be
addressed for instance by appropriately weighting the individual loss
functions.
These assumptions actually make the construction of cliques easier, since they
make it possible to build cliques with zero skew, as shown in
Section~\ref{section:ideal-cliques}.
\subsection{Constructing Ideal Cliques}
\label{section:ideal-cliques}
Algorithm~\ref{Algorithm:D-Clique-Construction} shows the overall approach
for constructing a D-Cliques topology under the assumptions of Section~\ref{section:non-iid-assumptions}.\footnote{An IID
version of D-Cliques, in which each node has an equal number of examples of
all classes, can be implemented by picking $\#L$ nodes per clique at random.}
It expects the following inputs: $L$, the set of all classes present in the global distribution $D = \bigcup_{i \in N} D_i$; $N$, the set of all nodes; a function $classes(S)$, which, given a subset $S$ of nodes in $N$, returns the set of classes in their joint local distributions ($D_S = \bigcup_{i \in S} D_i$); a function $intraconnect(DC)$, which, given $DC$, a set of cliques (a set of sets of nodes), creates a set of edges ($\{\{i,j\}, \dots \}$) connecting all nodes within each clique to one another; a function $interconnect(DC)$, which, given a set of cliques, creates a set of edges ($\{\{i,j\}, \dots \}$) connecting nodes belonging to different cliques; and a function $weights(E)$, which, given a set of edges, returns the weight matrix $W_{ij}$. Algorithm~\ref{Algorithm:D-Clique-Construction} returns both $W_{ij}$, for use in D-SGD (Algorithm~\ref{Algorithm:D-PSGD} and~\ref{Algorithm:Clique-Unbiased-D-PSGD}), and $DC$, for use with Clique Averaging (Algorithm~\ref{Algorithm:Clique-Unbiased-D-PSGD}).
\begin{algorithm}[h]
\caption{D-Cliques Construction}
\label{Algorithm:D-Clique-Construction}
\begin{algorithmic}[1]
\STATE \textbf{Require:} set of classes globally present $L$,
\STATE~~ set of all nodes $N = \{ 1, 2, \dots, n \}$,
\STATE~~ fn $\textit{classes}(S)$ that returns the classes present in a subset of nodes $S$,
\STATE~~ fn $\textit{intraconnect}(DC)$ that returns edges intraconnecting cliques of $DC$,
\STATE~~ fn $\textit{interconnect}(DC)$ that returns edges interconnecting cliques of $DC$ (Sec.~\ref{section:interclique-topologies})
\STATE~~ fn $\textit{weights}(E)$ that assigns weights to edges in $E$
\STATE $R \leftarrow \{ n~\text{for}~n \in N \}$ \COMMENT{Remaining nodes}
\STATE $DC \leftarrow \emptyset$ \COMMENT{D-Cliques}
\STATE $\textit{C} \leftarrow \emptyset$ \COMMENT{Current Clique}
\WHILE{$R \neq \emptyset$}
\STATE $n \leftarrow \text{pick}~1~\text{from}~\{ m \in R ~|~ \textit{classes}(\{m\}) \not\subseteq \textit{classes}(\textit{C}) \}$
\STATE $R \leftarrow R \setminus \{ n \}$
\STATE $C \leftarrow C \cup \{ n \}$
\IF{$\textit{classes}(C) = L$}
\STATE $DC \leftarrow DC \cup \{ C \}$
\STATE $C \leftarrow \emptyset$
\ENDIF
\ENDWHILE
\RETURN $(weights(\textit{intraconnect}(DC) \cup \textit{interconnect}(DC)), DC)$
\end{algorithmic}
\end{algorithm}
The implementation builds one clique at a time by adding nodes of different
classes until all classes of the global distribution are represented. Cliques
are built sequentially until all nodes are part of a clique.
Because all classes are represented on an equal number of nodes, all cliques
will have nodes of all classes. Furthermore, since each node has examples
of a single class, a valid assignment can always be found greedily.
After cliques are created, edges are added and weights are assigned,
using the corresponding input functions.
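The greedy construction above can be sketched as follows. This is a minimal illustration under the stated assumptions (one class per node, all classes on the same number of nodes); the function and variable names are ours, not taken from the paper's implementation:

```python
import random
from collections import defaultdict

def build_cliques(node_classes, num_classes):
    """Greedily group nodes into cliques that each cover all classes.

    node_classes: dict mapping node id -> its single class label.
    Assumes every class is present on the same number of nodes.
    """
    # Bucket remaining nodes by class so a missing class can be filled directly.
    by_class = defaultdict(list)
    for node, cls in node_classes.items():
        by_class[cls].append(node)

    cliques, clique, covered = [], [], set()
    remaining = len(node_classes)
    while remaining > 0:
        # Pick a node whose class is not yet represented in the current clique.
        cls = random.choice([c for c in by_class if c not in covered and by_class[c]])
        clique.append(by_class[cls].pop())
        covered.add(cls)
        remaining -= 1
        if len(covered) == num_classes:  # clique now covers every class
            cliques.append(clique)
            clique, covered = [], set()
    return cliques
```

Because each class appears on the same number of nodes and each closed clique consumes exactly one node per class, the class buckets deplete uniformly and a candidate node always exists, which is why the greedy choice never gets stuck.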
\subsection{Evaluation}
\label{section:ideal-cliques-evaluation}
It then suffices to modify the original gradient step to use momentum:
\begin{equation}
\theta_i^{(k-\frac{1}{2})} \leftarrow \theta_i^{(k-1)} - \gamma v_i^{(k)}.
\end{equation}
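The momentum update can be sketched as a single local step, where $v$ is the momentum buffer $v_i^{(k)} \leftarrow m v_i^{(k-1)} + g_i^{(k)}$ followed by the gradient step above. This is a minimal illustration; the names and hyperparameter values are placeholders, not those of the paper's experiments:

```python
import numpy as np

def local_step_with_momentum(theta, v, grad, gamma=0.1, m=0.9):
    """One local update: v <- m*v + grad, then theta <- theta - gamma*v."""
    v = m * v + grad           # update the momentum buffer v_i^(k)
    theta = theta - gamma * v  # gradient step using the momentum buffer
    return theta, v
```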
\section{Scaling the Interclique Topology}
\label{section:interclique-topologies}
So far, we have used a fully-connected inter-clique topology for D-Cliques,
which has the advantage of bounding the
\textit{path length}\footnote{The \textit{path length} is the number of edges on the shortest path between two nodes.} to $3$ between any pair of nodes. This choice requires $\frac{n}{c}(\frac{n}{c} - 1)$ inter-clique edges, which scales quadratically
in the number of nodes $n$ for a given clique size $c$.\footnote{We consider \textit{directed} edges in the analysis: the number of undirected edges is half and does not affect the asymptotic behavior.} This can become significant at larger scales when $n$ is
large compared to $c$.
We first measure the convergence speed of inter-clique topologies whose number of edges scales linearly with the number of nodes. Among those, the \textit{ring} has close to the fewest possible edges: it
uses $\frac{2n}{c}$ inter-clique edges, but its average path length between nodes
also scales linearly.
We also consider another topology, which we call \textit{fractal}, that provides a
logarithmic
bound on the average path length. In this hierarchical scheme,
cliques are assembled in larger groups of $c$ cliques that are connected internally with one edge per
pair of cliques, but with only one edge between pairs of larger groups. The
topology is built recursively such that $c$ groups will themselves form a
larger group at the next level up. This results in at most $c$ edges per node
if edges are evenly distributed: i.e., each group within the same level adds
at most $c-1$ edges to other groups, leaving one node per group with $c-1$
edges that can receive an additional edge to connect with other groups at the next level.
Since nodes have at most $c$ edges, $n$ nodes have at most $nc$ edges, therefore
the number of edges in this fractal scheme indeed scales linearly in the number of nodes.
Second, we look at another scheme
in which the number of edges scales in a near, but not quite, linear fashion.
We propose to connect cliques according to a
small-world-like topology~\cite{watts2000small} applied on top of a
ring~\cite{stoica2003chord}. In this scheme, cliques are first arranged in a
ring. Each clique then adds symmetric edges, both clockwise and
counter-clockwise on the ring, to the $m$ closest cliques in sets of
cliques that grow exponentially bigger the further they are on the ring (see
Algorithm~\ref{Algorithm:Smallworld} in the appendix for
details on the construction). This ensures good connectivity with
cliques that are close on the ring, while still keeping the average
path length small. This scheme uses $2m\frac{n}{c}\log(\frac{n}{c})$ inter-clique edges and
therefore grows in the order of $O(n\log(n))$ with the number of nodes.
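As a rough illustration of this scheme (a sketch following the description above, not the exact Algorithm~\ref{Algorithm:Smallworld}; the function name is ours), the following arranges clique indices on a ring and, for each exponentially growing offset, adds $m$ finger edges symmetrically in both directions:

```python
def smallworld_interconnect(num_cliques, m=1):
    """Return a set of undirected inter-clique edges {i, j} on clique indices."""
    edges = set()
    # Ring edges between consecutive cliques.
    for i in range(num_cliques):
        edges.add(frozenset((i, (i + 1) % num_cliques)))
    # "Finger" edges: for exponentially growing offsets, connect to the m
    # closest cliques at that distance, clockwise and counter-clockwise.
    for i in range(num_cliques):
        offset = 2
        while offset < num_cliques // 2:
            for k in range(m):
                edges.add(frozenset((i, (i + offset + k) % num_cliques)))
                edges.add(frozenset((i, (i - offset - k) % num_cliques)))
            offset *= 2
    return edges
```

Each clique gets $O(\log \frac{n}{c})$ fingers, so the total number of inter-clique edges grows as $O(\frac{n}{c}\log\frac{n}{c})$, consistent with the $O(n\log n)$ bound above.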
% !TEX root = main.tex
\section{Evaluation}
\label{section:evaluation}
%In this section, we first compare D-Cliques to alternative topologies to
%confirm the relevance of our main design choices. Then,
%we evaluate some extensions of D-Cliques to further reduce the number of
%inter-clique connections so as to gracefully scale with the number of
%nodes.
\todo{EL: Revise intro to section}
\subsection{Experimental setup.}
\label{section:experimental-settings}
Our main goal is to provide a fair comparison of the convergence speed across
balanced: we use 45k/50k images of the original training set for training,
5k/50k for validation, and all 10k examples of the test set for measuring
prediction accuracy.
We use the non-IID partitioning scheme proposed for MNIST in the seminal Federated Learning paper of~\cite{mcmahan2016communication}, on both MNIST and CIFAR10:
we sort all training examples by class, split the sorted list into shards of
equal size, and distribute the shards to nodes randomly such that each node receives two shards.
When the number of examples of one class does not divide evenly into shards, as is the case for MNIST, some
shards may contain examples of more than one class, so nodes may have examples
of up to 4 classes. However, most nodes will have examples of 2 classes. The varying number
of classes, as well as the varying distribution of examples within a single node, makes the task
of creating cliques with low skew non-trivial.
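The shard-based partitioning can be sketched as follows; this is a minimal illustration of the scheme described above (function and variable names are ours), dropping any remainder examples that do not fill a shard:

```python
import random

def shard_partition(labels, num_nodes, shards_per_node=2, seed=0):
    """Sort example indices by class, split into equal shards, and deal
    shards_per_node shards to each node at random."""
    order = sorted(range(len(labels)), key=lambda i: labels[i])
    num_shards = num_nodes * shards_per_node
    shard_size = len(labels) // num_shards
    shards = [order[s * shard_size:(s + 1) * shard_size]
              for s in range(num_shards)]
    random.Random(seed).shuffle(shards)
    # Each node receives shards_per_node consecutive shards of the shuffled list.
    return [sum(shards[n * shards_per_node:(n + 1) * shards_per_node], [])
            for n in range(num_nodes)]
```

Because the list is sorted by class, each shard spans at most a few adjacent classes, which reproduces the "mostly 2 classes per node" behavior described above.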
We use a logistic regression classifier for MNIST, which
provides up to 92.5\% accuracy in the centralized setting.
node. In both cases, the topology has no effect on
the optimization. For a certain choice of number of nodes and
mini-batch size, both approaches are equivalent.
\subsection{D-Cliques Match the Convergence Speed of Fully-Connected with a Fraction of the Edges}
\begin{figure}[t]
\centering
compared to a fully-connected topology. Nonetheless, there is still
significant variance in the accuracy across nodes, which is due to the bias
introduced by inter-clique edges. We address this issue in the next section.
\subsection{D-Cliques Converge Faster than Random Graphs}
We demonstrate the advantages of D-Cliques over alternative sparse topologies
that have a similar number of edges. First, we consider topologies in which
Overall, these results show that achieving fast convergence on non-IID
data with sparse topologies requires a very careful design, as we have
proposed with D-Cliques.
\subsection{Cliques built with Greedy Swap Converge Significantly Faster than Random Cliques}
\subsection{Clique Averaging and Momentum are Necessary}
% To regenerate figure, from results/mnist:
% python ../../../../Software/non-iid-topology-simulator/tools/plot_convergence.py fully-connected/all/2021-03-10-09:25:19-CET no-init-no-clique-avg/fully-connected-cliques/all/2021-03-12-11:12:49-CET no-init/fully-connected-cliques/all/2021-03-12-11:12:01-CET --add-min-max --yaxis test-accuracy --labels '100 nodes non-IID fully-connected' '100 nodes non-IID d-cliques w/o clique avg.' '100 nodes d-cliques non-IID w/ clique avg.' --legend 'lower right' --ymin 89 --ymax 92.5 --font-size 13 --save-figure ../../figures/d-clique-mnist-clique-avg.png --linestyles 'solid' 'dashed' 'dotted'
\begin{figure}[t]
\centering
\includegraphics[width=0.35\textwidth]{../figures/d-clique-mnist-clique-avg}
\caption{\label{fig:d-clique-mnist-clique-avg} Effect of Clique Averaging on MNIST. Y-axis starts at 89.}
\end{figure}
As illustrated in Figure~\ref{fig:d-clique-mnist-clique-avg}, Clique Averaging
significantly reduces the variance of models across nodes and accelerates
convergence to reach the same level as the one obtained with a
fully-connected topology. Note that Clique Averaging induces a small
additional cost, as gradients
and models need to be sent in two separate rounds of messages. Nonetheless, compared to fully connecting all nodes, the total number of messages is reduced by $\approx 80\%$.
\begin{figure}[t]
\centering
% To regenerate figure, from results/cifar10
% python ../../../../Software/non-iid-topology-simulator/tools/plot_convergence.py 1-node-iid/all/2021-03-10-13:52:58-CET no-init-no-clique-avg/fully-connected-cliques/all/2021-03-13-18:34:35-CET no-init-no-clique-avg-no-momentum/fully-connected-cliques/all/2021-03-26-13:47:35-CET/ --legend 'upper right' --add-min-max --labels '1-node IID w/ momentum' '100 nodes non-IID d-cliques w/ momentum' '100 nodes non-IID d-cliques w/o momentum' --font-size 14 --yaxis test-accuracy --save-figure ../../figures/d-cliques-cifar10-momentum-non-iid-effect.png --ymax 100 --linestyles 'solid' 'dashed' 'dotted'
\begin{subfigure}[b]{0.35\textwidth}
\centering
\includegraphics[width=\textwidth]{../figures/d-cliques-cifar10-momentum-non-iid-effect}
\caption{\label{fig:d-cliques-cifar10-momentum-non-iid-effect} Without Clique Averaging }
\end{subfigure}
\hfill
% To regenerate figure, from results/cifar10
% python ../../../../Software/non-iid-topology-simulator/tools/plot_convergence.py 1-node-iid/all/2021-03-10-13:52:58-CET no-init/fully-connected-cliques/all/2021-03-13-18:32:55-CET --legend 'upper right' --add-min-max --labels '1-node IID w/ momentum' '100 nodes non-IID d-clique w/ momentum' --font-size 14 --yaxis test-accuracy --save-figure ../../figures/d-cliques-cifar10-momentum-non-iid-clique-avg-effect.png --ymax 100 --linestyles 'solid' 'dashed' 'dotted'
\begin{subfigure}[b]{0.35\textwidth}
\centering
\includegraphics[width=\textwidth]{../figures/d-cliques-cifar10-momentum-non-iid-clique-avg-effect}
\caption{\label{fig:d-cliques-cifar10-momentum-non-iid-clique-avg-effect} With Clique Averaging}
\end{subfigure}
\caption{\label{fig:cifar10-momentum} Non-IID Effect of Momentum on CIFAR10 with LeNet}
\end{figure}
As shown in
Figure~\ref{fig:d-cliques-cifar10-momentum-non-iid-clique-avg-effect},
the use of Clique Averaging restores the benefits of momentum and closes the gap
with the centralized setting.
\subsection{Full Intraclique Connectivity is Necessary}
\begin{figure*}[t]
\centering
\caption{\label{fig:d-cliques-intra-connectivity} Importance of Intra-Clique Full-Connectivity}
\end{figure*}
\subsection{D-Cliques Appear to Scale with Sparser Inter-Clique Topologies}
In this last series of experiments, we evaluate the effect of choosing sparser
inter-clique topologies on the convergence speed for a larger network of 1000
nodes. We compare the scalability and convergence speed of the several
D-Cliques variants introduced in Section~\ref{section:interclique-topologies}.
Figure~\ref{fig:d-cliques-cifar10-convolutional} shows the convergence
speed of all sparse inter-clique topologies on MNIST and CIFAR10, compared to the ideal
baseline of a single IID node performing the same number of updates per epoch (representing
show that D-Cliques can nicely scale with the number of nodes.
\caption{\label{fig:d-cliques-cifar10-convolutional} D-Cliques Convergence Speed with 1000 nodes, non-IID, Constant Updates per Epoch, with Different Inter-Clique Topologies.}
\end{figure*}
\subsection{Good Cliques Can Be Constructed Efficiently}
\label{section:cost-cliques}
\subsubsection{Quality of Construction (Skew)}
\subsubsection{Cost of Construction}
\dots \todo{EL: Add plots showing convergence speed in terms of skew vs iteration number, as well as absolute computation time}
\begin{document}
\twocolumn[
\mlsystitle{D-Cliques: Compensating Data Heterogeneity with Topology in Decentralized
Federated Learning}
% It is OKAY to include author information, even for blind