Commit ff482913 authored by Erick Lavoie's avatar Erick Lavoie

Completed first draft of evaluation section

parent 437d8e52
......@@ -345,7 +345,8 @@ of the local models across nodes.
inter-clique connections (see main text for details).}
\end{figure}
\paragraph{\label{section:clique-averaging} Clique Averaging.}
We address this problem by adding \emph{Clique
Averaging} to D-SGD
(Algorithm~\ref{Algorithm:Clique-Unbiased-D-PSGD}), which essentially
decouples gradient averaging from model averaging. The idea is to use only the
......
......@@ -3,13 +3,13 @@
\section{Evaluation}
\label{section:evaluation}
In this section, we first compare D-Cliques to alternative topologies to
show their benefits and the relevance of our main design choices. Then,
we evaluate different inter-clique topologies that further reduce the number of
inter-clique connections so as to gracefully scale with the number of
nodes. We finally show that D-Cliques are resilient to some intra-clique
connectivity failures and that Greedy Swap (Alg.~\ref{Algorithm:greedy-swap})
efficiently constructs cliques with low skew.
\subsection{Experimental Setup}
\label{section:experimental-settings}
......@@ -21,15 +21,16 @@ can remove much of the effect of label distribution skew.
We experiment with two datasets: MNIST~\cite{mnistWebsite} and
CIFAR10~\cite{krizhevsky2009learning}, which both have $L=10$ classes.
For MNIST, we use 50k and 10k examples from the original 60k training
set for training and validation respectively. We use all 10k examples of
the test set to measure prediction accuracy. The validation set preserves the
original unbalanced class ratios of the test set.
For CIFAR10, classes are evenly balanced: we initially used 45k/50k images
of the original training set for training, 5k/50k for validation, and all 10k examples
of the test set for measuring prediction accuracy. After tuning hyper-parameters
on these initial experiments, we used all 50k images of the original training set
for training in all experiments, as 45k images do not split evenly across 1000 nodes
in the partitioning scheme explained in the next paragraph.
We use the non-IID partitioning scheme proposed for MNIST by~\cite{mcmahan2016communication}
in their seminal Federated Learning paper, on both MNIST and CIFAR10:
......@@ -45,7 +46,7 @@ We
use a logistic regression classifier for MNIST, which
provides up to 92.5\% accuracy in the centralized setting.
For CIFAR10, we use a Group-Normalized variant of LeNet~\cite{quagmire}, a
deep convolutional network which achieves an accuracy of $74.15\%$ in the
centralized setting.
These models are thus reasonably accurate (which is sufficient to
study the effect of the topology) while being sufficiently fast to train in a
......@@ -56,11 +57,13 @@ validation set for 100 nodes, obtaining respectively $0.1$ and $128$ for
MNIST and $0.002$ and $20$ for CIFAR10.
For CIFAR10, we additionally use a momentum of $0.9$.
We evaluate 100- and 1000-node networks by creating multiple models
in memory and simulating the exchange of messages between nodes.
To ignore the impact of distributed execution strategies and system
optimization techniques, we report the test accuracy of all nodes (min, max,
average) as a function of the number of times each example of the dataset has
been sampled by a node, i.e. an \textit{epoch}. This is equivalent to the classic
case of a single node sampling the full distribution.
To make results further comparable across different numbers of nodes, we
scale the batch size inversely with the number of nodes,
e.g. on MNIST, 128 with 100 nodes vs. 13 with 1000 nodes. This
......@@ -72,18 +75,36 @@ resulting communication overhead is impractical.}
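The inverse batch-size scaling above keeps the total number of examples sampled per round roughly constant across network sizes. A minimal sketch (the helper name is illustrative, not from the paper's simulator):

```python
def per_node_batch_size(reference_batch: int, reference_nodes: int, nodes: int) -> int:
    """Scale the per-node mini-batch size inversely with the number of nodes,
    keeping the total number of examples sampled per round roughly constant."""
    return max(1, round(reference_batch * reference_nodes / nodes))

# e.g. on MNIST: 128 with 100 nodes, 13 with 1000 nodes
```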
Finally, we compare our results against an ideal baseline: either a
fully-connected network topology with the same number of nodes or a single IID
node. In both cases, the topology has no effect on
the optimization. Both approaches effectively optimize a single model and sample
uniformly from the global distribution, yet we have observed a fully-connected network
to converge slightly faster and reach slightly
better final accuracy than a single node sampling from the global distribution in the presence
of data heterogeneity\footnote{We
conjecture that a heterogeneous data partition in a fully-connected network may force
a more balanced representation of all classes in the union of all mini-batches, leading to better convergence.}.
We therefore compare against a fully-connected network, unless the simulation time
prevented us from obtaining results in time for submission, in which case we use a single IID node.
\subsection{D-Cliques Match the Convergence Speed of Fully-Connected with a Fraction of the Edges}
\label{section:d-cliques-vs-fully-connected}
In this first experiment, we show that D-Cliques with Clique Averaging (and Momentum on CIFAR10) converge
almost as fast as a fully-connected network on both MNIST and CIFAR10. Figure~\ref{fig:convergence-speed-dc-vs-fc-2-shards-per-node}
illustrates the convergence speed of D-Cliques with $n=100$ nodes on MNIST (with Clique Averaging)
and CIFAR10 (with Clique Averaging and Momentum). Observe that the convergence speed is
very close to that of a fully-connected topology, and significantly better than with
a ring or a grid (see Figure~\ref{fig:iid-vs-non-iid-problem}).
D-Cliques also show less variance than both the ring and the grid. With
100 nodes, they offer a reduction of $\approx90\%$ in the number of edges
compared to a fully-connected topology.
% From directory 'results-v2':
% MNIST
% python $TOOLS/analyze/filter.py all --dataset:name mnist --topology:name fully-connected d-cliques/greedy-swap --nodes:name 2-shards-uneq-classes --meta:seed 1 --nodes:nb-nodes 100 | python $TOOLS/analyze/diff.py
% python $TOOLS/analyze/diff.py --rundirs all/2021-09-28-23:16:47-CEST-labostrex117 all/2021-09-28-23:18:49-CEST-labostrex119 --pass-through | python $TOOLS/plot/convergence.py --add-min-max --ymin 80 --ymax 92.5 --yaxis test-accuracy --labels 'fully-connected' 'd-cliques (fc) w/ cliq-avg' --save-figure ../mlsys2022style/figures/convergence-speed-mnist-dc-fc-vs-fc-2-shards-per-node.png --linestyles 'solid' 'dashed' --font-size 18 --linewidth 3
% CIFAR10
% python $TOOLS/analyze/filter.py all --dataset:name cifar10 --topology:name fully-connected d-cliques/greedy-swap --nodes:name 2-shards-eq-classes --meta:seed 1 --nodes:nb-nodes 100 | python $TOOLS/analyze/diff.py
% python $TOOLS/analyze/diff.py --rundirs all/2021-10-02-18:58:22-CEST-labostrex114 all/2021-10-03-19:53:21-CEST-labostrex117 --pass-through | python $TOOLS/plot/convergence.py --add-min-max --ymin 0 --ymax 100 --yaxis test-accuracy --labels 'fully-connected w/ mom.' 'd-cliques (fc) w/ c-avg, mom.' --save-figure ../mlsys2022style/figures/convergence-speed-cifar10-dc-fc-vs-fc-2-shards-per-node.png --linestyles 'solid' 'dashed' --legend 'lower right' --font-size 18
% python $TOOLS/analyze/diff.py --rundirs all/2021-10-02-18:58:22-CEST-labostrex114 all/2021-10-03-19:53:21-CEST-labostrex117 --pass-through | python $TOOLS/plot/convergence.py --add-min-max --ymin 0 --ymax 100 --yaxis test-accuracy --labels 'fully-connected w/ mom.' 'd-cliques (fc) w/ c-avg, mom.' --save-figure ../mlsys2022style/figures/convergence-speed-cifar10-dc-fc-vs-fc-2-shards-per-node.png --linestyles 'solid' 'dashed' --legend 'lower right' --font-size 18 --linewidth 3
\begin{figure}[htbp]
\centering
......@@ -98,45 +119,42 @@ mini-batch size, both approaches are equivalent.
\includegraphics[width=\textwidth]{figures/convergence-speed-cifar10-dc-fc-vs-fc-2-shards-per-node}
\caption{\label{fig:convergence-speed-cifar10-dc-fc-vs-fc-2-shards-per-node} CIFAR10}
\end{subfigure}
\caption{\label{fig:convergence-speed-dc-vs-fc-2-shards-per-node} Convergence Speed of D-Cliques constructed with Greedy Swap Compared to Fully-Connected on 100 Nodes (2 shards/node). The bold line is the average accuracy over all nodes; the thinner upper and lower lines are the maximum and minimum accuracy over all nodes.}
\end{figure}
\subsection{Clique Averaging and Momentum are Necessary}
Figure~\ref{fig:d-clique-mnist-clique-avg} shows that Clique Averaging (Alg.~\ref{Algorithm:Clique-Unbiased-D-PSGD})
reduces the variance of models across nodes and accelerates
convergence on MNIST. Note that Clique Averaging induces a small
additional cost, as gradients
and models need to be sent in two separate rounds of messages.
Nonetheless, compared to fully connecting all nodes, the total number
of messages is reduced by $\approx 80\%$.
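The decoupling performed by Clique Averaging can be sketched in a few lines. This is an illustrative simplification, not the paper's exact algorithm: it assumes scalar or vector-like models and uniform mixing weights, whereas the actual algorithm may weight neighbors differently.

```python
def clique_unbiased_dpsgd_round(models, grads, cliques, neighbors, lr=0.1):
    """One simulated round of D-SGD with Clique Averaging (sketch).

    models:    dict node -> parameters (float or array-like)
    grads:     dict node -> local mini-batch gradient
    cliques:   dict node -> list of clique members (incl. the node itself)
    neighbors: dict node -> list of all neighbors (incl. the node itself)
    """
    # Phase 1: average gradients over the clique only. Since the union of
    # clique datasets is approximately balanced, this debiases the update
    # with respect to local label skew.
    g = {i: sum(grads[j] for j in cliques[i]) / len(cliques[i]) for i in models}

    # Local SGD step with the clique-averaged gradient.
    half_step = {i: models[i] - lr * g[i] for i in models}

    # Phase 2: model averaging over *all* neighbors, including inter-clique
    # edges (uniform weights here, an assumption for brevity).
    return {i: sum(half_step[j] for j in neighbors[i]) / len(neighbors[i])
            for i in models}
```

The two phases correspond to the two separate rounds of messages mentioned above: gradients are exchanged within the clique first, then models are exchanged with all neighbors.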
% From directory 'results-v2':
% MNIST
% python $TOOLS/analyze/filter.py all --dataset:name mnist --topology:name d-cliques/greedy-swap --nodes:name 2-shards-uneq-classes --meta:seed 1 --nodes:nb-nodes 100 | python $TOOLS/analyze/diff.py
% python $TOOLS/analyze/diff.py --rundirs all/2021-09-29-03:53:42-CEST-labostrex119 all/2021-09-28-23:18:49-CEST-labostrex119 --pass-through | python $TOOLS/plot/convergence.py --add-min-max --ymin 89 --ymax 92.5 --yaxis test-accuracy --labels 'd-cliques w/o c-avg.' 'd-cliques w/ c-avg.' --save-figure ../mlsys2022style/figures/convergence-speed-mnist-dc-no-c-avg-vs-c-avg-2-shards-per-node.png --linestyles 'solid' 'dashed' --font-size 18
% python $TOOLS/analyze/diff.py --rundirs all/2021-09-29-03:53:42-CEST-labostrex119 all/2021-09-28-23:18:49-CEST-labostrex119 --pass-through | python $TOOLS/plot/convergence.py --add-min-max --ymin 89 --ymax 92.5 --yaxis test-accuracy --labels 'd-cliques (fc) w/o c-avg.' 'd-cliques (fc) w/ c-avg.' --save-figure ../mlsys2022style/figures/convergence-speed-mnist-dc-no-c-avg-vs-c-avg-2-shards-per-node.png --linestyles 'solid' 'dashed' --font-size 18 --linewidth 2.5
\begin{figure}[htbp]
\centering
\includegraphics[width=0.23\textwidth]{figures/convergence-speed-mnist-dc-no-c-avg-vs-c-avg-2-shards-per-node}
\caption{\label{fig:d-clique-mnist-clique-avg} Effect of Clique Averaging on MNIST. Y axis starts at 89.}
\end{figure}
Figure~\ref{fig:cifar10-c-avg-momentum} shows the interaction between
Clique Averaging and Momentum on CIFAR10. Without Clique Averaging,
the use of momentum is actually detrimental. With Clique Averaging, the
situation reverses and momentum is again beneficial. The combination
of both has the fastest convergence speed and the lowest variance among all
four possibilities.
% CIFAR10
% python $TOOLS/analyze/filter.py all --dataset:name cifar10 --topology:name d-cliques/greedy-swap --nodes:name 2-shards-eq-classes --meta:seed 1 --nodes:nb-nodes 100 | python $TOOLS/analyze/diff.py
% w/o Clique Averaging
% python $TOOLS/analyze/diff.py --rundirs all/2021-10-03-23:37:42-CEST-labostrex117 all/2021-10-04-03:13:46-CEST-labostrex117 --pass-through | python $TOOLS/plot/convergence.py --add-min-max --ymin 0 --ymax 100 --yaxis test-accuracy --labels 'd-cliques (fc) w/o momentum' 'd-cliques (fc) w/ momentum' --save-figure ../mlsys2022style/figures/convergence-speed-cifar10-wo-c-avg-no-mom-vs-mom-2-shards-per-node.png --linestyles 'solid' 'dashed' --font-size 18 --linewidth 3
% w/ Clique Averaging
% python $TOOLS/analyze/diff.py --rundirs all/2021-10-03-16:10:34-CEST-labostrex117 all/2021-10-03-19:53:21-CEST-labostrex117 --pass-through | python $TOOLS/plot/convergence.py --add-min-max --ymin 0 --ymax 100 --yaxis test-accuracy --labels 'd-cliques (fc) w/o momentum' 'd-cliques (fc) w/ momentum' --save-figure ../mlsys2022style/figures/convergence-speed-cifar10-w-c-avg-no-mom-vs-mom-2-shards-per-node.png --linestyles 'solid' 'dashed' --font-size 18 --linewidth 3 --legend 'lower right'
\begin{figure}[htbp]
\centering
......@@ -154,23 +172,20 @@ and models need to be sent in two separate rounds of messages. Nonetheless, comp
\caption{\label{fig:cifar10-c-avg-momentum} Effect of Clique Averaging and Momentum on CIFAR10 with LeNet.}
\end{figure}
\subsection{D-Cliques Converge Faster than Random Graphs}
\autoref{fig:d-cliques-comparison-to-non-clustered-topologies} shows that D-Cliques, even
without Clique Averaging or Momentum, converge faster and with lower variance than a
random graph with a similar number of edges (10) per node; a careful
design of the topology is therefore indeed necessary.
% From directory 'results-v2':
% MNIST
% python $TOOLS/analyze/filter.py all --dataset:name mnist --topology:name random-graph d-cliques/greedy-swap greedy-neighbourhood-swap --nodes:name 2-shards-uneq-classes --meta:seed 1 --nodes:nb-nodes 100 | python $TOOLS/analyze/diff.py
% python $TOOLS/analyze/diff.py --rundirs all/2021-09-29-03:53:42-CEST-labostrex119 all/2021-09-29-22:17:08-CEST-labostrex118 --pass-through | python $TOOLS/plot/convergence.py --add-min-max --ymin 80 --ymax 92.5 --yaxis test-accuracy --labels 'd-cliques (fc) w/o cliq-avg' 'random 10' --save-figure ../mlsys2022style/figures/convergence-mnist-random-vs-d-cliques-2-shards.png --linestyles 'solid' 'dashed' --font-size 18 --linewidth 3
% CIFAR10
% python $TOOLS/analyze/filter.py all --dataset:name cifar10 --topology:name d-cliques/greedy-swap random-graph --nodes:nb-nodes 100 --algorithm:learning-momentum 0.9 | python $TOOLS/analyze/diff.py
% python $TOOLS/analyze/diff.py --rundirs all/2021-10-03-23:37:42-CEST-labostrex117 all/2021-10-05-18:38:30-CEST-labostrex115 --pass-through | python $TOOLS/plot/convergence.py --add-min-max --ymin 0 --ymax 100 --yaxis test-accuracy --labels 'd-cliques (fc) w/o cliq-avg w/o mom.' 'random 10 w/o mom.' --save-figure ../mlsys2022style/figures/convergence-cifar10-random-vs-d-cliques-2-shards.png --linestyles 'solid' 'dashed' --font-size 18 --linewidth 3
\begin{figure}[htbp]
\centering
......@@ -188,23 +203,47 @@ with the centralized setting.
\caption{\label{fig:convergence-random-vs-d-cliques-2-shards} Comparison to Random Graph with 10 edges per node \textit{without} Clique Averaging or Momentum (see main text for justification).}
\end{figure}
In comparison to a random graph, however, D-Cliques provide additional benefits: they ensure
a diverse representation of all classes in the immediate neighborhood of all nodes; they enable
Clique Averaging to debias gradients; and they provide a high level of clustering, i.e., neighbors
of a node tend to be neighbors of each other, which tends to lower variance.
To distinguish the effect of the first two from the last, we compare D-Cliques to other variations
of random graphs: (1) with the additional constraint that all classes should be represented in the immediate neighborhood of all nodes
(i.e. 'all classes repr.'), and (2) in combination with unbiased gradients computed using
the average of the gradients of all neighbors for all nodes. Satisfying the first constraint while obtaining
the same skew as D-Cliques built with Greedy Swap was challenging with the current partitioning scheme, so
we performed the experiments in a more heterogeneous setting in which each node has examples of only 1 class.
In this setting, it is easy to construct cliques and a random graph such that the neighborhood of each node in both
cases has a skew of 0.
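The 'all classes repr.' constraint on a random graph can be verified with a simple check. This is an illustrative helper, not part of the experimental tooling:

```python
def all_classes_represented(neighbors, node_classes, num_classes=10):
    """Check that every class appears in the immediate neighborhood of every
    node (the 'all classes repr.' constraint; names are illustrative).

    neighbors:    dict node -> iterable of neighbor ids (node itself included)
    node_classes: dict node -> set of classes present in the node's local data
    """
    for node, nbrs in neighbors.items():
        covered = set()
        for j in nbrs:
            covered |= node_classes[j]
        if len(covered) < num_classes:
            return False  # some class is missing around this node
    return True
```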
\begin{figure}[htbp]
\centering
\begin{subfigure}[b]{0.23\textwidth}
\centering
% From directory results/mnist:
% python ../../../../Software/non-iid-topology-simulator/tools/v1/plot_convergence.py fully-connected-cliques/all/2021-03-10-10:19:44-CET no-init-no-clique-avg/fully-connected-cliques/all/2021-03-12-11:12:49-CET random-10/all/2021-07-23-11:59:56-CEST random-10-diverse/all/2021-03-17-20:28:35-CET --labels 'd-clique (fcc)' 'd-clique (fcc) no clique avg.' '10 random edges' '10 random edges (all classes represented)' --add-min-max --legend 'lower right' --ymin 80 --ymax 92.5 --yaxis test-accuracy --save-figure ../../mlsys2022style/figures/d-cliques-mnist-linear-comparison-to-non-clustered-topologies.png --font-size 13 --linestyles 'solid' 'dashed' 'dotted' 'dashdot' --linewidth 3
\includegraphics[width=\textwidth]{figures/d-cliques-mnist-linear-comparison-to-non-clustered-topologies}
\caption{MNIST}
\end{subfigure}
\hfill
\begin{subfigure}[b]{0.23\textwidth}
\centering
% To regenerate the figure, from directory results/cifar10
% python ../../../../Software/non-iid-topology-simulator/tools/v1/plot_convergence.py no-init/fully-connected-cliques/all/2021-03-13-18:32:55-CET no-init-no-clique-avg/fully-connected-cliques/all/2021-03-13-18:34:35-CET random-10/all/2021-07-23-14:33:48-CEST random-10-diverse/all/2021-03-17-20:30:41-CET random-10-diverse-unbiased-gradient/all/2021-03-17-20:31:14-CET --labels 'd-clique (fcc) clique avg.' 'd-clique (fcc) no clique avg.' '10 random edges' '10 random edges (all classes repr.)' '10 random (all classes repr.) with unbiased grad.' --add-min-max --legend 'upper left' --yaxis test-accuracy --save-figure ../../mlsys2022style/figures/d-cliques-cifar10-linear-comparison-to-non-clustered-topologies.png --ymax 119 --font-size 13 --linestyles 'solid' 'dashed' 'dotted' 'dashdot' 'solid' --markers '' '' '' '' 'o' --linewidth 3
\includegraphics[width=\textwidth]{figures/d-cliques-cifar10-linear-comparison-to-non-clustered-topologies}
\caption{CIFAR10}
\end{subfigure}
\caption{\label{fig:d-cliques-comparison-to-non-clustered-topologies} Comparison to Random Graph with 10 edges per node, \textit{with} Clique Averaging (and Momentum) as well as analogous Neighbourhood Averaging for Random Graph \textit{in a more stringent partitioning of 1 class/node}\textsuperscript{*}.}
\footnotesize\textsuperscript{*}\textit{These results were obtained with a previous version of the simulator but should be consistent with the latest. They will be updated for the final version of the paper.}
\end{figure}
\autoref{fig:d-cliques-comparison-to-non-clustered-topologies} shows the results for MNIST and CIFAR10. In the case of MNIST,
D-Cliques converge faster than all other options. In the case of CIFAR10, the clustering appears to be critical
for good convergence speed: even a random graph with diverse neighborhoods and unbiased gradients
converges significantly slower.
%We demonstrate the advantages of D-Cliques over alternative sparse topologies
%that have a similar number of edges. First, we consider topologies in which
%the neighbors of each node are selected at random (hence without any clique
......@@ -275,92 +314,15 @@ with the centralized setting.
%data with sparse topologies requires a very careful design, as we have
%proposed with D-Cliques.
\subsection{Cliques built with Greedy Swap Converge Faster than Random Cliques}
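For intuition, the following is a rough sketch of a greedy swap heuristic, not a reproduction of Alg.~\ref{Algorithm:greedy-swap}: the skew measure used here (L1 distance between a clique's label distribution and the global distribution) and the random pair selection are both assumptions.

```python
import random
from collections import Counter

def skew(clique, labels_of, global_dist):
    """L1 distance between the clique's label distribution and the global
    distribution (one plausible skew measure; an assumption, see lead-in)."""
    counts = Counter()
    for node in clique:
        counts.update(labels_of[node])
    total = sum(counts.values())
    return sum(abs(counts[c] / total - p) for c, p in global_dist.items())

def greedy_swap(cliques, labels_of, global_dist, steps=1000, seed=1):
    """Repeatedly pick one node in each of two random cliques and keep the
    swap only if it lowers the summed skew of the two cliques (sketch)."""
    rng = random.Random(seed)
    for _ in range(steps):
        a, b = rng.sample(range(len(cliques)), 2)
        i, j = rng.randrange(len(cliques[a])), rng.randrange(len(cliques[b]))
        before = (skew(cliques[a], labels_of, global_dist)
                  + skew(cliques[b], labels_of, global_dist))
        cliques[a][i], cliques[b][j] = cliques[b][j], cliques[a][i]
        after = (skew(cliques[a], labels_of, global_dist)
                 + skew(cliques[b], labels_of, global_dist))
        if after > before:  # revert swaps that do not help
            cliques[a][i], cliques[b][j] = cliques[b][j], cliques[a][i]
    return cliques
```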
% From directory 'results-v2':
% MNIST
% python $TOOLS/analyze/filter.py all --dataset:name mnist --topology:name d-cliques/random-cliques d-cliques/greedy-swap --nodes:name 2-shards-uneq-classes --meta:seed 1 --nodes:nb-nodes 100 | python $TOOLS/analyze/diff.py
% python $TOOLS/analyze/diff.py --rundirs all/2021-09-29-22:12:59-CEST-labostrex114 all/2021-09-28-23:18:49-CEST-labostrex119 --pass-through | python $TOOLS/plot/convergence.py --add-min-max --ymin 80 --ymax 92.5 --yaxis test-accuracy --labels 'd-cliques random' 'd-cliques greedy-swap' --save-figure ../mlsys2022style/figures/convergence-speed-mnist-dc-random-vs-dc-gs-2-shards-per-node.png --linestyles 'solid' 'dashed' --font-size 18
% CIFAR10
% python $TOOLS/analyze/filter.py all --dataset:name cifar10 --topology:name d-cliques/random-cliques d-cliques/greedy-swap --nodes:name 2-shards-eq-classes --meta:seed 1 --nodes:nb-nodes 100 | python $TOOLS/analyze/diff.py
% python $TOOLS/analyze/diff.py --rundirs all/2021-10-04-21:18:33-CEST-labostrex117 all/2021-10-03-19:53:21-CEST-labostrex117 --pass-through | python $TOOLS/plot/convergence.py --add-min-max --ymin 0 --ymax 100 --yaxis test-accuracy --labels 'd-cliques random' 'd-cliques greedy-swap' --save-figure ../mlsys2022style/figures/convergence-speed-cifar10-dc-random-vs-dc-gs-2-shards-per-node.png --linestyles 'solid' 'dashed' --font-size 18
\begin{figure}[htbp]
\centering
\begin{subfigure}[b]{0.23\textwidth}
\centering
\includegraphics[width=\textwidth]{figures/convergence-speed-mnist-dc-random-vs-dc-gs-2-shards-per-node}
\caption{\label{fig:convergence-speed-mnist-dc-random-vs-dc-gs-2-shards-per-node} MNIST}
\end{subfigure}
\hfill
\begin{subfigure}[b]{0.23\textwidth}
\centering
\includegraphics[width=\textwidth]{figures/convergence-speed-cifar10-dc-random-vs-dc-gs-2-shards-per-node}
\caption{\label{fig:convergence-speed-cifar10-dc-random-vs-dc-gs-2-shards-per-node} CIFAR10}
\end{subfigure}
\caption{\label{fig:convergence-speed-dc-random-vs-dc-gs-2-shards-per-node} Convergence Speed of D-Cliques constructed Randomly vs Greedy Swap on 100 Nodes (2 shards/node).}
\end{figure}
% From directory 'results-v2':
% MNIST
% python $TOOLS/analyze/filter.py all --dataset:name mnist --topology:name d-cliques/greedy-swap --nodes:name 2-shards-uneq-classes --meta:seed 1 --nodes:nb-nodes 100 | python $TOOLS/analyze/diff.py
% w/o Clique Gradient
% python $TOOLS/analyze/diff.py --rundirs all/2021-09-29-03:53:42-CEST-labostrex119 all/2021-10-01-21:44:14-CEST-labostrex113 all/2021-10-02-06:53:40-CEST-labostrex113 --pass-through | python $TOOLS/plot/convergence.py --add-min-max --ymin 89 --ymax 92.5 --yaxis test-accuracy --labels 'full intra-connectivity' '-1 edge/clique' '-5 edges/clique' --save-figure ../mlsys2022style/figures/d-cliques-mnist-wo-clique-avg-impact-of-edge-removal.png --linestyles 'solid' 'dashed' 'dotted' --font-size 18
% w/ Clique Gradient
% python $TOOLS/analyze/diff.py --rundirs all/2021-09-28-23:18:49-CEST-labostrex119 all/2021-10-01-17:08:42-CEST-labostrex113 all/2021-10-02-02:17:43-CEST-labostrex113 --pass-through | python $TOOLS/plot/convergence.py --add-min-max --ymin 89 --ymax 92.5 --yaxis test-accuracy --labels 'full intra-connectivity' '-1 edge/clique' '-5 edges/clique' --save-figure ../mlsys2022style/figures/d-cliques-mnist-w-clique-avg-impact-of-edge-removal.png --linestyles 'solid' 'dashed' 'dotted' --font-size 18
\begin{figure}[htbp]
\centering
\begin{subfigure}[htbp]{0.23\textwidth}
\centering
\includegraphics[width=\textwidth]{figures/d-cliques-mnist-wo-clique-avg-impact-of-edge-removal}
\caption{\label{fig:d-cliques-mnist-wo-clique-avg-impact-of-edge-removal} Without Clique Averaging }
\end{subfigure}
\hfill
\begin{subfigure}[htbp]{0.23\textwidth}
\centering
\includegraphics[width=\textwidth]{figures/d-cliques-mnist-w-clique-avg-impact-of-edge-removal}
\caption{\label{fig:d-cliques-mnist-w-clique-avg-impact-of-edge-removal} With Clique Averaging}
\end{subfigure}
\caption{\label{fig:d-cliques-mnist-intra-connectivity} MNIST: Impact of Intra-clique Connectivity Failures. Y axis starts at 89.}
\end{figure}
% CIFAR10
% python $TOOLS/analyze/filter.py all --dataset:name cifar10 --topology:name d-cliques/greedy-swap --nodes:name 2-shards-eq-classes --meta:seed 1 --nodes:nb-nodes 100 | python $TOOLS/analyze/diff.py
% w/o Clique Gradient
% python $TOOLS/analyze/diff.py --rundirs all/2021-10-04-03:13:46-CEST-labostrex117 all/2021-10-06-17:58:49-CEST-labostrex112 all/2021-10-06-17:45:22-CEST-labostrex115 --pass-through | python $TOOLS/plot/convergence.py --add-min-max --ymin 0 --ymax 80 --yaxis test-accuracy --labels 'full intra-connectivity' '-1 edge/clique' '-5 edges/clique' --save-figure ../mlsys2022style/figures/d-cliques-cifar10-wo-clique-avg-impact-of-edge-removal.png --linestyles 'solid' 'dashed' 'dotted' --font-size 18
% w/ Clique Gradient
% python $TOOLS/analyze/diff.py --rundirs all/2021-10-03-19:53:21-CEST-labostrex117 all/2021-10-06-12:46:49-CEST-labostrex112 all/2021-10-06-12:49:51-CEST-labostrex115 --pass-through | python $TOOLS/plot/convergence.py --add-min-max --ymin 0 --ymax 80 --yaxis test-accuracy --labels 'full intra-connectivity' '-1 edge/clique' '-5 edges/clique' --save-figure ../mlsys2022style/figures/d-cliques-cifar10-w-clique-avg-impact-of-edge-removal.png --linestyles 'solid' 'dashed' 'dotted' --font-size 18
\begin{figure}[htbp]
\centering
\begin{subfigure}[htbp]{0.23\textwidth}
\centering
\includegraphics[width=\textwidth]{figures/d-cliques-cifar10-wo-clique-avg-impact-of-edge-removal}
\caption{\label{fig:d-cliques-cifar10-wo-clique-avg-impact-of-edge-removal} Without Clique Averaging }
\end{subfigure}
\hfill
\begin{subfigure}[htbp]{0.23\textwidth}
\centering
\includegraphics[width=\textwidth]{figures/d-cliques-cifar10-w-clique-avg-impact-of-edge-removal}
\caption{\label{fig:d-cliques-cifar10-w-clique-avg-impact-of-edge-removal} With Clique Averaging}
\end{subfigure}
\caption{\label{fig:d-cliques-cifar10-intra-connectivity} CIFAR: Impact of Intra-clique Connectivity Failures (with Momentum).}
\end{figure}
\subsection{D-Cliques Scale with Sparser Inter-Clique Topologies}
In this last series of experiments, we evaluate the effect of choosing sparser
inter-clique topologies on the convergence speed for a larger network of 1000
nodes. We compare the scalability and convergence speed of the several
D-Cliques variants introduced in Section~\ref{section:interclique-topologies}.
\autoref{fig:d-cliques-scaling-mnist-1000} and \autoref{fig:d-cliques-scaling-cifar10-1000}
show the convergence speed of all sparse inter-clique topologies of Section~\ref{section:interclique-topologies},
on MNIST and CIFAR10 respectively, compared to the ideal baseline of a
single IID node performing the same number of updates per epoch (representing
the fastest convergence speed achievable if the topology had no impact). Among the linear schemes, the ring
topology converges but is much slower than our fractal scheme. Among the super-linear schemes, the small-world
......@@ -373,10 +335,9 @@ fully-connected topology still offers
significant benefits with 1000 nodes, as it represents a 98\% reduction in the
number of edges compared to fully connecting individual nodes (18.9 edges on
average instead of 999) and a 96\% reduction in the number of messages (37.8
messages per round per node on average instead of 999).
%We refer to Appendix~\ref{app:scaling} for additional results comparing the convergence speed across different number of nodes.
Overall, these results show that D-Cliques can nicely scale with the number of nodes.
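The edge and message reductions quoted above follow from simple counting. The
sketch below (assuming, as in our setup, 1000 nodes grouped into fully-connected
cliques of size 10 with one inter-clique edge per clique pair; function names
are illustrative) reproduces the 18.9 edges and 37.8 messages per node:

```python
from math import comb

def dcliques_stats(n_nodes=1000, clique_size=10):
    """Edge/message counts for D-Cliques with fully-connected cliques
    and a fully-connected inter-clique topology (one edge per clique pair)."""
    n_cliques = n_nodes // clique_size
    intra = n_cliques * comb(clique_size, 2)    # edges inside cliques
    inter = comb(n_cliques, 2)                  # edges between clique pairs
    avg_degree = 2 * (intra + inter) / n_nodes  # edges incident to each node
    messages = 2 * avg_degree                   # one send + one receive per edge per round
    return avg_degree, messages

deg, msgs = dcliques_stats()
print(deg, msgs)                      # 18.9 edges, 37.8 messages per node
print(round(100 * (1 - deg / 999)),   # 98 (% fewer edges than fully connecting nodes)
      round(100 * (1 - msgs / 999)))  # 96 (% fewer messages)
```

The same counting shows why the scheme scales: intra-clique edges grow linearly
with the number of nodes, and inter-clique edges only with the square of the
(10x smaller) number of cliques.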
% From directory 'results-v2':
% MNIST
%\caption{\label{fig:d-cliques-cifar10-convolutional} D-Cliques Convergence Speed with 1000 nodes, non-IID, Constant Updates per Epoch, with Different Inter-Clique Topologies.}
%\end{figure}
\subsection{D-Cliques Can Tolerate Some Intra-Connectivity Failures}
We measured the impact of randomly removing 1 and 5 intra-clique edges per
clique to assess how critical full connectivity is within cliques.
\autoref{fig:d-cliques-mnist-intra-connectivity} shows that for MNIST, when
Clique Averaging is not used, removing edges slightly decreases the convergence
speed and increases the variance between nodes. With Clique Averaging, however,
even removing 5 edges per clique has very little effect on
the convergence speed.
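The failure model used in these experiments is straightforward to reproduce. As
a minimal sketch (the function name and seed are illustrative, not taken from
our tooling), the following removes $k$ random edges inside each clique, as done
here with $k \in \{1, 5\}$:

```python
import random
from itertools import combinations

def remove_intra_edges(cliques, k, seed=1):
    """Drop k random edges inside each clique, simulating intra-clique
    connectivity failures; returns the surviving intra-clique edges."""
    rng = random.Random(seed)
    kept = set()
    for clique in cliques:
        edges = list(combinations(sorted(clique), 2))
        rng.shuffle(edges)
        kept.update(edges[k:])  # keep all but k edges in this clique
    return kept

# 100 nodes in 10 fully-connected cliques of size 10.
cliques = [list(range(i, i + 10)) for i in range(0, 100, 10)]
print(len(remove_intra_edges(cliques, k=0)))  # 450 = 10 * C(10, 2)
print(len(remove_intra_edges(cliques, k=5)))  # 400: five fewer edges per clique
```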
% From directory 'results-v2':
% MNIST
% python $TOOLS/analyze/filter.py all --dataset:name mnist --topology:name d-cliques/greedy-swap --nodes:name 2-shards-uneq-classes --meta:seed 1 --nodes:nb-nodes 100 | python $TOOLS/analyze/diff.py
% w/o Clique Gradient
% python $TOOLS/analyze/diff.py --rundirs all/2021-09-29-03:53:42-CEST-labostrex119 all/2021-10-01-21:44:14-CEST-labostrex113 all/2021-10-02-06:53:40-CEST-labostrex113 --pass-through | python $TOOLS/plot/convergence.py --add-min-max --ymin 89 --ymax 92.5 --yaxis test-accuracy --labels 'full intra-connectivity' '-1 edge/clique' '-5 edges/clique' --save-figure ../mlsys2022style/figures/d-cliques-mnist-wo-clique-avg-impact-of-edge-removal.png --linestyles 'solid' 'dashed' 'dotted' --font-size 18 --linewidth 3
% w/ Clique Gradient
% python $TOOLS/analyze/diff.py --rundirs all/2021-09-28-23:18:49-CEST-labostrex119 all/2021-10-01-17:08:42-CEST-labostrex113 all/2021-10-02-02:17:43-CEST-labostrex113 --pass-through | python $TOOLS/plot/convergence.py --add-min-max --ymin 89 --ymax 92.5 --yaxis test-accuracy --labels 'full intra-connectivity' '-1 edge/clique' '-5 edges/clique' --save-figure ../mlsys2022style/figures/d-cliques-mnist-w-clique-avg-impact-of-edge-removal.png --linestyles 'solid' 'dashed' 'dotted' --font-size 18 --linewidth 3
\begin{figure}[htbp]
\centering
\begin{subfigure}[htbp]{0.23\textwidth}
\centering
\includegraphics[width=\textwidth]{figures/d-cliques-mnist-wo-clique-avg-impact-of-edge-removal}
\caption{\label{fig:d-cliques-mnist-wo-clique-avg-impact-of-edge-removal} Without Clique Averaging }
\end{subfigure}
\hfill
\begin{subfigure}[htbp]{0.23\textwidth}
\centering
\includegraphics[width=\textwidth]{figures/d-cliques-mnist-w-clique-avg-impact-of-edge-removal}
\caption{\label{fig:d-cliques-mnist-w-clique-avg-impact-of-edge-removal} With Clique Averaging}
\end{subfigure}
\caption{\label{fig:d-cliques-mnist-intra-connectivity} MNIST: Impact of Intra-clique Connectivity Failures. Y axis starts at 89.}
\end{figure}
\autoref{fig:d-cliques-cifar10-intra-connectivity} shows that for CIFAR10 the
impact is stronger. We show the results with momentum, both with and without
Clique Averaging, as momentum is critical for obtaining the best convergence
speed. Without Clique Averaging, removing edges has a small effect on
convergence speed and variance, but convergence is too slow to be practical.
With Clique Averaging, removing a single edge per clique has a small effect,
but the impact becomes significant when removing 5 edges per clique. D-Cliques
can therefore tolerate some connectivity failures between clique members,
but the number of tolerable failures depends on the dataset and model being trained.
% CIFAR10
% python $TOOLS/analyze/filter.py all --dataset:name cifar10 --topology:name d-cliques/greedy-swap --nodes:name 2-shards-eq-classes --meta:seed 1 --nodes:nb-nodes 100 | python $TOOLS/analyze/diff.py
% w/o Clique Gradient
% python $TOOLS/analyze/diff.py --rundirs all/2021-10-04-03:13:46-CEST-labostrex117 all/2021-10-06-17:58:49-CEST-labostrex112 all/2021-10-06-17:45:22-CEST-labostrex115 --pass-through | python $TOOLS/plot/convergence.py --add-min-max --ymin 0 --ymax 80 --yaxis test-accuracy --labels 'full intra-connectivity' '-1 edge/clique' '-5 edges/clique' --save-figure ../mlsys2022style/figures/d-cliques-cifar10-wo-clique-avg-impact-of-edge-removal.png --linestyles 'solid' 'dashed' 'dotted' --font-size 18 --linewidth 3
% w/ Clique Gradient
% python $TOOLS/analyze/diff.py --rundirs all/2021-10-03-19:53:21-CEST-labostrex117 all/2021-10-06-12:46:49-CEST-labostrex112 all/2021-10-06-12:49:51-CEST-labostrex115 --pass-through | python $TOOLS/plot/convergence.py --add-min-max --ymin 0 --ymax 80 --yaxis test-accuracy --labels 'full intra-connectivity' '-1 edge/clique' '-5 edges/clique' --save-figure ../mlsys2022style/figures/d-cliques-cifar10-w-clique-avg-impact-of-edge-removal.png --linestyles 'solid' 'dashed' 'dotted' --font-size 18 --linewidth 3
\begin{figure}[htbp]
\centering
\begin{subfigure}[htbp]{0.23\textwidth}
\centering
\includegraphics[width=\textwidth]{figures/d-cliques-cifar10-wo-clique-avg-impact-of-edge-removal}
\caption{\label{fig:d-cliques-cifar10-wo-clique-avg-impact-of-edge-removal} Without Clique Averaging }
\end{subfigure}
\hfill
\begin{subfigure}[htbp]{0.23\textwidth}
\centering
\includegraphics[width=\textwidth]{figures/d-cliques-cifar10-w-clique-avg-impact-of-edge-removal}
\caption{\label{fig:d-cliques-cifar10-w-clique-avg-impact-of-edge-removal} With Clique Averaging}
\end{subfigure}
\caption{\label{fig:d-cliques-cifar10-intra-connectivity} CIFAR: Impact of Intra-clique Connectivity Failures (with Momentum).}
\end{figure}
\subsection{Greedy Swap Improves Random Cliques at an Affordable Cost}
\label{section:greedy-swap-vs-random-cliques}
In the next two subsections, we compare cliques built with Greedy Swap
(Alg.~\ref{Algorithm:greedy-swap}) to random cliques, in terms of their
quality (skew), the cost of their construction, and their convergence speed.
\subsubsection{Cliques with Low Skew can be Constructed Efficiently with Greedy Swap}
\label{section:cost-cliques}
We compared the final average skew of 10 cliques, created either randomly or
with Greedy Swap, over 100 experiments of 1000 steps each.
\autoref{fig:final-skew-distribution} shows, in the form of a histogram, that
Greedy Swap generates cliques of significantly lower skew, close to 0 in the majority of cases for both MNIST and CIFAR10.
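To make the skew measure and the swapping mechanism concrete, the following
Python sketch (an illustration in the spirit of Greedy Swap, not a verbatim
transcription of Algorithm~\ref{Algorithm:greedy-swap}; all names are ours)
measures a clique's skew as the L1 distance between its class distribution and
the global one, then greedily applies the pairwise node swap that most reduces
it:

```python
import random
from collections import Counter

def skew(clique, labels, global_dist):
    """L1 distance between a clique's class distribution and the global one."""
    counts = Counter(labels[n] for n in clique)
    return sum(abs(counts[c] / len(clique) - p) for c, p in global_dist.items())

def greedy_swap_step(cliques, labels, global_dist, rng):
    """Pick two cliques at random and apply the single node swap between
    them that most reduces their combined skew (no swap if none helps)."""
    i, j = rng.sample(range(len(cliques)), 2)
    a, b = cliques[i], cliques[j]
    best = skew(a, labels, global_dist) + skew(b, labels, global_dist)
    best_swap = None
    for x in range(len(a)):
        for y in range(len(b)):
            a[x], b[y] = b[y], a[x]  # tentatively swap nodes a[x] and b[y]
            s = skew(a, labels, global_dist) + skew(b, labels, global_dist)
            if s < best:
                best, best_swap = s, (x, y)
            a[x], b[y] = b[y], a[x]  # undo the tentative swap
    if best_swap is not None:
        x, y = best_swap
        a[x], b[y] = b[y], a[x]

# Toy run: 100 nodes, 2 balanced classes, 10 random cliques of size 10.
rng = random.Random(1)
labels = [n % 2 for n in range(100)]
nodes = list(range(100))
rng.shuffle(nodes)
cliques = [nodes[i:i + 10] for i in range(0, 100, 10)]
gdist = {0: 0.5, 1: 0.5}
before = sum(skew(c, labels, gdist) for c in cliques)
for _ in range(200):
    greedy_swap_step(cliques, labels, gdist, rng)
after = sum(skew(c, labels, gdist) for c in cliques)
print(after <= before)  # True: a step only swaps when it lowers skew
```

Because a step modifies only the sampled clique pair and only when the swap
lowers their combined skew, the total skew is non-increasing over time, which
matches the monotone decrease observed in our experiments.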
% MNIST
% python $TOOLS/plot/skew/final-distribution.py --rundirs skews-mnist/* --save-figure ../mlsys2022style/figures/final-skew-distribution-mnist.png --labels 'Greedy Swap' 'Random Cliques' --linewidth 2.5 --font-size 18 --linestyles 'solid' 'dashed'
% CIFAR10
\begin{figure}[htbp]
\centering
\begin{subfigure}[b]{0.23\textwidth}
\centering
\includegraphics[width=\textwidth]{figures/final-skew-distribution-mnist}
\caption{MNIST}
\end{subfigure}
\hfill
\begin{subfigure}[b]{0.23\textwidth}
\centering
\includegraphics[width=\textwidth]{figures/final-skew-distribution-cifar10}
\caption{CIFAR10}
\end{subfigure}
\caption{\label{fig:final-skew-distribution} Final Quality of Cliques (Skew) with a Maximum Size of 10 over 100 Experiments.}
\end{figure}
\autoref{fig:skew-convergence-speed-2-shards} shows that such a low skew can be
achieved in less than 400 steps for both MNIST and CIFAR10, which in practice
takes less than 6 seconds in Python 3.7 on a MacBook Pro 2020 for a network of
100 nodes and cliques of size 10. Greedy Swap is therefore fast and efficient.
The figure also illustrates that an unbalanced number of examples between
classes makes the construction of cliques with low skew harder and slower.
%python $TOOLS/analyze/filter.py skews --topology:name d-cliques/greedy-swap | python $TOOLS/plot/skew/convergence.py --max-steps 400 --labels 'MNIST (unbalanced classes)' 'CIFAR10 (balanced classes)' --linewidth 2.5 --save-figure ../mlsys2022style/figures/skew-convergence-speed-2-shards.png
\begin{figure}[htbp]
\centering
\includegraphics[width=0.3\textwidth]{figures/skew-convergence-speed-2-shards}
\caption{\label{fig:skew-convergence-speed-2-shards} Speed of Skew Decrease during Clique Construction. The bold line is the average over 100 experiments and 10 cliques per experiment; thin lines are the minimum and maximum over all experiments. In wall-clock time, 1000 steps take less than 6 seconds in Python 3.7 on a MacBook Pro 2020.}
\end{figure}
\subsubsection{Cliques built with Greedy Swap Converge Faster than Random Cliques}
\autoref{fig:convergence-speed-dc-random-vs-dc-gs-2-shards-per-node} compares
the convergence speed of cliques optimized with Greedy Swap for 1000 steps to
that of cliques built randomly (equivalent to Greedy Swap with 0 steps). For
both MNIST and CIFAR10, convergence speed increases significantly and the
variance between nodes decreases dramatically. Lowering the skew of cliques
is therefore critical for convergence speed.
% From directory 'results-v2':
% MNIST
% python $TOOLS/analyze/filter.py all --dataset:name mnist --topology:name d-cliques/random-cliques d-cliques/greedy-swap --nodes:name 2-shards-uneq-classes --meta:seed 1 --nodes:nb-nodes 100 | python $TOOLS/analyze/diff.py
% python $TOOLS/analyze/diff.py --rundirs all/2021-09-29-22:12:59-CEST-labostrex114 all/2021-09-28-23:18:49-CEST-labostrex119 --pass-through | python $TOOLS/plot/convergence.py --add-min-max --ymin 80 --ymax 92.5 --yaxis test-accuracy --labels 'd-cliques random' 'd-cliques greedy-swap' --save-figure ../mlsys2022style/figures/convergence-speed-mnist-dc-random-vs-dc-gs-2-shards-per-node.png --linestyles 'solid' 'dashed' --font-size 18 --linewidth 3
% CIFAR10
% python $TOOLS/analyze/filter.py all --dataset:name cifar10 --topology:name d-cliques/random-cliques d-cliques/greedy-swap --nodes:name 2-shards-eq-classes --meta:seed 1 --nodes:nb-nodes 100 | python $TOOLS/analyze/diff.py
% python $TOOLS/analyze/diff.py --rundirs all/2021-10-04-21:18:33-CEST-labostrex117 all/2021-10-03-19:53:21-CEST-labostrex117 --pass-through | python $TOOLS/plot/convergence.py --add-min-max --ymin 0 --ymax 100 --yaxis test-accuracy --labels 'd-cliques random' 'd-cliques greedy-swap' --save-figure ../mlsys2022style/figures/convergence-speed-cifar10-dc-random-vs-dc-gs-2-shards-per-node.png --linestyles 'solid' 'dashed' --font-size 18 --linewidth 3
\begin{figure}[htbp]
\centering
\begin{subfigure}[b]{0.23\textwidth}
\centering
\includegraphics[width=\textwidth]{figures/convergence-speed-mnist-dc-random-vs-dc-gs-2-shards-per-node}
\caption{\label{fig:convergence-speed-mnist-dc-random-vs-dc-gs-2-shards-per-node} MNIST}
\end{subfigure}
\hfill
\begin{subfigure}[b]{0.23\textwidth}
\centering
\includegraphics[width=\textwidth]{figures/convergence-speed-cifar10-dc-random-vs-dc-gs-2-shards-per-node}
\caption{\label{fig:convergence-speed-cifar10-dc-random-vs-dc-gs-2-shards-per-node} CIFAR10}
\end{subfigure}
\caption{\label{fig:convergence-speed-dc-random-vs-dc-gs-2-shards-per-node} Convergence Speed of D-Cliques constructed Randomly vs Greedy Swap on 100 Nodes (2 shards/node).}
\end{figure}