Our main goal is to provide a fair comparison of the convergence speed across
different topologies and algorithmic variations, in order to validate our
design choices and
show that D-Cliques can remove much of the effects of label distribution skew.
In our study, we focus on convergence speed rather than on the final accuracy
after a fixed number of iterations.
Indeed, depending on when training is stopped, the relative
difference in final accuracy across different algorithms may vary
significantly and lead to different conclusions. Instead of relying on
somewhat arbitrary stopping points, we show the convergence curves of
generalization performance (i.e., the accuracy on the test set throughout
training), up to a point where it is clear that the different approaches have
converged, will not make significantly more progress, or essentially behave
the same.
\paragraph{Datasets.} We experiment with two datasets: MNIST~\cite{mnistWebsite} and
CIFAR10~\cite{krizhevsky2009learning}, which both have $L=10$ classes.
For MNIST, we use 50k and 10k examples from the original 60k training
set for training and validation respectively. We use all 10k examples of
...
...
of up to 4 classes. However, most nodes will have examples of 2 classes. The varying number
of classes, as well as the varying distribution of examples within a single node, makes the task
of creating cliques with low skew nontrivial.
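For concreteness, the snippet below is a minimal sketch of one way to generate such a shard-based partition; the function name \texttt{shard\_partition} and the exact shard bookkeeping are illustrative rather than a verbatim excerpt of the experimental code.
\begin{verbatim}
import numpy as np

def shard_partition(labels, n_nodes, shards_per_node=2, seed=1):
    # Sort example indices by label, split them into equal contiguous
    # shards, and assign `shards_per_node` shards to each node at random.
    # A shard straddling a class boundary contributes two classes, so a
    # node can end up with up to 4 classes, most often exactly 2.
    rng = np.random.default_rng(seed)
    shards = np.array_split(np.argsort(labels), n_nodes * shards_per_node)
    order = rng.permutation(n_nodes * shards_per_node)
    return [np.concatenate([shards[s] for s in
                            order[i * shards_per_node:(i + 1) * shards_per_node]])
            for i in range(n_nodes)]
\end{verbatim}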
\paragraph{Models.}
We
use a logistic regression classifier for MNIST, which
provides up to 92.5\% accuracy in the centralized setting.
...
...
validation set for 100 nodes, obtaining respectively $0.1$ and $128$ for
MNIST and $0.002$ and $20$ for CIFAR10.
For CIFAR10, we additionally use a momentum of $0.9$.
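For reference, these values can be summarized as a simple configuration; the dictionary below merely restates the numbers above (the name \texttt{HPARAMS} and the zero momentum for MNIST, implied by the text, are illustrative).
\begin{verbatim}
# Learning rate, mini-batch size (100 nodes) and momentum per dataset.
HPARAMS = {
    "mnist":   {"lr": 0.1,   "batch_size": 128, "momentum": 0.0},
    "cifar10": {"lr": 0.002, "batch_size": 20,  "momentum": 0.9},
}
\end{verbatim}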
\paragraph{Metrics.}
We evaluate 100- and 1000-node networks by
creating multiple models
in memory and simulating the exchange of messages between nodes.
To ignore the impact of distributed execution strategies and system
optimization techniques, we report the test accuracy of all nodes (min, max,
...
...
case of a single node sampling the full distribution.
To further make results comparable across different numbers of nodes, we scale
the batch size inversely with the number of nodes,
e.g. on MNIST, 128 with 100 nodes vs. 13 with 1000 nodes. This
ensures the same number of model updates and averaging per epoch, allowing a
fair comparison.\footnote{Updating and averaging models
after every example can eliminate the impact of label distribution skew. However, the
resulting communication overhead is impractical.}
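This rule can be stated explicitly; the helper below is a sketch (its name and rounding convention are illustrative):
\begin{verbatim}
def scaled_batch_size(ref_batch_size, ref_nodes, n_nodes):
    # Keep the total number of examples processed per step constant, so
    # that every setting performs the same number of model updates and
    # averaging rounds per epoch, regardless of the number of nodes.
    return max(1, round(ref_batch_size * ref_nodes / n_nodes))

scaled_batch_size(128, 100, 1000)  # -> 13, the MNIST value used above
\end{verbatim}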
\paragraph{Baselines.} We compare our results against an ideal baseline:
a fully-connected network topology with the same number of nodes. All other
things being equal, any other topology using fewer edges will converge
at the same speed or slower: \textit{this is therefore the most difficult and general baseline to compare against}.
This baseline is also essentially equivalent to a centralized (single) IID node using a batch size
$n$ times bigger, where $n$ is the number of nodes. Both a fully-connected network and a single IID node
effectively optimize a single model and sample
uniformly from the global distribution: both thus remove entirely the
effect of label distribution skew and of the network topology on the
optimization. In practice, we prefer a
fully-connected network because it
...
...
distribution.\footnote{We
conjecture that a heterogeneous data partition in a fully-connected network may force
a more balanced representation of all classes in the union of all mini-batches, leading to better convergence.}
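To see more formally why these two baselines coincide, consider uniform averaging weights $W_{ij} = \frac{1}{n}$ and assume that all nodes start a step from the same model $\theta^{(t)}$, which full averaging preserves. One D-SGD step (a local gradient step followed by neighborhood averaging) then gives, for every node $i$,
\[
\theta^{(t+1)} = \frac{1}{n}\sum_{j=1}^{n}\bigl(\theta^{(t)} - \eta\, g_j\bigr)
= \theta^{(t)} - \eta\,\frac{1}{n}\sum_{j=1}^{n} g_j,
\]
where $\eta$ is the learning rate and $g_j$ is the mini-batch gradient of node $j$ (notation local to this remark). The right-hand side is exactly one centralized SGD step on the union of the $n$ mini-batches, i.e. on an IID mini-batch that is $n$ times bigger.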
We also provide comparisons against other popular topologies, such as random
graphs and exponential graphs~\cite{ying2021exponential}.
\subsection{D-Cliques Match the Convergence Speed of Fully-Connected with a
Fraction of the Edges}
\label{section:d-cliques-vs-fully-connected}
...
...
momentum, on D-Cliques (fully-connected) with 10 cliques of 10 heterogeneous nodes.
\subsection{D-Cliques Converge Faster than Random and Exponential Graphs}
\label{section:d-cliques-vs-random-graphs}
In this experiment, we compare D-Cliques to a random graph and an exponential
graph~\cite{ying2021exponential}.
For the random graph, we use a similar
number of edges (10) per node to determine
whether a simple sparse topology could work equally well.
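Such a topology can, for instance, be generated with a standard random regular graph routine; the snippet below uses \texttt{networkx} purely for illustration and is not necessarily the generator used in our experiments.
\begin{verbatim}
import networkx as nx

# 10-regular random graph on 100 nodes: every node has exactly 10
# undirected edges, i.e. the "similar number of edges (10) per node"
# mentioned above.
G = nx.random_regular_graph(d=10, n=100, seed=42)
\end{verbatim}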
To ensure a fair comparison, because a random graph does not support
...
...
is detrimental, similar to D-Cliques without the use of Clique Averaging
\caption{\label{fig:convergence-random-vs-d-cliques-2-shards} Comparison on 100 heterogeneous nodes between D-Cliques (fully-connected) with 10 cliques of size 10 and a random graph with 10 edges per node \textit{without} Clique Averaging or momentum.}
\end{figure}
For the exponential graph, we follow
the deterministic construction of~\cite{ying2021exponential} and consider
edges to be undirected as required by D-SGD, leading to 14 edges per node.
\autoref{fig:convergence-expander-vs-d-cliques-2-shards} shows that D-Cliques
converge faster, despite the exponential graph having slightly more edges.
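For reference, the sketch below is our reading of this construction rather than a verbatim excerpt: neighbors are taken at offsets that are powers of two, and keeping edges undirected yields 14 distinct neighbors per node for $n=100$.
\begin{verbatim}
import math

def exponential_graph(n):
    # Static exponential graph: node i is linked to nodes (i +/- 2^k) mod n
    # for k = 0, 1, ..., ceil(log2(n)) - 1; edges are kept undirected.
    hops = [2 ** k for k in range(int(math.log2(n - 1)) + 1)]
    return {i: {(i + h) % n for h in hops} | {(i - h) % n for h in hops}
            for i in range(n)}

len(exponential_graph(100)[0])  # -> 14 undirected edges per node
\end{verbatim}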
\caption{\label{fig:convergence-expander-vs-d-cliques-2-shards} Comparison on 100 heterogeneous nodes between D-Cliques (fully-connected) with 10 cliques of size 10 and a static exponential graph with 14 undirected edges per node \textit{without} Clique Averaging or momentum.}
\end{figure}
Moreover, D-Cliques converge similarly to or faster than a random graph constructed with diverse neighborhoods, such that the neighborhood of each node has zero skew.
\autoref{fig:convergence-random-vs-d-cliques-1-class-per-node} shows results for experiments similar to those of \autoref{fig:convergence-random-vs-d-cliques-2-shards}, with nodes having examples of only 1 class instead of up to 4, to make the creation of diverse neighborhoods with zero skew easier.
Note that in this case the effect of data heterogeneity on convergence is actually stronger.
On MNIST, a random graph with a diverse neighborhood is marginally better than D-Cliques without Clique Averaging.
We hypothesize that the slightly better mixing provided by the random graph, compared to cliques, provides a small edge.
On CIFAR10, when using momentum, D-Cliques converge faster even without Clique Averaging.
In this case, it seems the clustering has a more beneficial effect than the marginal gain in mixing of the random graph.
For both MNIST and CIFAR10, the clustering provided by D-Cliques, in combination with Clique Averaging, provides faster convergence and lower variance than diverse neighborhoods.
In practice, obtaining a random graph in which each node has a neighborhood with skew similar to that of D-Cliques, without using more edges, is actually harder to achieve with a decentralized construction algorithm, because there are more constraints in selecting potential candidates for swapping.
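The neighborhood skew discussed here can be quantified, for instance, as the distance between the label distribution in a node's neighborhood (node included) and the global label distribution; the sketch below uses an $\ell_1$ distance purely for illustration, not necessarily the exact measure used in our construction.
\begin{verbatim}
import numpy as np

def neighborhood_skew(label_counts, neighbors, node):
    # label_counts: (n_nodes, n_classes) array of per-class example counts.
    # Returns the L1 distance between the class proportions seen in the
    # neighborhood of `node` (node included) and the global proportions;
    # 0 means the neighborhood has zero skew.
    nbhd = list(neighbors[node]) + [node]
    local = label_counts[nbhd].sum(axis=0).astype(float)
    glob = label_counts.sum(axis=0).astype(float)
    return np.abs(local / local.sum() - glob / glob.sum()).sum()
\end{verbatim}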
% From directory 'results-v2':
% MNIST
...
...
\caption{\label{fig:convergence-random-vs-d-cliques-1-class-per-node} Comparison to variations of a random graph with 10 edges per node on 100 nodes (variation of \autoref{fig:convergence-random-vs-d-cliques-2-shards} with 1 class per node instead of 2 shards per node, and with the random graph constrained so that the neighborhood of each node has zero skew).}
\end{figure}
These experiments show that a careful design of the topology is indeed
necessary, especially regarding the skew in the neighborhood of each node. In
addition, it appears that the ``bounded data heterogeneity'' assumption,
typically used in convergence theorems to compare different algorithms and