Commit f7894b4a authored by aurelien.bellet

rest

parent 771a3cb4
@@ -17,14 +17,29 @@ random cliques.
\label{section:experimental-settings}
Our main goal is to provide a fair comparison of the convergence speed across
different topologies and algorithmic variations, in order to validate our
design choices and
show that D-Cliques can remove much of the effects of label distribution skew.
In our study, we focus on convergence speed rather than on the final accuracy
after a fixed number of iterations.
% we focus our investigation on concrete experiments, rather than asymptotic convergence analysis, and convergence speed, rather than final accuracy, for two reasons. First, when using asymptotic convergence analysis, the effect of heterogeneous data on the variance of gradients is typically abstracted
% with the use of bounded but unknown constants (e.g.~\cite{ying2021exponential}). As the next experiments
% show, a careful design of the topology significantly effect the convergence speed, which seems to indicate that the usually overlooked constants have a major impact.
Indeed, depending on when training is stopped, the relative
difference in final accuracy across different algorithms may vary
significantly and lead to different conclusions. Instead of relying on
somewhat arbitrary stopping points, we show the convergence curves of
generalization performance (i.e., the accuracy on the test set throughout
training), up to a point where it is clear that the different approaches have
converged, will not make significantly more progress, or essentially behave
the same.
% approach
% clearly converges faster and/or the other approaches are essentially similar, everything else being exactly the same.
% Formalizing the underlying factors is outside the scope of this paper, but shall be greatly informed by the concrete experiments that we present.
\paragraph{Datasets} We experiment with two datasets: MNIST~\cite{mnistWebsite} and
CIFAR10~\cite{krizhevsky2009learning}, which both have $L=10$ classes.
For MNIST, we use 50k and 10k examples from the original 60k training
set for training and validation respectively. We use all 10k examples of
@@ -48,6 +63,7 @@ of up to 4 classes. However, most nodes will have examples of 2 classes. The va
of classes, as well as the varying distribution of examples within a single node, makes the task
of creating cliques with low skew nontrivial.
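For concreteness, the following is a minimal sketch of a shard-based partition consistent with the description above (this is our own illustration rather than the exact code used in the experiments; names and the shuffling details are assumptions):
\begin{verbatim}
import numpy as np

def shard_partition(labels, n_nodes, shards_per_node=2, seed=1):
    # Sort examples by label, cut the sorted indices into equal-size
    # shards, and assign shards_per_node shards to each node. A shard
    # can straddle a class boundary, so a node may hold up to
    # 2 * shards_per_node classes, but most nodes hold exactly
    # shards_per_node classes.
    rng = np.random.default_rng(seed)
    order = np.argsort(labels, kind="stable")
    shards = np.array_split(order, n_nodes * shards_per_node)
    assignment = rng.permutation(len(shards))
    return [np.concatenate([shards[s] for s in
            assignment[i * shards_per_node:(i + 1) * shards_per_node]])
            for i in range(n_nodes)]

labels = np.random.randint(0, 10, size=50000)   # MNIST-like labels
parts = shard_partition(labels, n_nodes=100)
print(len(np.unique(labels[parts[0]])))          # classes held by node 0
\end{verbatim}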
\paragraph{Models}
We
use a logistic regression classifier for MNIST, which
provides up to 92.5\% accuracy in the centralized setting.
@@ -63,7 +79,9 @@ validation set for 100 nodes, obtaining respectively $0.1$ and $128$ for
MNIST and $0.002$ and $20$ for CIFAR10.
For CIFAR10, we additionally use a momentum of $0.9$.
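As a purely illustrative sketch (ours, in PyTorch; not the code used for the experiments), the MNIST model and the optimizer settings corresponding to the hyperparameters above could be written as:
\begin{verbatim}
import torch
import torch.nn as nn

# Logistic regression for MNIST: a single linear layer over the
# flattened 28x28 input, trained with cross-entropy.
mnist_model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
loss_fn = nn.CrossEntropyLoss()

# MNIST: learning rate 0.1 (batch size 128 for 100 nodes), plain SGD.
opt_mnist = torch.optim.SGD(mnist_model.parameters(), lr=0.1)

# CIFAR10: learning rate 0.002 (batch size 20) with momentum 0.9; the
# convolutional model used for CIFAR10 is not shown here.
# opt_cifar = torch.optim.SGD(cifar_model.parameters(), lr=0.002,
#                             momentum=0.9)
\end{verbatim}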
\paragraph{Metrics}
We evaluate 100- and 1000-node networks by
creating multiple models
in memory and simulating the exchange of messages between nodes.
To ignore the impact of distributed execution strategies and system
optimization techniques, we report the test accuracy of all nodes (min, max,
@@ -73,17 +91,20 @@ case of a single node sampling the full distribution.
To further make results comparable across different numbers of nodes, we scale
the batch size inversely with the number of nodes,
e.g. on MNIST, 128 with 100 nodes vs. 13 with 1000 nodes. This
ensures the same number of model updates and averaging steps per epoch,
allowing a fair comparison.\footnote{Updating and averaging models
after every example can eliminate the impact of label distribution skew. However, the
resulting communication overhead is impractical.}
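The inverse scaling rule can be made explicit as follows (a small sketch of our own; the rounding choice is an assumption made to reproduce the 128 vs. 13 example):
\begin{verbatim}
def scaled_batch_size(reference_bs, reference_nodes, n_nodes):
    # Scale the batch size inversely with the number of nodes so that
    # the number of local updates and averaging steps per epoch stays
    # the same across network sizes.
    return max(1, round(reference_bs * reference_nodes / n_nodes))

print(scaled_batch_size(128, 100, 100))    # MNIST, 100 nodes  -> 128
print(scaled_batch_size(128, 100, 1000))   # MNIST, 1000 nodes -> 13
\end{verbatim}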
\paragraph{Baselines} We compare our results against an ideal
baseline:
a fully-connected network topology with the same number of nodes. All other
things being equal, any other topology using fewer edges will converge
at the same speed or slower: \textit{this is therefore the most difficult and general baseline to compare against}.
This baseline is also essentially equivalent to a centralized (single) IID node using a batch size
$n$ times bigger, where $n$ is the number of nodes. Both a fully-connected network and a single IID node
effectively optimize a single model and sample
uniformly from the global distribution: both thus remove entirely the
effect of label distribution skew and of the network topology on the
optimization. In practice, we prefer a
fully-connected network because it
@@ -93,6 +114,9 @@ distribution.\footnote{We
conjecture that a heterogeneous data partition in a fully-connected network may force
a more balanced representation of all classes in the union of all mini-batches, leading to better convergence.}
We also provide comparisons against other popular topologies, such as random
graphs and exponential graphs~\cite{ying2021exponential}.
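The equivalence between the fully-connected baseline and a single IID node can be illustrated with the corresponding mixing matrix (a toy numerical sketch of our own, with made-up dimensions):
\begin{verbatim}
import numpy as np

n = 100
# Fully-connected baseline: a doubly stochastic mixing matrix in which
# every node gives the same weight 1/n to every node (including itself).
W_full = np.full((n, n), 1.0 / n)

# After one averaging step x <- W x, every local model equals the global
# average, which is why this baseline behaves like a single IID node
# using an n-times larger batch size.
x = np.random.randn(n, 5)          # toy "models" with 5 parameters each
x_avg = W_full @ x
assert np.allclose(x_avg, x.mean(axis=0))
\end{verbatim}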
\subsection{D-Cliques Match the Convergence Speed of Fully-Connected with a
Fraction of the Edges}
\label{section:d-cliques-vs-fully-connected}
@@ -235,9 +259,10 @@ momentum, on D-Cliques (fully-connected) with 10 cliques of 10 heterogeneous nod
\subsection{D-Cliques Converge Faster than Random and Exponential Graphs}
\label{section:d-cliques-vs-random-graphs}
In this experiment, we compare D-Cliques to a random graph and an exponential
graph~\cite{ying2021exponential}.
For the random graph, we use a similar
number of edges (10) per node to determine
whether a simple sparse topology could work equally well.
To ensure a fair comparison, because a random graph does not support
@@ -273,8 +298,17 @@ is detrimental, similar to D-Cliques without the use of Clique Averaging
\caption{\label{fig:convergence-random-vs-d-cliques-2-shards} Comparison on 100 heterogeneous nodes between D-Cliques (fully-connected) with 10 cliques of size 10 and a random graph with 10 edges per node \textit{without} Clique Averaging or momentum.}
\end{figure}
For the exponential graph, we follow
% we compare against the one used by Ying et
% al.~\cite{ying2021exponential}, that results in a large spectral gap and low communication cost, with claims that it asymptotically provides exact averaging while communicating with only a logarithmic number of neighbors.
the deterministic construction of~\cite{ying2021exponential} and consider
edges to be undirected as required by D-SGD, leading to 14 edges per node.
% connections induced by
% the static expander graph for every communication step instead of a single neighbor chosen in a round-robin fashion.
\autoref{fig:convergence-expander-vs-d-cliques-2-shards} shows that D-Cliques
converge faster, despite the exponential graph having slightly more edges.
% has
% the benefit of 14 times more communication than used in~\cite{ying2021exponential} and 4 more edges per node than D-Cliques.
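As we understand the deterministic construction of~\cite{ying2021exponential}, node $i$ is linked to the nodes $(i + 2^k) \bmod n$; symmetrizing these links as required by D-SGD gives 14 undirected edges per node for $n = 100$. The sketch below is our own illustration, not the authors' code:
\begin{verbatim}
import math

def exponential_graph_neighbors(n):
    # Node i is linked to (i + 2**k) mod n for k = 0..ceil(log2(n)) - 1;
    # edges are then symmetrized because D-SGD requires undirected links.
    hops = [2 ** k for k in range(math.ceil(math.log2(n)))]
    neigh = [set() for _ in range(n)]
    for i in range(n):
        for h in hops:
            j = (i + h) % n
            neigh[i].add(j)
            neigh[j].add(i)
    return neigh

neighbors = exponential_graph_neighbors(100)
print(max(len(s) for s in neighbors))   # -> 14 undirected edges per node
\end{verbatim}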
% From directory 'results-v2':
% MNIST
% python $TOOLS/analyze/filter.py all --dataset:name mnist --topology:name expander-graph d-cliques/greedy-swap --nodes:name 2-shards-uneq-classes --meta:seed 1 --nodes:nb-nodes 100 | python $TOOLS/analyze/diff.py
@@ -300,7 +334,8 @@ For an expander graph, we compare against the one used by Ying et al.~\cite{ying
\caption{\label{fig:convergence-expander-vs-d-cliques-2-shards} Comparison on 100 heterogeneous nodes between D-Cliques (fully-connected) with 10 cliques of size 10 and a static exponential graph with 14 undirected edges per node \textit{without} Clique Averaging or momentum.}
\end{figure}
Moreover, D-Cliques converge similarly to, or faster than, a random graph constructed with diverse neighborhoods, i.e., such that the neighborhood of each node has zero skew. \autoref{fig:convergence-random-vs-d-cliques-1-class-per-node} shows results for experiments similar to those of \autoref{fig:convergence-random-vs-d-cliques-2-shards}, with nodes having examples of only 1 class instead of up to 4, which makes the creation of diverse neighborhoods with zero skew easier. Note that in this case the effect of data heterogeneity on convergence is actually stronger. On MNIST, a random graph with diverse neighborhoods is marginally better than D-Cliques without Clique Averaging. We hypothesize that the slightly better mixing provided by the random graph, compared to cliques, provides a small edge. On CIFAR10, when using momentum, D-Cliques converge faster even without Clique Averaging: in this case, the clustering appears to have a more beneficial effect than the marginal gain in mixing of the random graph. For both MNIST and CIFAR10, the clustering provided by D-Cliques, in combination with Clique Averaging, yields faster convergence and lower variance than diverse neighborhoods. In practice, constructing a random graph in which each node has a neighborhood with skew as low as in D-Cliques, without using more edges, is actually harder for a decentralized algorithm, because the additional constraints reduce the set of potential candidates for swapping.
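To make the notion of neighborhood skew concrete, one plausible way to measure it is sketched below (this is our own illustration; the exact skew definition and swap procedure used in the experiments may differ):
\begin{verbatim}
import numpy as np

def neighborhood_skew(neighborhood_labels, global_labels, n_classes=10):
    # L1 distance between the label distribution seen in a node's
    # neighborhood and the global label distribution; 0 means the
    # neighborhood is perfectly representative of the global data.
    p = np.bincount(neighborhood_labels, minlength=n_classes)
    q = np.bincount(global_labels, minlength=n_classes)
    return np.abs(p / p.sum() - q / q.sum()).sum()

# A greedy construction can then repeatedly swap candidate nodes between
# neighborhoods whenever the swap lowers the total skew; the additional
# constraints discussed above shrink the pool of valid candidates.
\end{verbatim}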
% From directory 'results-v2':
% MNIST
@@ -327,8 +362,14 @@ Moreover, D-Cliques converge similarly or faster than a random graph constructed
\caption{\label{fig:convergence-random-vs-d-cliques-1-class-per-node} Comparison to variations of a random graph with 10 edges per node on 100 nodes (variation of \autoref{fig:convergence-random-vs-d-cliques-2-shards} with 1 class per node instead of 2 shards per node, and with the random graph additionally constrained so that the neighborhood of each node has zero skew).}
\end{figure}
These experiments show that a careful design of the topology is indeed
necessary, especially regarding the skew in the neighborhood of each node. In
addition, it appears that the ``bounded data heterogeneity'' assumption,
typically used in convergence theorems to compare different algorithms and
topologies~\cite{lian2017d-psgd,Lian2018,neglia2020,ying2021exponential},
overlooks the major impact that the choice of topology (e.g., one that ensures
clustered representative neighborhoods like D-Cliques) can have on
convergence speed.
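A typical form of this assumption (the notation here is ours; the exact statement varies across the cited works) bounds the deviation of local gradients from the global gradient by a constant that does not depend on the topology:
\begin{equation*}
\frac{1}{n}\sum_{i=1}^{n} \left\| \nabla F_i(x) - \nabla f(x) \right\|^2 \leq \zeta^2
\quad \text{for all } x, \qquad \text{where } f(x) = \frac{1}{n}\sum_{i=1}^{n} F_i(x),
\end{equation*}
so that two topologies with the same $\zeta$ are indistinguishable in the resulting bounds, even though the skew of their neighborhoods can be very different.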
%% From directory 'results-v2':
%% MNIST
%% python $TOOLS/analyze/filter.py all --dataset:name mnist --topology:name random-graph d-cliques/greedy-swap greedy-neighbourhood-swap --nodes:name 2-shards-uneq-classes --meta:seed 1 --nodes:nb-nodes 100 | python $TOOLS/analyze/diff.py
@@ -108,8 +108,8 @@ SGD.\footnote{Unlike in centralized FL
\cite{mcmahan2016communication,scaffold,quagmire}, this
happens even when nodes perform a single local update before averaging the
model with their neighbors.} This strong impact of
data heterogeneity and its interplay with the choice of
topology is not well explained by current theoretical
analyses of
decentralized FL, which model heterogeneity by some (unknown)
constant that bounds the variance of local gradients, independently of the
@@ -164,9 +164,9 @@ is possible when using a small-world inter-clique topology, with further
potential gains at larger scales through a quasilinear $O(n
\log n)$ scaling in the number of nodes $n$.
We also show that D-Cliques empirically provide faster and more
robust convergence than other topologies with a similar number of edges, such
as random graphs and the exponential graphs recently promoted
in~\cite{ying2021exponential}.
The rest of this paper is organized as follows.
We first describe the problem setting in Section~\ref{section:problem}. We
@@ -44,13 +44,13 @@ instabilities when run on topologies other than rings. When
the rows and columns of $W$ do not exactly
sum to $1$ (due to finite precision), these small differences get amplified by
the proposed updates and make the algorithm diverge.}
In contrast, we focus on the design of a sparse topology which compensates for
the effect of heterogeneous data and scales to large
networks. We do not modify the simple
and efficient D-SGD
algorithm \cite{lian2017d-psgd} beyond removing some neighbor
contributions
that otherwise bias the gradient~direction.
\paragraph{Impact of topology in fully decentralized FL} It is well
known
@@ -71,7 +71,7 @@ regarding the role of the topology in the presence of heterogeneous data.
Indeed, in all of the above analyses, the impact of data
heterogeneity
is abstracted away through (unknown) constants
that bound the variance of local gradients, independently of the topology.
We note that some work
has gone into designing topologies to optimize the use of
network resources (see e.g., \cite{marfoq}), but the topology is chosen