Our main goal is to provide a fair comparison of the convergence speed across
different topologies and algorithmic variations, in order to validate our
design choices and
show that D-Cliques can remove much of the effects of label distribution skew.
In our study, we focus on convergence speed rather than on the final accuracy
after a fixed number of iterations.
Indeed, depending on when training is stopped, the relative
difference in final accuracy across different algorithms may vary
significantly and lead to different conclusions. Instead of relying on
somewhat arbitrary stopping points, we show the convergence curves of
generalization performance (i.e., the accuracy on the test set throughout
training), up to a point where it is clear that the different approaches have
converged, will not make significantly more progress, or essentially behave
the same.
\paragraph{Datasets.} We experiment with two datasets: MNIST~\cite{mnistWebsite} and
CIFAR10~\cite{krizhevsky2009learning}, which both have $L=10$ classes.
For MNIST, we use 50k and 10k examples from the original 60k training
set for training and validation respectively. We use all 10k examples of
...
...
of up to 4 classes. However, most nodes will have examples of 2 classes. The varying number
of classes, as well as the varying distribution of examples within a single node, makes the task
of creating cliques with low skew nontrivial.
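For concreteness, the snippet below is a minimal sketch of one way to generate such a shard-based partition; the function name \texttt{shard\_partition} and the exact shard bookkeeping are illustrative rather than a verbatim excerpt of the experimental code.
\begin{verbatim}
import numpy as np

def shard_partition(labels, n_nodes, shards_per_node=2, seed=1):
    # Sort example indices by label, split them into equal contiguous
    # shards, and assign `shards_per_node` shards to each node at random.
    # A shard straddling a class boundary contributes two classes, so a
    # node can end up with up to 4 classes, most often exactly 2.
    rng = np.random.default_rng(seed)
    shards = np.array_split(np.argsort(labels), n_nodes * shards_per_node)
    order = rng.permutation(n_nodes * shards_per_node)
    return [np.concatenate([shards[s] for s in
                            order[i * shards_per_node:(i + 1) * shards_per_node]])
            for i in range(n_nodes)]
\end{verbatim}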
\paragraph{Models.}
We
use a logistic regression classifier for MNIST, which
provides up to 92.5\% accuracy in the centralized setting.
...
...
validation set for 100 nodes, obtaining respectively $0.1$ and $128$ for
MNIST and $0.002$ and $20$ for CIFAR10.
For CIFAR10, we additionally use a momentum of $0.9$.
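For reference, these values can be summarized as a simple configuration; the dictionary below merely restates the numbers above (the name \texttt{HPARAMS} and the zero momentum for MNIST, implied by the text, are illustrative).
\begin{verbatim}
# Learning rate, mini-batch size (100 nodes) and momentum per dataset.
HPARAMS = {
    "mnist":   {"lr": 0.1,   "batch_size": 128, "momentum": 0.0},
    "cifar10": {"lr": 0.002, "batch_size": 20,  "momentum": 0.9},
}
\end{verbatim}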
\paragraph{Metrics.}
We evaluate 100- and 1000-node networks by
creating multiple models
in memory and simulating the exchange of messages between nodes.
To ignore the impact of distributed execution strategies and system
optimization techniques, we report the test accuracy of all nodes (min, max,
...
...
case of a single node sampling the full distribution.
To further make results comparable across different numbers of nodes, we scale
the batch size inversely with the number of nodes,
e.g. on MNIST, 128 with 100 nodes vs. 13 with 1000 nodes. This
ensures the same number of model updates and averaging per epoch, allowing a
fair comparison.\footnote{Updating and averaging models
after every example can eliminate the impact of label distribution skew. However, the
resulting communication overhead is impractical.}
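This rule can be stated explicitly; the helper below is a sketch (its name and rounding convention are illustrative):
\begin{verbatim}
def scaled_batch_size(ref_batch_size, ref_nodes, n_nodes):
    # Keep the total number of examples processed per step constant, so
    # that every setting performs the same number of model updates and
    # averaging rounds per epoch, regardless of the number of nodes.
    return max(1, round(ref_batch_size * ref_nodes / n_nodes))

scaled_batch_size(128, 100, 1000)  # -> 13, the MNIST value used above
\end{verbatim}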
\paragraph{Baselines.} We compare our results against an ideal baseline:
a fully-connected network topology with the same number of nodes. All other
things being equal, any other topology using fewer edges will converge
at the same speed or slower: \textit{this is therefore the most difficult and general baseline to compare against}.
This baseline is also essentially equivalent to a centralized (single) IID node using a batch size
$n$ times bigger, where $n$ is the number of nodes. Both a fully-connected network and a single IID node
effectively optimize a single model and sample
uniformly from the global distribution: both thus remove entirely the
effect of label distribution skew and of the network topology on the
optimization. In practice, we prefer a
fully-connected network because it
...
...
distribution.\footnote{We
conjecture that a heterogeneous data partition in a fully-connected network may force
a more balanced representation of all classes in the union of all mini-batches, leading to better convergence.}
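To see more formally why these two baselines coincide, consider uniform averaging weights $W_{ij} = \frac{1}{n}$ and assume that all nodes start a step from the same model $\theta^{(t)}$, which full averaging preserves. One D-SGD step (a local gradient step followed by neighborhood averaging) then gives, for every node $i$,
\[
\theta^{(t+1)} = \frac{1}{n}\sum_{j=1}^{n}\bigl(\theta^{(t)} - \eta\, g_j\bigr)
= \theta^{(t)} - \eta\,\frac{1}{n}\sum_{j=1}^{n} g_j,
\]
where $\eta$ is the learning rate and $g_j$ is the mini-batch gradient of node $j$ (notation local to this remark). The right-hand side is exactly one centralized SGD step on the union of the $n$ mini-batches, i.e. on an IID mini-batch that is $n$ times bigger.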
We also provide comparisons against other popular topologies, such as random
graphs and exponential graphs~\cite{ying2021exponential}.
\subsection{D-Cliques Match the Convergence Speed of Fully-Connected with a
Fraction of the Edges}
\label{section:d-cliques-vs-fully-connected}
...
...
momentum, on D-Cliques (fully-connected) with 10 cliques of 10 heterogeneous nodes.
\subsection{D-Cliques Converge Faster than Random and Exponential Graphs}
\label{section:d-cliques-vs-random-graphs}
In this experiment, we compare D-Cliques to a random graph and an exponential
graph~\cite{ying2021exponential}.
For the random graph, we use a similar
number of edges (10) per node to determine
whether a simple sparse topology could work equally well.
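Such a topology can, for instance, be generated with a standard random regular graph routine; the snippet below uses \texttt{networkx} purely for illustration and is not necessarily the generator used in our experiments.
\begin{verbatim}
import networkx as nx

# 10-regular random graph on 100 nodes: every node has exactly 10
# undirected edges, i.e. the "similar number of edges (10) per node"
# mentioned above.
G = nx.random_regular_graph(d=10, n=100, seed=42)
\end{verbatim}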
To ensure a fair comparison, because a random graph does not support
...
...
is detrimental, similar to D-Cliques without the use of Clique Averaging
\caption{\label{fig:convergence-random-vs-d-cliques-2-shards} Comparison on 100 heterogeneous nodes between D-Cliques (fully-connected) with 10 cliques of size 10 and a random graph with 10 edges per node \textit{without} Clique Averaging or momentum.}
\end{figure}
For the exponential graph, we follow
the deterministic construction of~\cite{ying2021exponential} and consider
edges to be undirected as required by D-SGD, leading to 14 edges per node.
\autoref{fig:convergence-expander-vs-d-cliques-2-shards} shows that D-Cliques
converge faster, despite the exponential graph having slightly more edges.
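For reference, the sketch below is our reading of this construction rather than a verbatim excerpt: neighbors are taken at offsets that are powers of two, and keeping edges undirected yields 14 distinct neighbors per node for $n=100$.
\begin{verbatim}
import math

def exponential_graph(n):
    # Static exponential graph: node i is linked to nodes (i +/- 2^k) mod n
    # for k = 0, 1, ..., ceil(log2(n)) - 1; edges are kept undirected.
    hops = [2 ** k for k in range(int(math.log2(n - 1)) + 1)]
    return {i: {(i + h) % n for h in hops} | {(i - h) % n for h in hops}
            for i in range(n)}

len(exponential_graph(100)[0])  # -> 14 undirected edges per node
\end{verbatim}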
\caption{\label{fig:convergence-expander-vs-d-cliques-2-shards} Comparison on 100 heterogeneous nodes between D-Cliques (fully-connected) with 10 cliques of size 10 and a static exponential graph with 14 undirected edges per node \textit{without} Clique Averaging or momentum.}
\end{figure}
Moreover, D-Cliques converge similarly to or faster than a random graph constructed with diverse neighborhoods, such that the neighborhood of each node has zero skew.
\autoref{fig:convergence-random-vs-d-cliques-1-class-per-node} shows results for experiments similar to those of \autoref{fig:convergence-random-vs-d-cliques-2-shards}, with nodes having examples of only 1 class instead of up to 4, to make the creation of diverse neighborhoods with zero skew easier.
Note that in this case the effect of data heterogeneity on convergence is actually stronger.
On MNIST, a random graph with a diverse neighborhood is marginally better than D-Cliques without Clique Averaging.
We hypothesize that the slightly better mixing provided by the random graph, compared to cliques, provides a small edge.
On CIFAR10, when using momentum, D-Cliques converge faster even without Clique Averaging.
In this case, it seems the clustering has a more beneficial effect than the marginal gain in mixing of the random graph.
For both MNIST and CIFAR10, the clustering provided by D-Cliques, in combination with Clique Averaging, provides faster convergence and lower variance than diverse neighborhoods.
In practice, obtaining a random graph in which each node has a neighborhood with skew similar to that of D-Cliques, without using more edges, is actually harder to achieve with a decentralized construction algorithm, because there are more constraints in selecting potential candidates for swapping.
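The neighborhood skew discussed here can be quantified, for instance, as the distance between the label distribution in a node's neighborhood (node included) and the global label distribution; the sketch below uses an $\ell_1$ distance purely for illustration, not necessarily the exact measure used in our construction.
\begin{verbatim}
import numpy as np

def neighborhood_skew(label_counts, neighbors, node):
    # label_counts: (n_nodes, n_classes) array of per-class example counts.
    # Returns the L1 distance between the class proportions seen in the
    # neighborhood of `node` (node included) and the global proportions;
    # 0 means the neighborhood has zero skew.
    nbhd = list(neighbors[node]) + [node]
    local = label_counts[nbhd].sum(axis=0).astype(float)
    glob = label_counts.sum(axis=0).astype(float)
    return np.abs(local / local.sum() - glob / glob.sum()).sum()
\end{verbatim}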
% From directory 'results-v2':
% MNIST
...
...
\caption{\label{fig:convergence-random-vs-d-cliques-1-class-per-node} Comparison to variations of a random graph with 10 edges per node on 100 nodes (variation of \autoref{fig:convergence-random-vs-d-cliques-2-shards} with 1 class per node instead of 2 shards per node, and with the random graph constrained so that the neighborhood of each node has zero skew).}
\end{figure}
These experiments show that a careful design of the topology is indeed
necessary, especially regarding the skew in the neighborhood of each node. In
addition, it appears that the ``bounded data heterogeneity'' assumption,
typically used in convergence theorems to compare different algorithms and