\textit{Can we design sparse topologies with convergence
speed similar to a fully connected network for problems involving
many participants with label distribution skew?}
Moreover, common analysis techniques to compare the asymptotic convergence behavior
of different topologies assume the impact on gradients is bounded by an unknown constant (e.g.,~\cite{ying2021exponential}), and therefore ignore it. For example, asymptotic analysis suggests that an expander graph, a sparse topology whose number of edges per node scales logarithmically with the number of nodes, achieves exact averaging~\cite{ying2021exponential}: if this were the case, it would remove the effect of the topology on distributed averaging and reduce D-SGD to FedSGD~\cite{mcmahan2016communication} regardless of how data is partitioned. However, as we will show, in practice not only is the effect of data heterogeneity sufficient to prevent exact averaging, but a different topology using as many or fewer edges can also converge faster. In the rest of this paper, we therefore rely on rigorous, repeatable experiments, rather than asymptotic convergence analysis, to accurately compare the convergence speed of a wide diversity of topologies. This provides a stronger basis for quantifying the effect of data heterogeneity and should motivate future work on adapting analysis techniques to correctly account for it.
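As a concrete illustration of such a sparse topology, the static exponential graph promoted in~\cite{ying2021exponential} connects node $i$ to nodes $(i + 2^k) \bmod n$, giving each node $\log_2 n$ out-neighbors. The following sketch is our own illustrative code, not the authors' implementation:

```python
import math

def exponential_graph(n):
    """Static exponential graph on n nodes (n a power of two):
    node i links to (i + 2**k) % n for k = 0..log2(n)-1, so the
    per-node degree grows logarithmically with the network size."""
    hops = int(math.log2(n))
    return {i: [(i + 2**k) % n for k in range(hops)] for i in range(n)}

g = exponential_graph(16)
# node 0's out-neighbors: [1, 2, 4, 8] -- degree log2(16) = 4
```

With $n = 1000$ nodes, such a topology uses only about 10 out-edges per node, which is why it is an attractive point of comparison for sparse decentralized learning.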
Specifically, we make the following contributions:
(1) We propose D-Cliques, a sparse topology in which nodes are organized in
...
...
optimizing local models, from distributed averaging, used to ensure that all
models converge, thereby reducing the bias introduced by inter-clique
connections;
(4) We show how Clique Averaging can be used to implement unbiased momentum
that would otherwise be detrimental in the heterogeneous setting; (5) Through
an extensive experimental study on decentralized learning of linear
models and deep
convolutional networks on MNIST %~\cite{mnistWebsite}
and CIFAR10 datasets, % ~\cite{krizhevsky2009learning}
we validate our various design choices and
demonstrate that our approach is able to remove the effect
of label distribution skew while maintaining a sparse topology;
(6) Finally, we demonstrate the scalability of our
approach by considering up to 1000-node networks, in contrast to most
previous work on fully decentralized learning, which performs empirical
...
...
For instance, our results show that under strong label distribution skew,
using D-Cliques in a 1000-node network
requires 98\% fewer edges ($18.9$ vs $999$ edges per participant on average) to obtain a convergence speed similar to that of a fully connected topology,
thereby yielding a 96\% reduction in the total number of required messages
(37.8 messages per round per node on average instead of 999). An additional 22\% improvement
% (14.5 edges per node on average instead of 18.9)
is possible when using a small-world inter-clique topology, with further
potential gains at larger scales through a quasilinear $O(n
\log n)$ scaling in the number of nodes $n$.
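The edge and message figures above can be verified with a few lines of arithmetic. We assume here, based on the reported numbers rather than an explicit statement, that a fully connected node sends one message per neighbor per round while each D-Cliques edge carries two messages per round:

```python
n = 1000
fc_edges = n - 1                       # fully connected: 999 edges per node
dc_edges = 18.9                        # D-Cliques: reported average per node

edge_saving = 1 - dc_edges / fc_edges  # ~0.981 -> "98% fewer edges"
dc_msgs = 2 * dc_edges                 # 37.8 messages per round per node
msg_saving = 1 - dc_msgs / fc_edges    # ~0.962 -> "96% fewer messages"

print(round(100 * edge_saving), round(100 * msg_saving))  # 98 96
```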
We also show that D-Cliques empirically provide faster and more
robust convergence than random graphs
(such as the exponential graphs recently promoted in
\cite{ying2021exponential}) with a similar number of edges.
The rest of this paper is organized as follows.
We first describe the problem setting in Section~\ref{section:problem}. We
...
...
on sparse topologies like rings or grids as they do on a fully connected
network \cite{lian2017d-psgd,Lian2018}. Recent work
\cite{neglia2020,consensus_distance} sheds light on this phenomenon with refined convergence analyses based on differences between gradients or parameters across nodes, which are typically
smaller in the homogeneous case. However, these results do not give any clear insight
regarding the role of the topology in the presence of heterogeneous data.
Indeed, in all of the above analyses, the impact of data
heterogeneity
is abstracted away through (unknown) constants
that bound the variance of local gradients.
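Concretely, these analyses typically assume a bound of the following form (notation ours), where $f_i$ is the local objective of node $i$ and $f = \frac{1}{n}\sum_{i=1}^{n} f_i$ is the global objective:
```latex
\frac{1}{n} \sum_{i=1}^{n} \big\| \nabla f_i(x) - \nabla f(x) \big\|^2 \;\le\; \zeta^2
\qquad \text{for all } x.
```
Because $\zeta^2$ enters the resulting convergence bounds as an opaque constant, such analyses cannot capture how a particular topology interacts with a particular data distribution.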
We note that some work
has gone into designing topologies to optimize the use of
network resources (see e.g., \cite{marfoq}), but the topology is chosen
independently of how data is distributed across nodes.
In summary, the interplay between the network topology and data heterogeneity
is not well understood and we are not aware of prior work focusing on this
question. Our work is the first
to show that an
appropriate choice of data-dependent topology can effectively compensate for