Commit 771a3cb4 authored by aurelien.bellet

intro / related work

parent 7d2064de
@@ -25,7 +25,7 @@ other words, they are
federated classification problems, known as \emph{label distribution skew}
\cite{kairouz2019advances,quagmire}, occurs when the frequency of different
classes of examples varies significantly across local datasets.
-One of the key challenges in FL is to design algorithms that
+A key challenge in FL is to design algorithms that
can efficiently deal with such heterogeneous data distributions
\cite{kairouz2019advances,fedprox,scaffold,quagmire}.
@@ -103,20 +103,26 @@ In contrast to the homogeneous case however, our experiments demonstrate that
This phenomenon is illustrated in Figure~\ref{fig:iid-vs-non-iid-problem}: we observe that under
label distribution skew, using a
sparse topology (a ring or
-a grid) clearly jeopardizes the convergence speed of decentralized SGD.
-We stress the fact
-that, unlike in centralized FL
+a grid) clearly jeopardizes the convergence speed of decentralized
+SGD.\footnote{Unlike in centralized FL
\cite{mcmahan2016communication,scaffold,quagmire}, this
happens even when nodes perform a single local update before averaging the
-model with their neighbors. In this paper, we thus address the following
+model with their neighbors.} This strong impact of
+data heterogeneity and its interplay with the choice of network
+topology is not explained by current theoretical
+analyses of
+decentralized FL, which model heterogeneity by some (unknown)
+constant that bounds the variance of local gradients, independently of the
+topology \cite{lian2017d-psgd,Lian2018,neglia2020,ying2021exponential}.
+In this paper, we thus address the following
question:
\textit{Can we design sparse topologies with convergence
speed similar to a fully connected network for problems involving
many participants with label distribution skew?}
-Moreover, common analysis techniques to compare the asymptotic convergence behavior
-of different topologies assume the impact on gradients is bounded by an unknown constant (e.g.~\cite{ying2021exponential}), and therefore ignore it. For example, asymptotic analysis suggests that an expander graph, a sparse topology whose number of edges per node scales logarithmically with the number of nodes, achieves exact averaging~\cite{ying2021exponential}: if this were the case, this would remove the effect of the topology on distributed averaging and reduce D-SGD to FedSGD~\cite{mcmahan2016communication} regardless of how data is partitioned. However, as we will show, in practice not only is the effect of data heterogeneity sufficient to prevent exact averaging, but a different topology using as many or fewer edges can also converge faster. In the rest of this paper, we therefore rely on rigorous, repeatable practical experiments, instead of asymptotic convergence analysis, to accurately compare the convergence speed of a wide diversity of topologies. This provides a stronger basis for quantifying the effect of data heterogeneity and should motivate future work in adapting analysis techniques to correctly include it.
+% Moreover, common analysis techniques to compare the asymptotic convergence behavior
+% of different topologies assume the impact on gradients is bounded by an unknown constant (e.g.~\cite{ying2021exponential}), and therefore ignore it. For example, asymptotic analysis suggests that an expander graph, a sparse topology whose number of edges per node scales logarithmically with the number of nodes, achieves exact averaging~\cite{ying2021exponential}: if this were the case, this would remove the effect of the topology on distributed averaging and reduce D-SGD to FedSGD~\cite{mcmahan2016communication} regardless of how data is partitioned. However, as we will show, in practice not only is the effect of data heterogeneity sufficient to prevent exact averaging, but a different topology using as many or fewer edges can also converge faster. In the rest of this paper, we therefore rely on rigorous, repeatable practical experiments, instead of asymptotic convergence analysis, to accurately compare the convergence speed of a wide diversity of topologies. This provides a stronger basis for quantifying the effect of data heterogeneity and should motivate future work in adapting analysis techniques to correctly include it.
Specifically, we make the following contributions:
(1) We propose D-Cliques, a sparse topology in which nodes are organized in
@@ -132,16 +138,14 @@ optimizing local models, from distributed averaging, used to ensure that all
models converge, thereby reducing the bias introduced by inter-clique
connections;
(4) We show how Clique Averaging can be used to implement unbiased momentum
-that would otherwise be detrimental in the heterogeneous setting; (5) We
-demonstrate
-through an extensive experimental study that our approach removes the effect
-of label distribution skew when training a linear
-model and a deep
-convolutional network on the MNIST %~\cite{mnistWebsite}
-and CIFAR10 % ~\cite{krizhevsky2009learning}
-datasets respectively; (5) We show that D-Cliques converge faster than a static
-undirected expander graph, otherwise thought to provide asymptotic exact averaging,
-showing that data heterogeneity should not be ignored in convergence analysis;
+that would otherwise be detrimental in the heterogeneous setting; (5) Through
+an extensive experimental study on decentralized learning of linear
+models and deep
+convolutional networks on MNIST %~\cite{mnistWebsite}
+and CIFAR10 datasets, % ~\cite{krizhevsky2009learning}
+we validate our various design choices and
+demonstrate that our approach is able to remove the effect
+of label distribution skew while maintaining a sparse topology;
(6) Finally, we demonstrate the scalability of our
approach by considering up to 1000-node networks, in contrast to most
previous work on fully decentralized learning which performs empirical
@@ -154,11 +158,15 @@ For instance, our results show that under strong label distribution skew,
using D-Cliques in a 1000-node network
requires 98\% fewer edges ($18.9$ vs $999$ edges per participant on average) to obtain a convergence speed similar to that of a fully-connected topology,
thereby yielding a 96\% reduction in the total number of required messages
-(37.8 messages per round per node on average instead of 999). Furthermore an additional 22\% improvement
+(37.8 messages per round per node on average instead of 999). An additional 22\% improvement
% (14.5 edges per node on average instead of 18.9)
is possible when using a small-world inter-clique topology, with further
potential gains at larger scales through a quasilinear $O(n
\log n)$ scaling in the number of nodes $n$.
+We also show that D-Cliques empirically provide faster and more
+robust convergence than random graphs
+(such as the exponential graphs recently promoted in
+\cite{ying2021exponential}) with a similar number of edges.
The rest of this paper is organized as follows.
We first describe the problem setting in Section~\ref{section:problem}. We
...
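The figures quoted in the contributions above can be checked with simple arithmetic. The sketch below infers from the quoted numbers (37.8 = 2 × 18.9) that D-Cliques exchanges two messages per edge per round (presumably gradients for Clique Averaging plus model parameters), versus one per edge for the fully-connected baseline; this factor of two is our assumption, not stated explicitly in the text.

```python
# Back-of-the-envelope check of the 1000-node figures quoted above.
n = 1000
fc_edges_per_node = n - 1   # fully-connected topology: 999 edges per node
dc_edges_per_node = 18.9    # D-Cliques: average reported above

# "98% fewer edges"
print(f"edge reduction: {1 - dc_edges_per_node / fc_edges_per_node:.1%}")  # 98.1%

# Assumption (ours): D-Cliques sends 2 messages per edge per round
# (gradients + model parameters); the baseline sends 1 per edge.
fc_msgs = 1 * fc_edges_per_node   # 999 messages per round per node
dc_msgs = 2 * dc_edges_per_node   # 37.8 messages per round per node
# "96% reduction in the total number of required messages"
print(f"message reduction: {1 - dc_msgs / fc_msgs:.1%}")  # 96.2%

# Small-world inter-clique topology: ~14.5 edges per node (commented
# figure above) instead of 18.9, roughly the reported 22% improvement.
print(f"small-world improvement: {1 - 14.5 / dc_edges_per_node:.1%}")  # 23.3%
```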
@@ -100,7 +100,8 @@ cliques such that the label distribution in a clique is representative
of the global label distribution. We also show how to adapt the updates of
decentralized SGD
to obtain unbiased gradients and implement an effective momentum with
-D-Cliques. Our extensive empirical evaluation on MNIST and CIFAR10 demonstrates that our approach
+D-Cliques. Our extensive empirical evaluation on MNIST and CIFAR10
+validates our design and demonstrates that our approach
achieves a convergence speed similar to that of a fully-connected topology,
% , the latter providing the best possible convergence
% in a data heterogeneous setting,
...
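To make the decoupling described above concrete, here is a minimal NumPy sketch of one round in the spirit of D-Cliques with Clique Averaging: gradients are averaged only within a node's clique (whose label distribution is constructed to be representative of the global one), momentum is applied to this unbiased estimate, and the full topology's mixing matrix is used only for parameter averaging. The function and variable names, and the exact update order, are our illustrative assumptions rather than the paper's verbatim algorithm.

```python
import numpy as np

def d_cliques_round(X, G, W, clique_of, V, lr=0.05, mu=0.9):
    """One decentralized SGD round with Clique Averaging-style
    decoupling (illustrative sketch).

    X: (n, d) model parameters, one row per node
    G: (n, d) local stochastic gradients
    W: (n, n) doubly-stochastic mixing matrix of the full topology
       (intra-clique and inter-clique edges)
    clique_of: clique_of[i] = indices of node i's clique (including i)
    V: (n, d) per-node momentum buffers
    """
    # 1) Gradient averaging restricted to the clique: because each
    #    clique's label distribution approximates the global one, this
    #    average is a (nearly) unbiased gradient estimate.
    G_hat = np.stack([G[c].mean(axis=0) for c in clique_of])
    # 2) Momentum on the clique-averaged gradient remains unbiased,
    #    whereas momentum on raw local gradients would amplify the
    #    bias caused by label distribution skew.
    V = mu * V + G_hat
    # 3) Distributed averaging of parameters over ALL neighbors,
    #    including inter-clique edges, so that models converge to a
    #    common value.
    X = W @ (X - lr * V)
    return X, V
```

With W set to any doubly-stochastic weighting of the D-Cliques topology (e.g., Metropolis-Hastings weights), step 3 is standard D-SGD gossip; the only change relative to plain D-SGD is that step 1 averages gradients over the clique instead of using each node's local gradient alone.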
@@ -67,13 +67,19 @@ on sparse topologies like rings or grids as they do on a fully connected
network \cite{lian2017d-psgd,Lian2018}. Recent work
\cite{neglia2020,consensus_distance} sheds light on this phenomenon with refined convergence analyses based on differences between gradients or parameters across nodes, which are typically
smaller in the homogeneous case. However, these results do not give any clear insight
regarding the role of the topology in the presence of heterogeneous data.
+Indeed, in all of the above analyses, the impact of data
+heterogeneity
+is abstracted away through (unknown) constants
+that bound the variance of local gradients.
We note that some work
-has gone into designing efficient topologies to optimize the use of
+has gone into designing topologies to optimize the use of
network resources (see e.g., \cite{marfoq}), but the topology is chosen
-independently of how data is distributed across nodes. In summary, the role
-of topology in the heterogeneous data scenario is not well understood and we are not
-aware of prior work focusing on this question. Our work is the first
+independently of how data is distributed across nodes.
+In summary, the interplay between the network topology and data heterogeneity
+is not well understood and we are not aware of prior work focusing on this
+question. Our work is the first
to show that an
appropriate choice of data-dependent topology can effectively compensate for
heterogeneous data.
\ No newline at end of file
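For readers unfamiliar with the "(unknown) constants" mentioned above, a typical form of this bounded-heterogeneity assumption is the following (our notation; exact formulations vary across \cite{lian2017d-psgd,Lian2018,neglia2020,ying2021exponential}):

```latex
% Bounded data heterogeneity (a typical form; notation ours):
% the local objectives f_i deviate from their average f by at most
% a constant zeta^2, uniformly over the parameter space.
\[
  \frac{1}{n}\sum_{i=1}^{n}\bigl\|\nabla f_i(x)-\nabla f(x)\bigr\|^{2}
  \;\le\; \zeta^{2}
  \qquad \text{for all } x,
  \quad \text{where } f(x)=\frac{1}{n}\sum_{i=1}^{n} f_i(x).
\]
```

Because $\zeta$ is a topology-independent constant, convergence bounds expressed in terms of it cannot, by construction, capture the interplay between data heterogeneity and the choice of topology that motivates this work.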