...
enough such that all participants need only communicate with a small number
of other participants, i.e., nodes have small (constant or logarithmic) degree
\cite{lian2017d-psgd}. For IID data, recent work has shown both empirically
\cite{lian2017d-psgd,Lian2018} and theoretically \cite{neglia2020} that sparse
topologies like rings or grids do not significantly affect the convergence
speed compared to denser topologies.
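To make this concrete, the following toy sketch (our own minimal example with
a synthetic quadratic loss, not taken from the cited works) simulates D-SGD
rounds on a ring, where every node has constant degree 2: each node takes a
local gradient step and then averages its model with its two neighbors.
\begin{verbatim}
import numpy as np

def ring_neighbors(i, n):
    # On a ring, every node has constant degree 2.
    return [(i - 1) % n, (i + 1) % n]

def dsgd_round(models, targets, lr=0.1):
    # Local gradient step on a toy quadratic loss (x - target)^2,
    # followed by uniform averaging with self and both ring neighbors.
    n = len(models)
    stepped = [x - lr * 2 * (x - t) for x, t in zip(models, targets)]
    return [(stepped[i] + sum(stepped[j] for j in ring_neighbors(i, n))) / 3
            for i in range(n)]

models = [np.zeros(2) for _ in range(8)]
targets = [np.full(2, float(i)) for i in range(8)]  # heterogeneous optima
for _ in range(500):
    models = dsgd_round(models, targets)
print(np.mean(models, axis=0))  # the network average reaches [3.5, 3.5],
                                # the minimizer of the average local loss
\end{verbatim}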
\begin{figure*}[ht]
...
In contrast to the IID case, however, our experiments demonstrate that
\emph{the impact of topology is extremely significant for non-IID data}. This
phenomenon is illustrated in Figure~\ref{fig:iid-vs-non-iid-problem}: we
observe that under label distribution skew, i.e., when the relative class
frequencies of local distributions differ from those of the global
distribution~\cite{kairouz2019advances}, using a sparse topology (such as a
ring or a grid) clearly jeopardizes the convergence speed of decentralized
SGD. We stress that, unlike in centralized FL
\cite{mcmahan2016communication,scaffold,quagmire}, this
...
Specifically, we make the following contributions:
(1) We propose D-Cliques, a sparse topology in which nodes are organized in
interconnected cliques, i.e., locally fully-connected sets of nodes, such that
the joint label distribution of each clique is close to the global (IID)
distribution; (2) We propose Greedy Swap, a greedy algorithm for constructing
such cliques efficiently (see the first sketch below);
% in the presence of heterogeneity previously studied
% in the context of Federated Learning~\cite{mcmahan2016communication};
(3) We introduce Clique Averaging, a modified version of
the standard D-SGD algorithm which decouples gradient averaging, used for
optimizing local models, from distributed averaging, used to ensure that all
models converge, thereby reducing the bias introduced by inter-clique
connections (see the second sketch below);
(4) We show how Clique Averaging can be used to implement unbiased momentum
that would otherwise be detrimental in the non-IID setting; (5) We demonstrate
through an extensive experimental study that our approach removes the effect
of label distribution skew when training a linear model and a deep
convolutional network on the MNIST~\cite{mnistWebsite} and
CIFAR10~\cite{krizhevsky2009learning} datasets, respectively; (6) Finally, we
demonstrate the scalability of our approach by considering networks of up to
1000 nodes, in contrast to most previous work on fully decentralized learning,
which considers only a few tens of nodes
...
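To give some intuition for contribution (2), here is a minimal sketch of a
swap-based greedy clique construction under extreme label skew. It is our own
simplified illustration (all function names and parameters are hypothetical),
not necessarily the exact algorithm evaluated later: starting from a random
partition into fixed-size cliques, repeatedly take the clique whose label
distribution deviates most from the global one, try swapping one of its nodes
with a node from a random other clique, and keep the swap only if it reduces
the total deviation.
\begin{verbatim}
import random
from collections import Counter

def skew(clique, labels, global_dist):
    # L1 distance between a clique's label distribution and the global one.
    counts = Counter(labels[i] for i in clique)
    return sum(abs(counts[c] / len(clique) - p)
               for c, p in global_dist.items())

def build_cliques(labels, clique_size, iters=2000, seed=0):
    rng = random.Random(seed)
    n = len(labels)
    global_dist = {c: k / n for c, k in Counter(labels).items()}
    nodes = list(range(n))
    rng.shuffle(nodes)
    cliques = [nodes[i:i + clique_size] for i in range(0, n, clique_size)]
    for _ in range(iters):
        # Try one swap between the most skewed clique and a random other.
        a = max(cliques, key=lambda c: skew(c, labels, global_dist))
        b = rng.choice([c for c in cliques if c is not a])
        before = skew(a, labels, global_dist) + skew(b, labels, global_dist)
        i, j = rng.randrange(len(a)), rng.randrange(len(b))
        a[i], b[j] = b[j], a[i]
        after = skew(a, labels, global_dist) + skew(b, labels, global_dist)
        if after >= before:
            a[i], b[j] = b[j], a[i]  # revert swaps that do not reduce skew
    return cliques

# 100 nodes, each holding data of a single class: extreme label skew.
labels = [i % 10 for i in range(100)]
cliques = build_cliques(labels, clique_size=10)
print(sorted(labels[i] for i in cliques[0]))  # ideally one node per class
\end{verbatim}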
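For contributions (3) and (4), the companion sketch below (again our own
simplified rendering with hypothetical interfaces) shows the decoupling at
the heart of Clique Averaging: gradients are averaged over clique members
only, whose joint data is close to IID, momentum is applied to that
clique-averaged gradient, and only the parameter averaging step uses all
edges, including the inter-clique ones.
\begin{verbatim}
def clique_averaging_round(models, momenta, grad_fn, cliques, mixing,
                           lr=0.01, gamma=0.9):
    # models[i], momenta[i]: model and momentum buffer of node i.
    # grad_fn(j, x): gradient of node j's local loss at model x.
    # cliques[i]: members of node i's clique (including i itself).
    # mixing[i][j]: gossip weight of edge (i, j); rows sum to one and
    # nonzero entries cover intra- AND inter-clique neighbors.
    n = len(models)
    # (1) Average gradients within the clique only: the clique's joint
    # label distribution is close to IID, so this estimate is nearly
    # unbiased with respect to the global objective.
    grads = [sum(grad_fn(j, models[j]) for j in cliques[i]) / len(cliques[i])
             for i in range(n)]
    # (2) Momentum on the clique-averaged gradient stays unbiased too.
    momenta = [gamma * m + g for m, g in zip(momenta, grads)]
    stepped = [x - lr * m for x, m in zip(models, momenta)]
    # (3) Average parameters over ALL neighbors, including inter-clique
    # edges, so that all models converge to a single model.
    models = [sum(mixing[i][j] * stepped[j] for j in range(n))
              for i in range(n)]
    return models, momenta

# Toy demo: 4 nodes, cliques {0,1} and {2,3}, inter-clique edges 0-2, 1-3.
cliques = {0: [0, 1], 1: [0, 1], 2: [2, 3], 3: [2, 3]}
mixing = [[.4, .3, .3, .0], [.3, .4, .0, .3],
          [.3, .0, .4, .3], [.0, .3, .3, .4]]
grad_fn = lambda j, x: 2 * (x - j)  # node j's local optimum sits at j
models, momenta = [0.0] * 4, [0.0] * 4
for _ in range(300):
    models, momenta = clique_averaging_round(models, momenta, grad_fn,
                                             cliques, mixing)
print(models)  # all models cluster around the global optimum 1.5
\end{verbatim}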
Our approach
requires 98\% fewer edges ($18.9$ vs.\ $999$ edges per participant on average),
thereby yielding a 96\% reduction in the total number of required messages
(37.8 messages per round per node on average instead of 999), to achieve a
convergence speed similar to that of a fully-connected topology. Furthermore,
an additional 22\% improvement
% (14.5 edges per node on average instead of 18.9)
is possible when using a small-world inter-clique topology, with further
potential gains at larger scales through a quasilinear $O(n \log n)$ scaling
in the number of nodes $n$.
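Regarding the small-world inter-clique topology, the sketch below shows one
classic construction consistent with the stated quasilinear edge count (the
exact wiring is our illustrative choice, not necessarily the one used in the
paper): cliques sit on a ring and each clique adds ``finger'' edges to cliques
at power-of-two offsets, so each clique carries $O(\log n)$ inter-clique edges
and the whole graph $O(n \log n)$.
\begin{verbatim}
def small_world_interclique_edges(num_cliques):
    # Ring plus power-of-two "finger" offsets: every clique gets
    # O(log n) inter-clique edges, so O(n log n) edges in total.
    edges = set()
    for i in range(num_cliques):
        offset = 1  # offset 1 is the ring edge to the next clique
        while offset < num_cliques:
            j = (i + offset) % num_cliques
            edges.add((min(i, j), max(i, j)))
            offset *= 2
    return sorted(edges)

print(small_world_interclique_edges(8))
# clique 0 connects to cliques 1, 2 and 4
\end{verbatim}
Each such clique-level edge would then be realized by connecting, for
instance, one node from each of the two cliques.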
The rest of this paper is organized as follows \dots\todo{EL: Complete once structure stabilizes}