Federated learning algorithms can be classified into two categories depending
on the underlying network topology they run on. In server-based FL, the
network is organized according to a star topology: a central server orchestrates the training process by
iteratively aggregating model updates received from the participants
(\emph{clients}) and sending
them back the aggregated model \cite{mcmahan2016communication}. In contrast,
(\emph{clients}) and sending back the aggregated model \cite{mcmahan2016communication}. In contrast,
fully decentralized FL algorithms operate over an arbitrary network topology
where participants communicate only with their direct neighbors
in the network. A classic example of such algorithms is Decentralized
...
...
@@ -128,6 +130,24 @@ speed compared to using denser topologies.
% We also note that full decentralization can also provide benefits in terms of
% privacy protection \cite{amp_dec}.
In contrast to the IID case however, our experiments demonstrate that \emph{the impact of topology is extremely significant for non-IID data}. This phenomenon is illustrated
in Figure~\ref{fig:iid-vs-non-iid-problem}: we observe that a ring or
a grid topology clearly jeopardizes the convergence speed as local
distributions do not have relative frequency of classes similar to the global
distribution, i.e. they exhibit \textit{local class bias}. We stress the fact
that, unlike in centralized FL
\cite{kairouz2019advances,scaffold,quagmire}, this
happens even when nodes perform a single local update before averaging the
model with their neighbors. In this paper, we address the following question:
% \textit{Are there regular topologies, i.e. where all nodes have similar or the same number of neighbours, with less connections than a fully-connected graph that retain a similar convergence speed and non-IID behaviour?}
%\textit{Are there sparse topologies with similar convergence speed as the fully connected graph under a large number of participants with local class bias?}
\textit{Can we design sparse topologies with convergence
speed similar to the one obtained in a fully connected network under
a large number of participants with local class bias?}
%AMK: do we talk about local class bias or noniidness?
\begin{figure}[t]
\centering
...
...
@@ -168,28 +188,8 @@ speed compared to using denser topologies.
\label{fig:iid-vs-non-iid-problem}
\end{figure}
In contrast to the IID case however, our experiments demonstrate that \emph{the impact of topology is extremely significant for non-IID data}. This phenomenon is illustrated
in Figure~\ref{fig:iid-vs-non-iid-problem}: we observe that a ring or
a grid topology clearly jeopardizes the convergence speed as local
distributions do not have relative frequency of classes similar to the global
distribution, i.e. they have \textit{local class bias}. We stress the fact that, unlike in centralized FL
\cite{kairouz2019advances,scaffold,quagmire}, this
happens even when nodes perform a single local update before averaging the
model with their neighbors. In this paper, we address the following question:
% \textit{Are there regular topologies, i.e. where all nodes have similar or the same number of neighbours, with less connections than a fully-connected graph that retain a similar convergence speed and non-IID behaviour?}
%\textit{Are there sparse topologies with similar convergence speed as the fully connected graph under a large number of participants with local class bias?}
\textit{Can we design sparse topologies with convergence
speed similar to the one obtained in a fully connected network under
a large number of participants with local class bias?}
%AMK: do we talk about local class bias or noniidness?
%Indeed, as we show with the following contributions:
In this paper, we make the following contributions:
Specifically, we make the following contributions:
(1) We propose D-Cliques, a sparse topology in which nodes are organized in
interconnected cliques, i.e. locally fully-connected sets of nodes, such that
the joint data distribution of each clique is representative of the global
...
...
@@ -198,18 +198,22 @@ the standard D-SGD algorithm which decouples gradient averaging, used for
optimizing local models, from distributed averaging, used to ensure all models
converge, therefore reducing the bias introduced by inter-clique connections;
(3) We show how Clique Averaging can be used to implement unbiased momentum
that would otherwise be detrimental in a non-IID setting; (4) We demonstrate
that would otherwise be detrimental in the non-IID setting; (4) We
demonstrate
through an extensive experimental study that our approach removes the effect
of the local class bias both for the MNIST~\cite{mnistWebsite} and CIFAR10~
of the local class bias on the MNIST~\cite{mnistWebsite} and CIFAR10~
\cite{krizhevsky2009learning} datasets, for training a linear model and a deep
convolutional network; (5) Finally, we demonstrate the scalability of our
approach by considering up to 1000node networks, in contrast to most
approach by considering up to 1000-node networks, in contrast to most
previous work on fully decentralized learning that considers only a few tens
of participants~\cite{tang18a,neglia2020,momentum_noniid,cross_gradient,consensus_distance}.
%we show that these results hold up to 1000 participants, in contrast to most previous work on fully decentralized algorithms that considers only a few tens of participants \cite{tang18a,more_refs}.
For instance, our results show that using D-Cliques in a 1000-node network
requires 98\% less edges ($18.9$ vs $999$ edges per participant on average) thus yielding a 96\% reduction in the total number of required messages (37.8 messages per round per node on average instead of 999) to obtain a similar convergence speed as a fully-connected topology. Furthermore an additional 22\% improvement
requires 98\% less edges ($18.9$ vs $999$ edges per participant on average),
thereby yielding a 96\% reduction in the total number of required messages
(37.8 messages per round per node on average instead of 999), to obtain a similar convergence speed as a fully-connected topology. Furthermore an additional 22\% improvement
% (14.5 edges per node on average instead of 18.9)
is possible when using a small-world inter-clique topology, with further potential gains at larger scales because of its linear-logarithmic scaling.
...
...
@@ -221,7 +225,7 @@ Section~\ref{section:clique-averaging-momentum}, we
show how to further reduce bias with Clique Averaging and how to use it to
implement momentum. We present the results or our extensive experimental
study in Section~\ref{section:non-clustered}. We review some related work in
Section~\ref{section:related-work}), and conclude with promising directions
Section~\ref{section:related-work}, and conclude with promising directions
for future work in Section~\ref{section:conclusion}.
%When % then explain how to construct D-Cliques and show their benefits (Section~\ref{section:d-cliques}). We show how to further reduce bias with Clique Averaging (Section~\ref{section:clique-averaging}). We then show how to use Clique Averaging to implement momentum (Section~\ref{section:momentum}). Having shown the effectiveness of D-Cliques, we evaluate the importance of clustering (Section~\ref{section:non-clustered}), and full intra-clique connections (Section~\ref{section:intra-clique-connectivity}). Having established the design, we then study how best to scale it (Section~\ref{section:interclique-topologies}). We conclude with a survey of related work (Section~\ref{section:related-work}) and a brief summary of the paper (Section~\ref{section:conclusion}).