Federated learning algorithms can be classified into two categories depending
on the underlying network topology they run on. In server-based FL, the
network is organized according to a star topology: a central server
orchestrates the training process by iteratively aggregating model updates
received from the participants (\emph{clients}) and sending back the
aggregated model \cite{mcmahan2016communication}. In contrast, fully
decentralized FL algorithms operate over an arbitrary network topology where
participants communicate only with their direct neighbors in the network. A
classic example of such algorithms is Decentralized
...
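To fix ideas, one round of such a fully decentralized algorithm can be
sketched as follows; the notation below ($x_i$, $\gamma$, $W_{ij}$, $F_i$,
$s_i^{(t)}$, $\mathcal{N}_i$) is introduced here purely for illustration.
Each node $i$ first performs a local stochastic gradient step on its own
data and then averages the result with its neighbors $\mathcal{N}_i$
according to non-negative mixing weights $W_{ij}$ summing to one:
\begin{align*}
x_i^{(t+\frac{1}{2})} &= x_i^{(t)} - \gamma\, \nabla F_i\big(x_i^{(t)}; s_i^{(t)}\big),\\
x_i^{(t+1)} &= \sum_{j \in \mathcal{N}_i \cup \{i\}} W_{ij}\, x_j^{(t+\frac{1}{2})},
\end{align*}
where $s_i^{(t)}$ is a mini-batch drawn from node $i$'s local data
distribution and $\gamma$ is the learning rate.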
% We also note that full decentralization can also provide benefits in terms of
% privacy protection \cite{amp_dec}.
In contrast to the IID case, however, our experiments demonstrate that
\emph{the impact of topology is extremely significant for non-IID data}.
This phenomenon is illustrated in Figure~\ref{fig:iid-vs-non-iid-problem}:
we observe that a ring or a grid topology clearly jeopardizes the
convergence speed when the relative class frequencies of the local
distributions differ from those of the global distribution, i.e. when nodes
exhibit \textit{local class bias}. We stress that, unlike in centralized FL
\cite{kairouz2019advances,scaffold,quagmire}, this
happens even when nodes perform a single local update before averaging the
model with their neighbors. In this paper, we address the following question:
% \textit{Are there regular topologies, i.e. where all nodes have similar or the same number of neighbours, with less connections than a fully-connected graph that retain a similar convergence speed and non-IID behaviour?}
%\textit{Are there sparse topologies with similar convergence speed as the fully connected graph under a large number of participants with local class bias?}
\textit{Can we design sparse topologies whose convergence speed is similar
to that of a fully connected network, for a large number of participants
with local class bias?}
%AMK: do we talk about local class bias or noniidness?
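For reference, the notion of local class bias used in the question above can
be stated informally as follows; this is only an illustrative restatement of
the description above, with the notation $p_i$ and $p$ introduced here for
that purpose. Writing $p_i(y)$ for the relative frequency of class $y$ in
node $i$'s local dataset and $p(y)$ for its relative frequency in the global
dataset, node $i$ exhibits local class bias whenever
\[
p_i(y) \neq p(y) \quad \text{for some class } y,
\]
while the IID case corresponds to $p_i \approx p$ for all nodes.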
\begin{figure}[t]
    \centering
...
\label{fig:iid-vs-non-iid-problem}
\end{figure}
%Indeed, as we show with the following contributions:
Specifically, we make the following contributions:
(1) We propose D-Cliques, a sparse topology in which nodes are organized in
interconnected cliques, i.e. locally fully-connected sets of nodes, such that
the joint data distribution of each clique is representative of the global
distribution; (2) We propose Clique Averaging, a modified version of
the standard D-SGD algorithm which decouples gradient averaging, used for
optimizing local models, from distributed averaging, used to ensure all models
converge, thereby reducing the bias introduced by inter-clique connections
(sketched below);
(3) We show how Clique Averaging can be used to implement an unbiased version
of momentum, which would otherwise be detrimental in the non-IID setting;
(4) We demonstrate through an extensive experimental study that our approach
removes the effect of local class bias on the MNIST~\cite{mnistWebsite} and
CIFAR10~\cite{krizhevsky2009learning} datasets, for training a linear model
and a deep convolutional network; (5) Finally, we demonstrate the scalability
of our approach by considering up to 1000-node networks, in contrast to most
previous work on fully decentralized learning that considers only a few tens
of participants~\cite{tang18a,neglia2020,momentum_noniid,cross_gradient,consensus_distance}.
%we show that these results hold up to 1000 participants, in contrast to most previous work on fully decentralized algorithms that considers only a few tens of participants \cite{tang18a,more_refs}.
For instance, our results show that using D-Cliques in a 1000-node network
requires 98\% fewer edges ($18.9$ vs $999$ edges per participant on average),
thereby yielding a 96\% reduction in the total number of required messages
(37.8 messages per round per node on average instead of 999), while obtaining
a convergence speed similar to that of a fully-connected topology. Furthermore,
an additional 22\% improvement
% (14.5 edges per node on average instead of 18.9)
is possible when using a small-world inter-clique topology, with further
potential gains at larger scales thanks to its linear-logarithmic scaling.
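To give a rough picture of the decoupling performed by Clique Averaging
(contribution (2) above), one round can be sketched as follows; this is only
an illustrative reading of the description above, reusing the notation
$x_i$, $\gamma$, $W_{ij}$, $F_j$, $s_j^{(t)}$, $\mathcal{N}_i$ introduced
earlier and writing $C(i)$ for the clique of node $i$ (including $i$
itself); see Section~\ref{section:clique-averaging-momentum} for the precise
algorithm. Each node averages gradients only within its clique, whose joint
distribution is representative of the global one, while model parameters are
still averaged over all neighbors:
\begin{align*}
x_i^{(t+\frac{1}{2})} &= x_i^{(t)} - \gamma\, \frac{1}{|C(i)|} \sum_{j \in C(i)} \nabla F_j\big(x_j^{(t)}; s_j^{(t)}\big),\\
x_i^{(t+1)} &= \sum_{j \in \mathcal{N}_i \cup \{i\}} W_{ij}\, x_j^{(t+\frac{1}{2})},
\end{align*}
so that inter-clique edges contribute to the averaging of model parameters
but not to the gradient estimate, reducing the bias they would otherwise
introduce.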
...
Section~\ref{section:clique-averaging-momentum}, we
show how to further reduce bias with Clique Averaging and how to use it to
implement momentum. We present the results of our extensive experimental
study in Section~\ref{section:non-clustered}. We review related work in
Section~\ref{section:related-work}, and conclude with promising directions
for future work in Section~\ref{section:conclusion}.
%When % then explain how to construct D-Cliques and show their benefits (Section~\ref{section:d-cliques}). We show how to further reduce bias with Clique Averaging (Section~\ref{section:clique-averaging}). We then show how to use Clique Averaging to implement momentum (Section~\ref{section:momentum}). Having shown the effectiveness of D-Cliques, we evaluate the importance of clustering (Section~\ref{section:non-clustered}), and full intra-clique connections (Section~\ref{section:intra-clique-connectivity}). Having established the design, we then study how best to scale it (Section~\ref{section:interclique-topologies}). We conclude with a survey of related work (Section~\ref{section:related-work}) and a brief summary of the paper (Section~\ref{section:conclusion}).