Commit b8f98bcc authored by aurelien.bellet
typo abstract intro

parent 91a7d3bb

The convergence speed of machine learning models trained with Federated
Learning is significantly affected by non-independent and identically
distributed (non-IID) data partitions, even more so in a fully decentralized
setting without a central server. In this paper, we show that the impact of
\textit{local class bias}, an important type of data non-IIDness, can be
significantly reduced by carefully designing
the underlying communication topology. We present D-Cliques, a novel topology
that reduces gradient bias by grouping nodes in interconnected cliques such
that the local joint distribution in a clique is representative of the global
class distribution. We also show how to adapt the updates of decentralized SGD
to obtain unbiased gradients and implement an effective momentum with
D-Cliques. Our
empirical evaluation on MNIST and CIFAR10 demonstrates that our approach
provides a convergence speed similar to that of a fully-connected topology with a
significant reduction in the number of edges and messages. In a 1000-node

of participants to collaboratively train machine learning models
on their joint
data while keeping it where it has been produced. Not only does this avoid
the costs of moving data, but it also mitigates privacy and confidentiality concerns~\cite{kairouz2019advances}.
Yet, working with natural data distributions introduces new challenges for
learning systems, as
local datasets
reflect the usage and production patterns specific to each participant: they are
\emph{not} independent and identically distributed
(non-IID). More specifically, the relative frequency of different classes of examples may significantly vary
across local datasets \cite{kairouz2019advances,quagmire}.
Therefore, one of the key challenges in FL is to design algorithms that
can efficiently deal with such non-IID data distributions
\cite{kairouz2019advances,fedprox,scaffold,quagmire}.
Federated learning algorithms can be classified into two categories depending
on the underlying network topology they run on. In server-based FL, the
network is organized according to a star topology: a central server orchestrates the training process by
iteratively aggregating model updates received from the participants
(\emph{clients}) and sending back the aggregated model \cite{mcmahan2016communication}. In contrast,
fully decentralized FL algorithms operate over an arbitrary network topology
where participants communicate only with their direct neighbors
in the network. A classic example of such algorithms is Decentralized

speed compared to using denser topologies.
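For concreteness, one synchronous round of D-SGD can be sketched in a few
lines of Python in its adapt-then-combine form: each node first takes a
gradient step on its own local data and then averages its parameters with
its direct neighbors. The dictionary-based data structures and names below
are illustrative choices on our part, not the notation of the original
algorithm:
\begin{verbatim}
def d_sgd_round(x, W, neighbors, grad, lr):
    """One synchronous D-SGD round (adapt-then-combine variant; sketch).

    x: dict node -> parameter vector
    W: mixing weights, W[i][j] > 0 iff j is a neighbor of i (incl. i),
       with sum of W[i][j] over j equal to 1 for every node i
    neighbors[i]: direct neighbors of node i in the topology, incl. i
    grad(i, p): stochastic gradient of node i's local loss at p
    """
    # each node first takes a gradient step on its own local data...
    half = {i: x[i] - lr * grad(i, x[i]) for i in x}
    # ...then averages the result with its direct neighbors only
    return {i: sum(W[i][j] * half[j] for j in neighbors[i]) for i in x}
\end{verbatim}
The only communication in this loop is the exchange of the intermediate
models among neighbors, which is why the topology directly controls the
number of edges and messages per round.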
% We also note that full decentralization can also provide benefits in terms of
% privacy protection \cite{amp_dec}.
In contrast to the IID case, however, our experiments demonstrate that
\emph{the impact of topology is extremely significant for non-IID data}.
This phenomenon is illustrated in Figure~\ref{fig:iid-vs-non-iid-problem}:
we observe that a ring or a grid topology clearly jeopardizes convergence
speed because the relative class frequencies of local distributions differ
from those of the global distribution, i.e., they exhibit \textit{local
class bias}. We stress that, unlike in centralized FL
\cite{kairouz2019advances,scaffold,quagmire}, this
happens even when nodes perform a single local update before averaging the
model with their neighbors. In this paper, we address the following question:
% \textit{Are there regular topologies, i.e. where all nodes have similar or the same number of neighbours, with less connections than a fully-connected graph that retain a similar convergence speed and non-IID behaviour?}
%\textit{Are there sparse topologies with similar convergence speed as the fully connected graph under a large number of participants with local class bias?}
\textit{Can we design sparse topologies whose convergence speed is similar
to that of a fully connected network, for a large number of participants
exhibiting local class bias?}
%AMK: do we talk about local class bias or noniidness?
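To make the notion concrete, local class bias can be quantified, for
instance, as the total variation distance between a node's empirical class
distribution and the global one. The short sketch below is one possible
such measure, chosen here for illustration rather than taken from a formal
definition in this paper:
\begin{verbatim}
import numpy as np

def class_distribution(labels, n_classes):
    # relative frequency of each class among the given labels
    counts = np.bincount(labels, minlength=n_classes).astype(float)
    return counts / counts.sum()

def local_class_bias(local_labels, global_labels, n_classes):
    # total variation distance between the local and global class
    # distributions: 0 for a perfectly IID shard, e.g. 0.9 for a
    # single-class node when 10 classes are globally balanced
    p = class_distribution(local_labels, n_classes)
    q = class_distribution(global_labels, n_classes)
    return 0.5 * np.abs(p - q).sum()
\end{verbatim}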
\begin{figure}[t]
\centering
\label{fig:iid-vs-non-iid-problem}
\end{figure}
%Indeed, as we show with the following contributions:
Specifically, we make the following contributions:
(1) We propose D-Cliques, a sparse topology in which nodes are organized in
interconnected cliques, i.e., locally fully-connected sets of nodes (see the
construction sketch after this list), such that
the joint data distribution of each clique is representative of the global

the standard D-SGD algorithm which decouples gradient averaging, used for
optimizing local models, from distributed averaging, used to ensure all models
converge, therefore reducing the bias introduced by inter-clique connections;
(3) We show how Clique Averaging can be used to implement unbiased momentum
that would otherwise be detrimental in the non-IID setting (also sketched
in code below); (4) We demonstrate
through an extensive experimental study that our approach removes the effect
of the local class bias on the MNIST~\cite{mnistWebsite} and CIFAR10~
\cite{krizhevsky2009learning} datasets, for training a linear model and a deep
convolutional network; (5) Finally, we demonstrate the scalability of our
approach by considering up to 1000-node networks, in contrast to most
previous work on fully decentralized learning that considers only a few tens
of nodes
\cite{tang18a,neglia2020,momentum_noniid,cross_gradient,consensus_distance}.
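The construction announced in contribution (1) can be pictured as a greedy
assignment: grow each clique by repeatedly adding the node that brings the
clique's joint class distribution closest to the global one, then fully
connect each clique internally. The code below is our illustrative
rendering of this idea, under those assumptions, not the exact procedure
evaluated in the paper:
\begin{verbatim}
import numpy as np

def build_cliques(node_dist, global_dist, clique_size):
    """Greedy sketch of D-Cliques construction (illustrative).

    node_dist: dict node -> class frequency vector (float array)
    global_dist: global class frequency vector
    returns a list of cliques (lists of node ids)
    """
    remaining = set(node_dist)
    cliques = []
    while remaining:
        clique = []
        mix = np.zeros_like(global_dist)  # running average over the clique
        while remaining and len(clique) < clique_size:
            # pick the node whose addition brings the clique's joint
            # distribution closest (in L1 distance) to the global one
            k = len(clique)
            best = min(remaining, key=lambda n: np.abs(
                (mix * k + node_dist[n]) / (k + 1) - global_dist).sum())
            clique.append(best)
            remaining.discard(best)
            mix = (mix * k + node_dist[best]) / (k + 1)
        cliques.append(clique)  # edges: all pairs within each clique
    return cliques
\end{verbatim}
The cliques are then interconnected with a comparatively small number of
extra edges, e.g., following the small-world pattern mentioned below.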
%we show that these results hold up to 1000 participants, in contrast to most previous work on fully decentralized algorithms that considers only a few tens of participants \cite{tang18a,more_refs}.
For instance, our results show that using D-Cliques in a 1000-node network
requires 98\% fewer edges ($18.9$ vs $999$ edges per participant on average),
thereby yielding a 96\% reduction in the total number of required messages
(37.8 messages per round per node on average instead of 999), to obtain a convergence speed similar to that of a fully-connected topology. Furthermore, an additional 22\% improvement
% (14.5 edges per node on average instead of 18.9)
is possible when using a small-world inter-clique topology, with further potential gains at larger scales because of its linear-logarithmic scaling.
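To make contributions (2) and (3) concrete, the sketch below, our
illustrative rendering with hypothetical names rather than the paper's
notation, decouples the two averaging steps: gradients are averaged over
the clique only, so that the mix of classes they reflect approximates the
global distribution, and momentum is applied to that unbiased gradient
before the usual parameter averaging over all neighbors:
\begin{verbatim}
def d_cliques_round(x, v, clique, neighbors, W, grad, lr, beta):
    """One D-SGD round with Clique Averaging and momentum (sketch).

    x: dict node -> parameters; v: dict node -> momentum buffer
    clique[i]: members of i's clique, incl. i
    neighbors[i]: all neighbors of i (clique + inter-clique), incl. i
    W: mixing weights over neighbors; beta: momentum coefficient
    """
    half, new_v = {}, {}
    for i in x:
        # gradient averaging restricted to the clique: since the clique's
        # joint class distribution approximates the global one, this
        # average is (approximately) unbiased despite local class bias
        g = sum(grad(j, x[j]) for j in clique[i]) / len(clique[i])
        new_v[i] = beta * v[i] + g   # momentum on the unbiased gradient
        half[i] = x[i] - lr * new_v[i]
    # distributed (parameter) averaging still uses all neighbors,
    # including inter-clique edges, so that all models converge
    new_x = {i: sum(W[i][j] * half[j] for j in neighbors[i]) for i in x}
    return new_x, new_v
\end{verbatim}
In this sketch, each node reads the gradients of all its clique members in
addition to their models, hinting at the extra intra-clique communication
that Clique Averaging entails.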

Section~\ref{section:clique-averaging-momentum}, we
show how to further reduce bias with Clique Averaging and how to use it to
implement momentum. We present the results of our extensive experimental
study in Section~\ref{section:non-clustered}. We review some related work in
Section~\ref{section:related-work}, and conclude with promising directions
for future work in Section~\ref{section:conclusion}.
%When % then explain how to construct D-Cliques and show their benefits (Section~\ref{section:d-cliques}). We show how to further reduce bias with Clique Averaging (Section~\ref{section:clique-averaging}). We then show how to use Clique Averaging to implement momentum (Section~\ref{section:momentum}). Having shown the effectiveness of D-Cliques, we evaluate the importance of clustering (Section~\ref{section:non-clustered}), and full intra-clique connections (Section~\ref{section:intra-clique-connectivity}). Having established the design, we then study how best to scale it (Section~\ref{section:interclique-topologies}). We conclude with a survey of related work (Section~\ref{section:related-work}) and a brief summary of the paper (Section~\ref{section:conclusion}).