Commit c1840025 authored by aurelien.bellet

minor changes in intro

parent d897a0af
@@ -3,24 +3,29 @@
\section{Introduction}
Machine learning is currently shifting from a \emph{centralized}
paradigm, where training data is located on a single machine or
in a data center, to \emph{decentralized} ones in which data is processed
where it was naturally produced.
This shift is illustrated by the rise of Federated Learning (FL). FL allows
several parties (hospitals, companies, personal devices...)
to collaboratively train machine learning models on their joint
data without centralizing it. Not only does FL
avoid the costs of moving data, but it also mitigates privacy and
confidentiality concerns~\cite{kairouz2019advances}.
Yet, working with natural data distributions introduces new challenges for
learning systems, as local datasets
reflect the usage and production patterns specific to each participant: they are
\emph{not} independent and identically distributed
(non-IID). In the context of classification problems, the relative
frequency of different classes of examples may significantly vary
across local datasets, a situation known as \emph{label distribution skew}
\cite{kairouz2019advances,quagmire}.
Therefore, one of the key challenges in FL is to design algorithms that
can efficiently deal with such non-IID data distributions
\cite{kairouz2019advances,fedprox,scaffold,quagmire}.
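To make the notion of label distribution skew concrete, the following Python sketch shows a shard-based partitioning scheme commonly used in the FL literature to simulate such skew: examples are sorted by label, cut into contiguous shards, and each node receives only a couple of shards, so its local class frequencies differ sharply from the global ones. The function and parameter names below are illustrative and do not refer to the experimental protocol of this paper.
\begin{verbatim}
import random

def label_skewed_partition(labels, num_nodes, shards_per_node=2, seed=0):
    # Sort example indices by label so that each contiguous shard
    # spans only a small number of classes.
    rng = random.Random(seed)
    order = sorted(range(len(labels)), key=lambda i: labels[i])
    num_shards = num_nodes * shards_per_node
    shard_size = len(order) // num_shards
    shards = [order[s * shard_size:(s + 1) * shard_size]
              for s in range(num_shards)]
    rng.shuffle(shards)
    # Each node gets a few shards: a non-IID, label-skewed local dataset.
    return [sum(shards[n * shards_per_node:(n + 1) * shards_per_node], [])
            for n in range(num_nodes)]
\end{verbatim}
For instance, with 100 nodes, 2 shards per node and a 10-class dataset, each node typically sees only a handful of classes, while the global class frequencies remain balanced.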
@@ -43,7 +48,8 @@ server may quickly become a bottleneck as the number of participants increases,
enough such that all participants need only to communicate with a small number
of other participants, i.e. nodes have small (constant or logarithmic) degree
\cite{lian2017d-psgd}. For IID data, recent work has shown both empirically
\cite{lian2017d-psgd,Lian2018} and theoretically \cite{neglia2020} that sparse
topologies like rings or grids do not significantly affect the convergence
speed compared to using denser topologies.
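For reference, in the decentralized algorithms discussed above (D-SGD and its variants), each node $i$ maintains a local model $x_i$ and alternates local stochastic gradient steps with averaging over its neighbors in the topology. Writing $W$ for a mixing matrix whose nonzero entries $W_{ij}$ are supported on the edges of the graph, one common form of the update is
\[
x_i^{(t+1)} \;=\; \sum_{j} W_{ij}\, x_j^{(t)} \;-\; \eta\, \nabla F_i\big(x_i^{(t)}; \xi_i^{(t)}\big),
\]
where $\xi_i^{(t)}$ is a mini-batch drawn from node $i$'s local dataset and $\eta$ is the learning rate; this is only meant to fix intuition, as several closely related variants exist (e.g. averaging after the local step).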
\begin{figure*}[ht]
@@ -88,11 +94,10 @@ speed compared to using denser topologies.
In contrast to the IID case, however, our experiments demonstrate that
\emph{the impact of topology is extremely significant for non-IID data}.
This phenomenon is illustrated
in Figure~\ref{fig:iid-vs-non-iid-problem}: we observe that under
label distribution skew, using a sparse topology (a ring or
a grid) clearly jeopardizes the convergence speed of decentralized SGD.
We stress the fact
that, unlike in centralized FL
\cite{mcmahan2016communication,scaffold,quagmire}, this
@@ -108,20 +113,24 @@ Specifically, we make the following contributions:
(1) We propose D-Cliques, a sparse topology in which nodes are organized in
interconnected cliques, i.e. locally fully-connected sets of nodes, such that
the joint label distribution of each clique is close to that of the global
(IID) distribution; (2) We design a greedy algorithm for
constructing such cliques efficiently;
% in the presence of heterogeneity previously studied
% in the context of Federated Learning~\cite{mcmahan2016communication};
(3) We introduce Clique Averaging, a modified version of
the standard D-SGD algorithm which decouples gradient averaging, used for
optimizing local models, from distributed averaging, used to ensure that all
models converge, thereby reducing the bias introduced by inter-clique
connections (sketched below);
(4) We show how Clique Averaging can be used to implement unbiased momentum
that would otherwise be detrimental in the non-IID setting; (5) We demonstrate
through an extensive experimental study that our approach removes the effect
of label distribution skew when training a linear
model and a deep
convolutional network on the MNIST %~\cite{mnistWebsite}
and CIFAR10 % ~\cite{krizhevsky2009learning}
datasets, respectively; (6) Finally, we demonstrate the scalability of our
approach by considering up to 1000-node networks, in contrast to most
previous work on fully decentralized learning that considers only a few tens
of nodes
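To give a flavor of contribution (3), here is a minimal Python sketch of one round of D-SGD with Clique Averaging at a single node, written directly from the description above. The decomposition into three steps, the uniform averaging weights, and the arguments (clique_grads, which is assumed to include the node's own gradient, and neighbor_models) are our illustrative assumptions; the exact algorithm evaluated in the experiments may differ.
\begin{verbatim}
def clique_averaging_step(model, clique_grads, neighbor_models, eta):
    # 1. Gradient averaging restricted to the node's clique.  Because the
    #    clique's joint label distribution is close to the global (IID) one,
    #    the averaged gradient gives a nearly unbiased update direction.
    avg_grad = [sum(g) / len(clique_grads) for g in zip(*clique_grads)]

    # 2. Local optimization step using the clique-averaged gradient.
    updated = [w - eta * g for w, g in zip(model, avg_grad)]

    # 3. Distributed (model) averaging over ALL neighbors, including
    #    inter-clique edges: this keeps the models converging to a common
    #    value without biasing the gradient direction used in step 2.
    everyone = neighbor_models + [updated]
    return [sum(ws) / len(everyone) for ws in zip(*everyone)]
\end{verbatim}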
@@ -132,7 +141,9 @@ requires 98\% fewer edges ($18.9$ vs $999$ edges per participant on average),
thereby yielding a 96\% reduction in the total number of required messages
(37.8 messages per round per node on average instead of 999), to obtain a
convergence speed similar to that of a fully-connected topology. Furthermore,
an additional 22\% improvement
% (14.5 edges per node on average instead of 18.9)
is possible when using a small-world inter-clique topology, with further
potential gains at larger scales through a quasilinear $O(n \log n)$ scaling
in the number of nodes $n$.
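The relation between these figures can be checked directly: the reported message count is consistent with each of the $18.9$ edges per node being used twice per round (for instance, one message in each direction), since
\[
2 \times 18.9 = 37.8, \qquad 1 - \tfrac{37.8}{999} \approx 0.96,
\]
i.e. roughly the 96\% reduction quoted above relative to the 999 messages per node per round of a fully-connected network of 1000 nodes; the factor of two is our reading of the figures, while the arithmetic itself follows from the numbers reported above.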
The rest of this paper is organized as follows \dots \todo{EL: Complete once structure stabilizes}
%We first present the problem