...
enough such that all participants need only communicate with a small number
of other participants, i.e., nodes have small (constant or logarithmic) degree
\cite{lian2017d-psgd}. For IID data, recent work has shown both empirically
\cite{lian2017d-psgd,Lian2018} and theoretically \cite{neglia2020} that sparse
topologies like rings or grids do not significantly affect the convergence
speed compared to denser topologies.
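To make this concrete, the following toy sketch (our own minimal example with
a synthetic quadratic loss, not taken from the cited works) simulates D-SGD
rounds on a ring, where every node has constant degree 2: each node takes a
local gradient step and then averages its model with its two neighbors.
\begin{verbatim}
import numpy as np

def ring_neighbors(i, n):
    # On a ring, every node has constant degree 2.
    return [(i - 1) % n, (i + 1) % n]

def dsgd_round(models, targets, lr=0.1):
    # Local gradient step on a toy quadratic loss (x - target)^2,
    # followed by uniform averaging with self and both ring neighbors.
    n = len(models)
    stepped = [x - lr * 2 * (x - t) for x, t in zip(models, targets)]
    return [(stepped[i] + sum(stepped[j] for j in ring_neighbors(i, n))) / 3
            for i in range(n)]

models = [np.zeros(2) for _ in range(8)]
targets = [np.full(2, float(i)) for i in range(8)]  # heterogeneous optima
for _ in range(500):
    models = dsgd_round(models, targets)
print(np.mean(models, axis=0))  # the network average reaches [3.5, 3.5],
                                # the minimizer of the average local loss
\end{verbatim}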
\begin{figure*}[ht]
...
In contrast to the IID case, however, our experiments demonstrate that
\emph{the impact of topology is extremely significant for non-IID data}. This
phenomenon is illustrated in Figure~\ref{fig:iid-vs-non-iid-problem}: we
observe that under label distribution skew, i.e., when the relative class
frequencies of local distributions differ from those of the global
distribution~\cite{kairouz2019advances}, using a sparse topology (such as a
ring or a grid) clearly jeopardizes the convergence speed of decentralized
SGD. We stress that, unlike in centralized FL
\cite{mcmahan2016communication,scaffold,quagmire}, this
...
Specifically, we make the following contributions:
(1) We propose D-Cliques, a sparse topology in which nodes are organized in
interconnected cliques, i.e., locally fully-connected sets of nodes, such that
the joint label distribution of each clique is close to the global (IID)
distribution; (2) We propose Greedy Swap, a greedy algorithm for constructing
such cliques efficiently (see the first sketch below);
% in the presence of heterogeneity previously studied
% in the context of Federated Learning~\cite{mcmahan2016communication};
(3) We introduce Clique Averaging, a modified version of
the standard D-SGD algorithm which decouples gradient averaging, used for
optimizing local models, from distributed averaging, used to ensure that all
models converge, thereby reducing the bias introduced by inter-clique
connections (see the second sketch below);
(4) We show how Clique Averaging can be used to implement unbiased momentum
that would otherwise be detrimental in the non-IID setting; (5) We demonstrate
through an extensive experimental study that our approach removes the effect
of label distribution skew when training a linear model and a deep
convolutional network on the MNIST~\cite{mnistWebsite} and
CIFAR10~\cite{krizhevsky2009learning} datasets, respectively; (6) Finally, we
demonstrate the scalability of our approach by considering networks of up to
1000 nodes, in contrast to most previous work on fully decentralized learning,
which considers only a few tens of nodes
...
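To give some intuition for contribution (2), here is a minimal sketch of a
swap-based greedy clique construction under extreme label skew. It is our own
simplified illustration (all function names and parameters are hypothetical),
not necessarily the exact algorithm evaluated later: starting from a random
partition into fixed-size cliques, repeatedly take the clique whose label
distribution deviates most from the global one, try swapping one of its nodes
with a node from a random other clique, and keep the swap only if it reduces
the total deviation.
\begin{verbatim}
import random
from collections import Counter

def skew(clique, labels, global_dist):
    # L1 distance between a clique's label distribution and the global one.
    counts = Counter(labels[i] for i in clique)
    return sum(abs(counts[c] / len(clique) - p)
               for c, p in global_dist.items())

def build_cliques(labels, clique_size, iters=2000, seed=0):
    rng = random.Random(seed)
    n = len(labels)
    global_dist = {c: k / n for c, k in Counter(labels).items()}
    nodes = list(range(n))
    rng.shuffle(nodes)
    cliques = [nodes[i:i + clique_size] for i in range(0, n, clique_size)]
    for _ in range(iters):
        # Try one swap between the most skewed clique and a random other.
        a = max(cliques, key=lambda c: skew(c, labels, global_dist))
        b = rng.choice([c for c in cliques if c is not a])
        before = skew(a, labels, global_dist) + skew(b, labels, global_dist)
        i, j = rng.randrange(len(a)), rng.randrange(len(b))
        a[i], b[j] = b[j], a[i]
        after = skew(a, labels, global_dist) + skew(b, labels, global_dist)
        if after >= before:
            a[i], b[j] = b[j], a[i]  # revert swaps that do not reduce skew
    return cliques

# 100 nodes, each holding data of a single class: extreme label skew.
labels = [i % 10 for i in range(100)]
cliques = build_cliques(labels, clique_size=10)
print(sorted(labels[i] for i in cliques[0]))  # ideally one node per class
\end{verbatim}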
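For contributions (3) and (4), the companion sketch below (again our own
simplified rendering with hypothetical interfaces) shows the decoupling at
the heart of Clique Averaging: gradients are averaged over clique members
only, whose joint data is close to IID, momentum is applied to that
clique-averaged gradient, and only the parameter averaging step uses all
edges, including the inter-clique ones.
\begin{verbatim}
def clique_averaging_round(models, momenta, grad_fn, cliques, mixing,
                           lr=0.01, gamma=0.9):
    # models[i], momenta[i]: model and momentum buffer of node i.
    # grad_fn(j, x): gradient of node j's local loss at model x.
    # cliques[i]: members of node i's clique (including i itself).
    # mixing[i][j]: gossip weight of edge (i, j); rows sum to one and
    # nonzero entries cover intra- AND inter-clique neighbors.
    n = len(models)
    # (1) Average gradients within the clique only: the clique's joint
    # label distribution is close to IID, so this estimate is nearly
    # unbiased with respect to the global objective.
    grads = [sum(grad_fn(j, models[j]) for j in cliques[i]) / len(cliques[i])
             for i in range(n)]
    # (2) Momentum on the clique-averaged gradient stays unbiased too.
    momenta = [gamma * m + g for m, g in zip(momenta, grads)]
    stepped = [x - lr * m for x, m in zip(models, momenta)]
    # (3) Average parameters over ALL neighbors, including inter-clique
    # edges, so that all models converge to a single model.
    models = [sum(mixing[i][j] * stepped[j] for j in range(n))
              for i in range(n)]
    return models, momenta

# Toy demo: 4 nodes, cliques {0,1} and {2,3}, inter-clique edges 0-2, 1-3.
cliques = {0: [0, 1], 1: [0, 1], 2: [2, 3], 3: [2, 3]}
mixing = [[.4, .3, .3, .0], [.3, .4, .0, .3],
          [.3, .0, .4, .3], [.0, .3, .3, .4]]
grad_fn = lambda j, x: 2 * (x - j)  # node j's local optimum sits at j
models, momenta = [0.0] * 4, [0.0] * 4
for _ in range(300):
    models, momenta = clique_averaging_round(models, momenta, grad_fn,
                                             cliques, mixing)
print(models)  # all models cluster around the global optimum 1.5
\end{verbatim}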
Our approach
requires 98\% fewer edges ($18.9$ vs.\ $999$ edges per participant on average),
thereby yielding a 96\% reduction in the total number of required messages
(37.8 messages per round per node on average instead of 999), to achieve a
convergence speed similar to that of a fully-connected topology. Furthermore,
an additional 22\% improvement
% (14.5 edges per node on average instead of 18.9)
is possible when using a small-world inter-clique topology, with further
potential gains at larger scales through a quasilinear $O(n \log n)$ scaling
in the number of nodes $n$.
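Regarding the small-world inter-clique topology, the sketch below shows one
classic construction consistent with the stated quasilinear edge count (the
exact wiring is our illustrative choice, not necessarily the one used in the
paper): cliques sit on a ring and each clique adds ``finger'' edges to cliques
at power-of-two offsets, so each clique carries $O(\log n)$ inter-clique edges
and the whole graph $O(n \log n)$.
\begin{verbatim}
def small_world_interclique_edges(num_cliques):
    # Ring plus power-of-two "finger" offsets: every clique gets
    # O(log n) inter-clique edges, so O(n log n) edges in total.
    edges = set()
    for i in range(num_cliques):
        offset = 1  # offset 1 is the ring edge to the next clique
        while offset < num_cliques:
            j = (i + offset) % num_cliques
            edges.add((min(i, j), max(i, j)))
            offset *= 2
    return sorted(edges)

print(small_world_interclique_edges(8))
# clique 0 connects to cliques 1, 2 and 4
\end{verbatim}
Each such clique-level edge would then be realized by connecting, for
instance, one node from each of the two cliques.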
The rest of this paper is organized as follows \dots\todo{EL: Complete once structure stabilizes}