Federated learning algorithms can be classified into two categories depending
on the underlying network topology they run on. In server-based FL, the
network is organized according to a star topology: a central server
orchestrates the training process by iteratively aggregating model updates
received from the participants (\emph{clients}) and sending back the
aggregated model \cite{mcmahan2016communication}. In contrast, fully
decentralized FL algorithms operate over an arbitrary network topology where
participants communicate only with their direct neighbors in the network. A
classic example of such algorithms is Decentralized
...
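To fix ideas, one round of such a fully decentralized algorithm can be
sketched as follows; the notation below ($x_i$, $\gamma$, $W_{ij}$, $F_i$,
$s_i^{(t)}$, $\mathcal{N}_i$) is introduced here purely for illustration.
Each node $i$ first performs a local stochastic gradient step on its own
data and then averages the result with its neighbors $\mathcal{N}_i$
according to non-negative mixing weights $W_{ij}$ summing to one:
\begin{align*}
x_i^{(t+\frac{1}{2})} &= x_i^{(t)} - \gamma\, \nabla F_i\big(x_i^{(t)}; s_i^{(t)}\big),\\
x_i^{(t+1)} &= \sum_{j \in \mathcal{N}_i \cup \{i\}} W_{ij}\, x_j^{(t+\frac{1}{2})},
\end{align*}
where $s_i^{(t)}$ is a mini-batch drawn from node $i$'s local data
distribution and $\gamma$ is the learning rate.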
% We also note that full decentralization can also provide benefits in terms of
% privacy protection \cite{amp_dec}.
In contrast to the IID case, however, our experiments demonstrate that
\emph{the impact of topology is extremely significant for non-IID data}.
This phenomenon is illustrated in Figure~\ref{fig:iid-vs-non-iid-problem}:
we observe that a ring or a grid topology clearly jeopardizes the
convergence speed when the relative class frequencies of the local
distributions differ from those of the global distribution, i.e. when nodes
exhibit \textit{local class bias}. We stress that, unlike in centralized FL
\cite{kairouz2019advances,scaffold,quagmire}, this
happens even when nodes perform a single local update before averaging the
model with their neighbors. In this paper, we address the following question:
% \textit{Are there regular topologies, i.e. where all nodes have similar or the same number of neighbours, with less connections than a fully-connected graph that retain a similar convergence speed and non-IID behaviour?}
%\textit{Are there sparse topologies with similar convergence speed as the fully connected graph under a large number of participants with local class bias?}
\textit{Can we design sparse topologies whose convergence speed is similar
to that of a fully connected network, for a large number of participants
with local class bias?}
%AMK: do we talk about local class bias or noniidness?
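For reference, the notion of local class bias used in the question above can
be stated informally as follows; this is only an illustrative restatement of
the description above, with the notation $p_i$ and $p$ introduced here for
that purpose. Writing $p_i(y)$ for the relative frequency of class $y$ in
node $i$'s local dataset and $p(y)$ for its relative frequency in the global
dataset, node $i$ exhibits local class bias whenever
\[
p_i(y) \neq p(y) \quad \text{for some class } y,
\]
while the IID case corresponds to $p_i \approx p$ for all nodes.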
\begin{figure}[t]
    \centering
...
\label{fig:iid-vs-non-iid-problem}
\end{figure}
%Indeed, as we show with the following contributions:
Specifically, we make the following contributions:
(1) We propose D-Cliques, a sparse topology in which nodes are organized in
interconnected cliques, i.e. locally fully-connected sets of nodes, such that
the joint data distribution of each clique is representative of the global
distribution; (2) We propose Clique Averaging, a modified version of
the standard D-SGD algorithm which decouples gradient averaging, used for
optimizing local models, from distributed averaging, used to ensure all models
converge, thereby reducing the bias introduced by inter-clique connections
(sketched below);
(3) We show how Clique Averaging can be used to implement an unbiased version
of momentum, which would otherwise be detrimental in the non-IID setting;
(4) We demonstrate through an extensive experimental study that our approach
removes the effect of local class bias on the MNIST~\cite{mnistWebsite} and
CIFAR10~\cite{krizhevsky2009learning} datasets, for training a linear model
and a deep convolutional network; (5) Finally, we demonstrate the scalability
of our approach by considering up to 1000-node networks, in contrast to most
previous work on fully decentralized learning that considers only a few tens
of participants~\cite{tang18a,neglia2020,momentum_noniid,cross_gradient,consensus_distance}.
%we show that these results hold up to 1000 participants, in contrast to most previous work on fully decentralized algorithms that considers only a few tens of participants \cite{tang18a,more_refs}.
For instance, our results show that using D-Cliques in a 1000-node network
requires 98\% fewer edges ($18.9$ vs $999$ edges per participant on average),
thereby yielding a 96\% reduction in the total number of required messages
(37.8 messages per round per node on average instead of 999), while obtaining
a convergence speed similar to that of a fully-connected topology. Furthermore,
an additional 22\% improvement
% (14.5 edges per node on average instead of 18.9)
is possible when using a small-world inter-clique topology, with further
potential gains at larger scales thanks to its linear-logarithmic scaling.
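To give a rough picture of the decoupling performed by Clique Averaging
(contribution (2) above), one round can be sketched as follows; this is only
an illustrative reading of the description above, reusing the notation
$x_i$, $\gamma$, $W_{ij}$, $F_j$, $s_j^{(t)}$, $\mathcal{N}_i$ introduced
earlier and writing $C(i)$ for the clique of node $i$ (including $i$
itself); see Section~\ref{section:clique-averaging-momentum} for the precise
algorithm. Each node averages gradients only within its clique, whose joint
distribution is representative of the global one, while model parameters are
still averaged over all neighbors:
\begin{align*}
x_i^{(t+\frac{1}{2})} &= x_i^{(t)} - \gamma\, \frac{1}{|C(i)|} \sum_{j \in C(i)} \nabla F_j\big(x_j^{(t)}; s_j^{(t)}\big),\\
x_i^{(t+1)} &= \sum_{j \in \mathcal{N}_i \cup \{i\}} W_{ij}\, x_j^{(t+\frac{1}{2})},
\end{align*}
so that inter-clique edges contribute to the averaging of model parameters
but not to the gradient estimate, reducing the bias they would otherwise
introduce.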
...
Section~\ref{section:clique-averaging-momentum}, we
show how to further reduce bias with Clique Averaging and how to use it to
implement momentum. We present the results of our extensive experimental
study in Section~\ref{section:non-clustered}. We review related work in
Section~\ref{section:related-work}, and conclude with promising directions
for future work in Section~\ref{section:conclusion}.
%When % then explain how to construct D-Cliques and show their benefits (Section~\ref{section:d-cliques}). We show how to further reduce bias with Clique Averaging (Section~\ref{section:clique-averaging}). We then show how to use Clique Averaging to implement momentum (Section~\ref{section:momentum}). Having shown the effectiveness of D-Cliques, we evaluate the importance of clustering (Section~\ref{section:non-clustered}), and full intra-clique connections (Section~\ref{section:intra-clique-connectivity}). Having established the design, we then study how best to scale it (Section~\ref{section:interclique-topologies}). We conclude with a survey of related work (Section~\ref{section:related-work}) and a brief summary of the paper (Section~\ref{section:conclusion}).