Commit f4513e69 authored by aurelien.bellet

update abstract/intro to use heterogeneity

parent 0c409353
@@ -19,15 +19,14 @@ confidentiality concerns~\cite{kairouz2019advances}.
Yet, working with natural data distributions introduces new challenges for
learning systems, as
local datasets
reflect the usage and production patterns specific to each participant: they are
\emph{not} independent and identically distributed
(non-IID). In the context of classification problems, the
relative
frequency of different classes of examples may significantly vary
across local datasets, a situation known as \emph{label distribution skew}
\cite{kairouz2019advances,quagmire}.
Therefore, one of the key challenges in FL is to design algorithms that
can efficiently deal with such non-IID data distributions
reflect the usage and production patterns specific to each participant: in
other words, they are
\emph{heterogeneous}. An important type of data heterogeneity encountered in
classification problems, known as \emph{label distribution skew}
\cite{kairouz2019advances,quagmire}, occurs when the frequency of different
classes of examples varies significantly across local datasets.
One of the key challenges in FL is to design algorithms that
can efficiently deal with such heterogeneous data distributions
\cite{kairouz2019advances,fedprox,scaffold,quagmire}.
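
To make label distribution skew concrete, here is a minimal Python sketch of how it is typically simulated (an illustration under assumed parameters, not the paper's experimental setup): each node draws its local dataset from only a few classes, so per-node class frequencies diverge sharply from the global distribution. The node count, classes per node, and shard size are invented values.

import random
from collections import defaultdict

def skewed_partition(labels, n_nodes=100, classes_per_node=2, shard_size=50, seed=0):
    """Assign example indices to nodes so that each node sees only a few classes."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    classes = sorted(by_class)
    partition = []
    for _ in range(n_nodes):
        # A small random subset of classes per node creates the skew;
        # shards may overlap across nodes, which is fine for illustration.
        chosen = rng.sample(classes, classes_per_node)
        shard = [i for c in chosen for i in rng.sample(by_class[c], shard_size)]
        partition.append(shard)
    return partition  # assumes every class holds >= shard_size examples

With MNIST's ten classes and classes_per_node=2, for instance, each local dataset covers only a fifth of the label space, a setting akin to the skew studied below.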
Federated learning algorithms can be classified into two categories depending
@@ -46,13 +45,15 @@ generally scale better to the large number of participants seen in ``cross-device''
applications \cite{kairouz2019advances}. Indeed, while a central
server may quickly become a bottleneck as the number of participants increases, the topology used in fully decentralized algorithms can remain sparse
enough that all participants need only communicate with a small number of other participants, i.e., nodes have a small (constant or logarithmic) degree
\cite{lian2017d-psgd}. For IID data, recent work has shown both empirically
\cite{lian2017d-psgd}. In the homogeneous setting where data is
independent and identically distributed (IID) across nodes, recent work
has shown both empirically
\cite{lian2017d-psgd,Lian2018} and theoretically \cite{neglia2020} that sparse
topologies like rings or grids
do not significantly affect the convergence
speed compared to using denser topologies.
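
The constant-degree claim is easy to see in code. Below is a minimal sketch of one decentralized SGD round on a ring, in the spirit of \cite{lian2017d-psgd} (the names and the uniform neighbor averaging are assumptions, not code from the paper): each node takes a local gradient step, then averages its model with its two ring neighbors only, so per-round communication per node is constant regardless of network size.

import numpy as np

def ring_neighbors(n_nodes):
    """On a ring, node i is linked to (i-1) and (i+1) mod n: degree 2 everywhere."""
    return {i: [(i - 1) % n_nodes, (i + 1) % n_nodes] for i in range(n_nodes)}

def dsgd_round(models, grads, neighbors, lr=0.1):
    """One round: local SGD step at every node, then neighborhood averaging."""
    stepped = [m - lr * g for m, g in zip(models, grads)]
    return [np.mean([stepped[i]] + [stepped[j] for j in neighbors[i]], axis=0)
            for i in range(len(models))]

Swapping ring_neighbors for a fully-connected graph would turn each round into O(n) messages per node, which is exactly the cost a sparse topology avoids.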
\begin{figure*}[ht]
\begin{figure*}[t]
\centering
% From directory results/mnist
@@ -60,7 +61,7 @@ speed compared to using denser topologies.
\begin{subfigure}[b]{0.25\textwidth}
\centering
\includegraphics[width=\textwidth]{../figures/ring-IID-vs-non-IID}
\caption{\label{fig:ring-IID-vs-non-IID} Ring}
\caption{\label{fig:ring-IID-vs-non-IID} Ring topology}
\end{subfigure}
\quad
% From directory results/mnist
@@ -68,7 +69,7 @@ speed compared to using denser topologies.
\begin{subfigure}[b]{0.25\textwidth}
\centering
\includegraphics[width=\textwidth]{../figures/grid-IID-vs-non-IID}
\caption{\label{fig:grid-IID-vs-non-IID} Grid}
\caption{\label{fig:grid-IID-vs-non-IID} Grid topology}
\end{subfigure}
\quad
% From directory results/mnist
@@ -76,25 +77,28 @@ speed compared to using denser topologies.
\begin{subfigure}[b]{0.25\textwidth}
\centering
\includegraphics[width=\textwidth]{../figures/fully-connected-IID-vs-non-IID}
\caption{\label{fig:fully-connected-IID-vs-non-IID} Fully-connected}
\caption{\label{fig:fully-connected-IID-vs-non-IID} Fully-connected topology}
\end{subfigure}
\caption{IID vs non-IID convergence speed of decentralized SGD for
logistic regression on
MNIST for different topologies. Bold lines show the average test
\caption{Convergence speed of decentralized
SGD with and without label distribution skew for different topologies.
The task is logistic regression on MNIST (see
Section~\ref{section:experimental-settings} for details on
the experimental setup). Bold lines show the
average test
accuracy across nodes
while thin lines show the minimum
and maximum accuracy of individual nodes. While the effect of topology
is negligible for IID data, it is very significant in the
non-IID case. When fully-connected, both cases converge similarly. See
Section~\ref{section:experimental-settings} for details on
the experimental setup.}
is negligible for homogeneous data, it is very significant in the
heterogeneous case. On a fully-connected network, both cases converge
similarly.}
\label{fig:iid-vs-non-iid-problem}
\end{figure*}
In contrast to the IID case however, our experiments demonstrate that \emph{the impact of topology is extremely significant for non-IID data}. This phenomenon is illustrated
in Figure~\ref{fig:iid-vs-non-iid-problem}: we observe that under
\todo{AB: update fig legend to not use (non)IID terms}
In contrast to the homogeneous case, however, our experiments demonstrate that
\emph{the impact of topology is extremely significant for heterogeneous data}.
This phenomenon is illustrated in Figure~\ref{fig:iid-vs-non-iid-problem}: we observe that under
label distribution skew, using a
sparse topology (a ring or
a grid) clearly jeopardizes the convergence speed of decentralized SGD.
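
One way to read this result (a sketch in assumed notation, not an argument from the paper): let $F_i$ be node $i$'s local objective and $F$ the global average objective. Under label distribution skew, each local gradient $\nabla F_i(x)$ is a biased estimate of the global gradient, and on a sparse topology a node averages with only a handful of neighbors, so the bias is corrected slowly. The clique-based design introduced below instead aims for approximate unbiasedness at the level of each clique $C$:
\[
\nabla F(x) \;=\; \frac{1}{n}\sum_{i=1}^{n} \nabla F_i(x),
\qquad
\frac{1}{|C|}\sum_{i \in C} \nabla F_i(x) \;\approx\; \nabla F(x),
\]
which holds when the joint label distribution of $C$ is close to the global one.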
@@ -113,7 +117,7 @@ Specifically, we make the following contributions:
(1) We propose D-Cliques, a sparse topology in which nodes are organized in
interconnected cliques, i.e., locally fully-connected sets of nodes, such that
the joint label distribution of each clique is close to that of the global
(IID) distribution; (2) We design a greedy algorithm for
distribution; (2) We design a greedy algorithm for
constructing such cliques efficiently;
% in the presence of heterogeneity previously studied
% in the context of Federated Learning~\cite{mcmahan2016communication};
......
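
As a rough illustration of what a greedy construction in the spirit of contribution (2) could look like (a hypothetical sketch, not the authors' algorithm; node_label_counts and clique_size are assumed inputs), one can grow each clique by repeatedly adding the unassigned node that brings the clique's joint label distribution closest to the global one:

import numpy as np

def greedy_cliques(node_label_counts, clique_size=10):
    """Group nodes into cliques whose joint label distribution tracks the global one."""
    counts = {n: np.asarray(c, dtype=float) for n, c in node_label_counts.items()}
    global_dist = sum(counts.values())
    global_dist = global_dist / global_dist.sum()
    unassigned, cliques = set(counts), []
    while unassigned:
        clique, total = [], np.zeros_like(global_dist)
        while unassigned and len(clique) < clique_size:
            # Pick the node whose counts move the clique's label
            # distribution closest (in L1 distance) to the global one.
            def gap(n):
                d = total + counts[n]  # assumes every node holds >= 1 example
                return np.abs(d / d.sum() - global_dist).sum()
            best = min(unassigned, key=gap)
            clique.append(best)
            total += counts[best]
            unassigned.remove(best)
        cliques.append(clique)
    return cliques

Within each resulting clique, nodes would be fully connected, with only sparse links across cliques, matching the topology described in contribution (1).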
@@ -42,8 +42,8 @@
\begin{document}
\twocolumn[
\mlsystitle{D-Cliques: Compensating Data Heterogeneity with Topology in Decentralized
Federated Learning}
\mlsystitle{D-Cliques: Compensating for Data Heterogeneity with Topology in
Decentralized Federated Learning}
% It is OKAY to include author information, even for blind
% submissions: the style file will automatically remove it for you
@@ -84,10 +84,9 @@ Non-IID Data, Stochastic Gradient Descent}
%Abstracts must be a single paragraph, ideally between 4--6 sentences long.
%Gross violations will trigger corrections at the camera-ready phase.
The convergence speed of machine learning models trained with Federated
Learning is significantly affected by non-independent and identically
distributed (non-IID) data partitions, even more so in a fully decentralized
setting without a central server. In this paper, we show that the impact of
label distribution skew, an important type of data non-IIDness, can be
Learning is significantly affected by heterogeneous data partitions, even more
so in a fully decentralized setting without a central server. In this paper, we show that the impact of
label distribution skew, an important type of data heterogeneity, can be
significantly reduced by carefully designing
the underlying communication topology. We present D-Cliques, a novel topology
that reduces gradient bias by grouping nodes in sparsely interconnected
......