provides similar convergence speed as a fully-connected topology with a
significant reduction in the number of edges and messages. In a 1000-node
topology, D-Cliques requires 98\% fewer edges and 96\% fewer total messages,
with further possible gains using a small-world topology across cliques.
% Our study paves the way for tackling more general types of data non-IIDness
% through the design of appropriate topologies.
\keywords{Decentralized Learning \and Federated Learning \and Topology \and
Non-IID Data \and Stochastic Gradient Descent}
\section{Introduction}
Machine learning is currently shifting from a \emph{centralized}
paradigm, in which models are trained on data located on a single machine or
in a data center, to \emph{decentralized} ones.
Indeed, the latter paradigm closely matches the natural data distribution
in the numerous use cases where data is collected and processed by several
independent parties (hospitals, companies, personal devices, etc.).
Federated Learning (FL) allows a set
of participants to collaboratively train machine learning models
on their joint
data while keeping it where it has been produced. Not only does this avoid
the costs of moving data, but it also mitigates privacy and confidentiality concerns~\cite{kairouz2019advances}.
Yet, such a data distribution introduces new challenges for learning systems, as
local datasets
reflect the usage and production patterns specific to each participant: they are
\emph{not} independent and identically distributed
(non-IID). More specifically, the relative frequency of different classes of examples may significantly vary
across local datasets \cite{quagmire}.
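As a minimal illustration of such label skew (a toy example of ours, not the
exact partitioning scheme used in our experiments), the following Python
snippet builds local datasets in which each node only holds examples of a
single class:
\begin{verbatim}
# Toy label-skew partition (illustration only); assumes that the number
# of nodes is at least the number of classes.
import numpy as np

def single_class_partition(labels, n_nodes, seed=0):
    rng = np.random.default_rng(seed)
    classes = np.unique(labels)
    partition = [[] for _ in range(n_nodes)]
    for k, c in enumerate(classes):
        # Nodes assigned to class c share its examples evenly.
        owners = [i for i in range(n_nodes) if i % len(classes) == k]
        idx = rng.permutation(np.where(labels == c)[0])
        for owner, chunk in zip(owners, np.array_split(idx, len(owners))):
            partition[owner] = chunk.tolist()
    return partition
\end{verbatim}
Under such a partition, the class proportions of every local dataset differ
maximally from the global ones.
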
A key challenge is therefore to design algorithms that
can efficiently deal with such non-IID data
\cite{kairouz2019advances,fedprox,scaffold,quagmire}.
Federated learning algorithms can be classified into two categories depending
on the underlying network topology they run on. In server-based FL, the
network is organized according to a star topology: a central server orchestrates the training process by
iteratively aggregating model updates received from the participants
(\emph{clients}) and sending
them back the aggregated model \cite{mcmahan2016communication}. In contrast,
fully decentralized FL algorithms operate over an arbitrary graph topology
where participants communicate only with their direct neighbors
in the graph. A classic example of such algorithms is Decentralized
SGD (D-SGD) \cite{lian2017d-psgd}, in which participants alternate between
local SGD updates and model averaging with neighboring nodes.
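In our notation (introduced here only for illustration and not necessarily
matching that of \cite{lian2017d-psgd}), one round of D-SGD at node $i$ can
be sketched as a local gradient step followed by a weighted average over
neighbors:
% Illustrative sketch of one D-SGD round (requires amsmath for align*).
\begin{align*}
x_i^{(t+\frac{1}{2})} &= x_i^{(t)} - \gamma \nabla F\big(x_i^{(t)}; \xi_i^{(t)}\big), \\
x_i^{(t+1)} &= \sum_{j \in N_i \cup \{i\}} W_{ij}\, x_j^{(t+\frac{1}{2})},
\end{align*}
where $x_i^{(t)}$ is the local model of node $i$ at step $t$, $\gamma$ the
learning rate, $\xi_i^{(t)}$ a mini-batch drawn from node $i$'s local data,
$N_i$ the set of neighbors of $i$, and $W$ a mixing matrix whose nonzero
entries correspond to the edges of the topology.
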
Fully decentralized algorithms generally scale better to the large number of participants seen in ``cross-device''
applications \cite{kairouz2019advances}. Indeed, while a central
server may quickly become a bottleneck as the number of participants increases, the topology used in fully decentralized algorithms can remain sparse
enough that all participants only need to communicate with a small number of other participants, i.e. nodes have a small (constant or logarithmic) degree
\cite{lian2017d-psgd}. For IID data, recent work has shown both empirically
\cite{lian2017d-psgd,Lian2018} and theoretically \cite{neglia2020} that sparse
topologies like rings or grids do not significantly affect the convergence
speed compared to using denser topologies.
% We also note that full decentralization can also provide benefits in terms of
% privacy protection \cite{amp_dec}.
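To make the notion of sparsity concrete (using our own notation, with $n$
the number of nodes and assuming a torus for the grid), these standard
topologies have the following per-node degree and total number of edges:
\[
\text{ring: } 2 \text{ and } n, \qquad
\text{grid: } 4 \text{ and } 2n, \qquad
\text{fully-connected: } n-1 \text{ and } \tfrac{n(n-1)}{2}.
\]
The number of messages exchanged per round thus grows linearly with $n$ for
rings and grids, but quadratically for a fully-connected topology.
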
\begin{figure}[t]
\centering
% From directory results/mnist
\begin{subfigure}[b]{0.31\textwidth}
\centering
\includegraphics[width=\textwidth]{figures/ring-IID-vs-non-IID}
\caption{\label{fig:ring-IID-vs-non-IID} Ring}
\end{subfigure}
\quad
% From directory results/mnist
\begin{subfigure}[b]{0.31\textwidth}
\centering
\includegraphics[width=\textwidth]{figures/grid-IID-vs-non-IID}
\caption{\label{fig:grid-IID-vs-non-IID} Grid}
\end{subfigure}
\quad
% From directory results/mnist
\begin{subfigure}[b]{0.31\textwidth}
\centering
\includegraphics[width=\textwidth]{figures/fully-connected-IID-vs-non-IID}
\caption{\label{fig:fully-connected-IID-vs-non-IID} Fully-connected}
\end{subfigure}
\caption{IID vs non-IID convergence speed of decentralized SGD for
logistic regression on
MNIST for different topologies. Bold lines show the average test
accuracy across nodes
while thin lines show the minimum
and maximum accuracy of individual nodes. While the effect of topology
is negligible for IID data, it is very significant in the
non-IID case. When fully-connected, both cases converge similarly. See
Section~\ref{section:experimental-settings} for details on
the experimental setup.}
\label{fig:iid-vs-non-iid-problem}
\end{figure}
In contrast to the IID case, however, our experiments demonstrate that \emph{the
impact of topology is extremely significant for non-IID data}. This phenomenon is illustrated
in Figure~\ref{fig:iid-vs-non-iid-problem}: we observe that a ring or
a grid topology clearly jeopardizes the convergence speed when the relative
class frequencies of local distributions differ from those of the global
distribution, i.e. when nodes have a \textit{local class bias}. We stress that, unlike in centralized FL
\cite{kairouz2019advances,scaffold,quagmire}, this
happens even when nodes perform a single local update before averaging the
model with their neighbors. In this paper, we address the following question:
% \textit{Are there regular topologies, i.e. where all nodes have similar or the same number of neighbours, with less connections than a fully-connected graph that retain a similar convergence speed and non-IID behaviour?}
%\textit{Are there sparse topologies with similar convergence speed as the fully connected graph under a large number of participants with local class bias?}
\textit{Can we design sparse topologies whose convergence
speed is similar to that of a fully-connected graph for
a large number of participants with local class bias?}
%AMK: do we talk about local class bias or noniidness?
%Indeed, as we show with the following contributions:
In this paper, we make the following contributions:
(1) We propose D-Cliques, a sparse topology in which nodes are organized in
interconnected cliques, i.e. locally fully-connected sets of nodes, such that
the joint data distribution of each clique is representative of the global
(IID) distribution; (2) We propose Clique Averaging, a modified version of
the standard D-SGD algorithm which decouples gradient averaging, used for
optimizing local models, from distributed averaging, used to ensure all models
converge, thereby reducing the bias introduced by inter-clique connections;
(3) We show how Clique Averaging can be used to implement unbiased momentum
that would otherwise be detrimental in a non-IID setting; (4) We demonstrate
through an extensive experimental study that our approach removes the effect
of the local class bias both for the MNIST~\cite{mnistWebsite} and
CIFAR10~\cite{krizhevsky2009learning} datasets, for training a linear model and a deep
convolutional network; (5) Finally, we demonstrate the scalability of our
approach by considering networks of up to 1000 nodes, in contrast to most
previous work on fully decentralized learning that considers only a few tens
of participants~\cite{tang18a,neglia2020,momentum_noniid,cross_gradient,consensus_distance}.
%we show that these results hold up to 1000 participants, in contrast to most previous work on fully decentralized algorithms that considers only a few tens of participants \cite{tang18a,more_refs}.
For instance, our results show that using D-Cliques in a 1000-node network
requires 98\% fewer edges ($18.9$ vs $999$ edges per participant on average), thus yielding a 96\% reduction in the total number of required messages (37.8 messages per round per node on average instead of 999) to obtain a similar convergence speed as a fully-connected topology. Furthermore, an additional 22\% improvement
% (14.5 edges per node on average instead of 18.9)
is possible when using a small-world inter-clique topology, with further potential gains at larger scales because of its linear-logarithmic scaling.
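To give a first intuition of how contribution (1) translates into a
topology, the following simplified Python sketch (an illustration of ours,
assuming each node holds data from a single majority class; the actual
construction is described in Section~\ref{section:d-cliques}) groups nodes
into cliques that cover as many distinct classes as possible:
\begin{verbatim}
# Simplified, illustrative grouping of nodes into class-diverse cliques;
# not the exact construction described later in the paper.
from collections import defaultdict

def build_cliques(node_majority_class, clique_size=10):
    by_class = defaultdict(list)
    for node, c in node_majority_class.items():
        by_class[c].append(node)
    cliques, current = [], []
    # Round-robin over classes: repeatedly take one node of each class.
    while any(by_class.values()):
        for c in sorted(by_class):
            if by_class[c]:
                current.append(by_class[c].pop())
                if len(current) == clique_size:
                    cliques.append(current)
                    current = []
    if current:
        cliques.append(current)  # last clique may be smaller
    return cliques
\end{verbatim}
For example, with 1000 nodes, 10 classes and cliques of size 10, this yields
100 cliques of 10 fully-connected nodes; if, in addition, every pair of
cliques is connected by a single edge, each node has $9 + 99/10 = 18.9$
edges on average, consistent with the figure quoted above.
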
The rest of this paper is organized as follows. We first present the problem
statement and our methodology (Section~\ref{section:problem}). The D-Cliques
design is presented in Section~\ref{section:d-cliques} along with an
empirical illustration of its benefits. In
Section~\ref{section:clique-averaging-momentum}, we
show how to further reduce bias with Clique Averaging and how to use it to
implement momentum. We present the results of our extensive experimental
study in Section~\ref{section:non-clustered}. We review some related work in
Section~\ref{section:related-work}, and conclude with promising directions
for future work in Section~\ref{section:conclusion}.
%When % then explain how to construct D-Cliques and show their benefits (Section~\ref{section:d-cliques}). We show how to further reduce bias with Clique Averaging (Section~\ref{section:clique-averaging}). We then show how to use Clique Averaging to implement momentum (Section~\ref{section:momentum}). Having shown the effectiveness of D-Cliques, we evaluate the importance of clustering (Section~\ref{section:non-clustered}), and full intra-clique connections (Section~\ref{section:intra-clique-connectivity}). Having established the design, we then study how best to scale it (Section~\ref{section:interclique-topologies}). We conclude with a survey of related work (Section~\ref{section:related-work}) and a brief summary of the paper (Section~\ref{section:conclusion}).
%\footnotetext{This is different from the accuracy of the average model across nodes that is sometimes used once training is completed.}
\section{Problem Statement}
\section{Optimizing with Clique Averaging}
\label{section:clique-averaging-momentum}
In this section, we present Clique Averaging, a feature that further reduces the bias introduced by non-IID data.
%AMK: check