diff --git a/main.tex b/main.tex
index 83330dbaddd46d4061560859c403930f0e9a969b..45e69d6ebf5aa713fc775e10ef3f650ba1e3d574 100644
--- a/main.tex
+++ b/main.tex
@@ -66,6 +66,8 @@
 provides similar convergence speed to a fully-connected topology with a
 significant reduction in the number of edges and messages. In a 1000-node
 topology, D-Cliques requires 98\% fewer edges and 96\% fewer total messages,
 with further possible gains using a small-world topology across cliques.
+% Our study paves the way for tackling more general types of data non-IIDness
+% through the design of appropriate topologies.
 \keywords{Decentralized Learning \and Federated Learning \and Topology \and
 Non-IID Data \and Stochastic Gradient Descent}
@@ -83,12 +85,18 @@ Non-IID Data \and Stochastic Gradient Descent}
 Machine learning is currently shifting from a \emph{centralized} paradigm, in which models are trained on data located on a single machine or in a data center, to \emph{decentralized} ones.
-Effectively, such paradigm matches the natural data distribution as data is collected by several independent parties (hospitals, companies, personal devices...) and trained on participants' devices.
+Effectively, the latter paradigm closely matches the natural data distribution
+in the numerous use cases where data is collected and processed by several
+independent parties (hospitals, companies, personal devices...).
 Federated Learning (FL) allows a set
-of data owners to collaboratively train machine learning models
+of participants to collaboratively train machine learning models
 on their joint
-data while keeping it where it has been produced. Not only, this avoids the costs of moving data but it also mitigates privacy and confidentiality concerns~\cite{kairouz2019advances}.
-Yet, such a data distribution challenges learning systems, as local datasets reflect the usage and production patterns specific to each participant: they are
+data while keeping it where it has been produced. Not only does this avoid
+the costs of moving data, but it also mitigates privacy and confidentiality concerns~\cite{kairouz2019advances}.
+Yet, such a data distribution introduces new challenges for learning systems, as
+local datasets reflect the usage and production patterns specific to each participant: they are
 \emph{not} independent and identically distributed
 (non-IID). More specifically, the relative frequency of different classes of examples may significantly vary
 across local datasets \cite{quagmire}.
@@ -97,13 +105,14 @@ can efficiently deal with such non-IID data
 \cite{kairouz2019advances,fedprox,scaffold,quagmire}.
 Federated learning algorithms can be classified into two categories depending
-on the underlying network topology they run on. In server-based FL, the network is organized according to a star topology: a central server orchestrates the training process and
-iteratively aggregates model updates received from the participants
-(\emph{clients}) and sends
+on the underlying network topology they run on. In server-based FL, the
+network is organized according to a star topology: a central server orchestrates the training process by
+iteratively aggregating model updates received from the participants
+(\emph{clients}) and sending
+the aggregated model back to them \cite{mcmahan2016communication}. In contrast,
-fully decentralized FL algorithms operate over an arbitrary topology where
-participants communicate only with their direct neighbors
-in the underlying communcation graph. A classic example of such algorithms is Decentralized
+fully decentralized FL algorithms operate over an arbitrary graph topology
+where participants communicate only with their direct neighbors
+in the graph. A classic example of such algorithms is Decentralized
 SGD (D-SGD) \cite{lian2017d-psgd}, in which participants alternate between local SGD updates and model averaging with neighboring nodes.
@@ -112,40 +121,15 @@ generally scale better to the large number of participants seen in ``cross-devic
 applications \cite{kairouz2019advances}. Effectively, while a central server may quickly become a bottleneck as the number of participants increases, the topology used in fully decentralized algorithms can remain sparse enough such that all participants need only to communicate with a small number of other participants, i.e. nodes have small (constant or logarithmic) degree
-\cite{lian2017d-psgd}. Recent work has shown both empirically
+\cite{lian2017d-psgd}. For IID data, recent work has shown both empirically
 \cite{lian2017d-psgd,Lian2018} and theoretically \cite{neglia2020} that sparse topologies like rings or grids do not significantly affect the convergence
-speed compared to using denser topologies with IID data.
+speed compared to using denser topologies.
 % We also note that full decentralization can also provide benefits in terms of
 % privacy protection \cite{amp_dec}.
-In contrast to the IID case however, our experiments demonstrate that \emph{the impact of topology is extremly significant for non-IID data}. This phenomenon is illustrated
-in Figure~\ref{fig:iid-vs-non-iid-problem}: we observe that a ring or
-a grid topology clearly jeopardizes the convergence speed as local distributions do not have relative frequency of classes similar to the global distribution, i.e. they have \textit{local class bias}. We also observe that, unlike in centralized FL
-\cite{kairouz2019advances,scaffold,quagmire}, this
-happens even when nodes perform a single local update before averaging the
-model with their neighbors. In this paper, we address the following question:
-
-% \textit{Are there regular topologies, i.e. where all nodes have similar or the same number of neighbours, with less connections than a fully-connected graph that retain a similar convergence speed and non-IID behaviour?}
-
-%\textit{Are there sparse topologies with similar convergence speed as the fully connected graph under a large number of participants with local class bias?}
-
-\textit{Can we design sparse topologies with convergence
- speed similar to the one obtained in a fully connected graph communication network under a large number of participants with local class bias?}
-%AMK: do we talk about local class bias or noniidness?
-
-
-%Indeed, as we show with the following contributions:
-In this paper, we make the following contributions:
-(1) We propose D-Cliques, a sparse topology in which nodes are organized in cliques, i.e. locally fully-connected sets of nodes, such that the joint data distribution of each clique is representative of the global (IID) distribution; (2) We propose Clique Averaging, a modified version of the standard D-SGD algorithm which decouples gradient averaging, used for optimizing local models, from distributed averaging, used to ensure all models converge, therefore reducing the bias introduced by inter-clique connections; (3) We show how Clique Averaging can be used to implement unbiased momentum that would otherwise be detrimental in a non-IID setting; (4) We demonstrate though an extensive experimental study that our approach removes the effect of the local class bias both for the MNIST~\cite{mnistWebsite} and CIFAR10~\cite{krizhevsky2009learning} datasets, with a linear and deep convolutional network; (5) Finally, we demonstrate the scalability of our approach by considering up to 1000 node networks, in contrast to most previous work on fully decentralized algorithms that considers only a few tens of participants~\cite{tang18a,more_refs}.
-%we show that these results hold up to 1000 participants, in contrast to most previous work on fully decentralized algorithms that considers only a few tens of participants \cite{tang18a,more_refs}.
-
-For instance, our results show that using D-Cliques in a 1000 node network requires 98\% less edges ($18.9$ vs $999$ edges per participant on average) thus yielding a 96\% reduction in the total number of required messages (37.8 messages per round per node on average instead of 999) to obtain a similar convergence speed as a fully-connected topology. Furthermore an additional 22\% improvement (14.5 edges per node on average instead of 18.9) is possible when using a small-world inter-clique topology, with further potential gains at larger scales because of its linear-logarithmic scaling.
-
-The rest of this paper is organized as follows. We first present the problem statement and methodology (Section~\ref{section:problem}). The D-Cliques design is presented in Section~\ref{section:d-cliques}) along with its benefit. In Section~\ref{section:clique-averaging}, we show how to further reduce bias with Clique Averaging and how to use it to implement momentum. We present the results or our extensive experimental study in Section~\ref{section:non-clustered}). Related work is surveyed in Section~\ref{section:related-work}) before we conclude in (Section~\ref{section:conclusion}.
-%When then explain how to construct D-Cliques and show their benefits (Section~\ref{section:d-cliques}). We show how to further reduce bias with Clique Averaging (Section~\ref{section:clique-averaging}). We then show how to use Clique Averaging to implement momentum (Section~\ref{section:momentum}). Having shown the effectiveness of D-Cliques, we evaluate the importance of clustering (Section~\ref{section:non-clustered}), and full intra-clique connections (Section~\ref{section:intra-clique-connectivity}). Having established the design, we then study how best to scale it (Section~\ref{section:interclique-topologies}). We conclude with a survey of related work (Section~\ref{section:related-work}) and a brief summary of the paper (Section~\ref{section:conclusion}).
-\begin{figure}
+\begin{figure}[t]
 \centering
 % From directory results/mnist
@@ -153,7 +137,7 @@ The rest of this paper is organized as follows. We first present the problem sta
 \begin{subfigure}[b]{0.31\textwidth}
 \centering
 \includegraphics[width=\textwidth]{figures/ring-IID-vs-non-IID}
-\caption{\label{fig:ring-IID-vs-non-IID} Ring: (almost) minimal connectivity.}
+\caption{\label{fig:ring-IID-vs-non-IID} Ring}
 \end{subfigure}
 \quad
 % From directory results/mnist
@@ -161,7 +145,7 @@ The rest of this paper is organized as follows. We first present the problem sta
 \begin{subfigure}[b]{0.31\textwidth}
 \centering
 \includegraphics[width=\textwidth]{figures/grid-IID-vs-non-IID}
-\caption{\label{fig:grid-IID-vs-non-IID} Grid: intermediate connectivity.}
+\caption{\label{fig:grid-IID-vs-non-IID} Grid}
 \end{subfigure}
 \quad
 % From directory results/mnist
@@ -169,14 +153,78 @@ The rest of this paper is organized as follows. We first present the problem sta
 \begin{subfigure}[b]{0.31\textwidth}
 \centering
 \includegraphics[width=\textwidth]{figures/fully-connected-IID-vs-non-IID}
-\caption{\label{fig:fully-connected-IID-vs-non-IID} Fully-connected: max connectivity.}
+\caption{\label{fig:fully-connected-IID-vs-non-IID} Fully-connected}
 \end{subfigure}
- \caption{IID vs non-IID Convergence Speed on MNIST with Linear Model. Thin lines are the minimum
- and maximum accuracy of individual nodes. Bold lines are the average
- accuracy across all nodes. The blue curve shows convergence in the IID case: the topology has limited effect. The orange curve shows convergence in the non-IID case: the topology has a significant effect. When fully-connected, both cases converge similarly. See Section~\ref{section:experimental-settings} for experimental settings.}
+ \caption{IID vs non-IID convergence speed of decentralized SGD for
+ logistic regression on MNIST with different topologies. Bold lines show
+ the average test accuracy across nodes while thin lines show the minimum
+ and maximum accuracy of individual nodes. While the effect of topology
+ is negligible for IID data, it is very significant in the
+ non-IID case. With a fully-connected topology, both cases converge
+ similarly. See Section~\ref{section:experimental-settings} for details on
+ the experimental setup.}
 \label{fig:iid-vs-non-iid-problem}
 \end{figure}
+
+In contrast to the IID case, however, our experiments demonstrate that \emph{the impact of topology is extremely significant for non-IID data}. This phenomenon is illustrated
+in Figure~\ref{fig:iid-vs-non-iid-problem}: we observe that a ring or
+a grid topology clearly jeopardizes the convergence speed, as local
+distributions do not have relative class frequencies similar to those of the global
+distribution, i.e. they have \textit{local class bias}. We stress that, unlike in centralized FL
+\cite{kairouz2019advances,scaffold,quagmire}, this
+happens even when nodes perform a single local update before averaging the
+model with their neighbors. In this paper, we address the following question:
+
+% \textit{Are there regular topologies, i.e. where all nodes have similar or the same number of neighbours, with less connections than a fully-connected graph that retain a similar convergence speed and non-IID behaviour?}
+
+%\textit{Are there sparse topologies with similar convergence speed as the fully connected graph under a large number of participants with local class bias?}
+
+\textit{Can we design sparse topologies with convergence
+ speed similar to that obtained in a fully-connected network under
+ a large number of participants with local class bias?}
+%AMK: do we talk about local class bias or noniidness?
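+
+Before stating our contributions, we briefly recall for reference how a
+D-SGD round proceeds; the notation below is only meant to fix intuition and
+is not necessarily the one used in the rest of the paper. Each node $i$
+maintains a local model $x_i$ and alternates between a local stochastic
+gradient step and an averaging step with its set of neighbors $N(i)$ in the
+topology:
+\[
+x_i^{(t+\frac{1}{2})} = x_i^{(t)} - \gamma \nabla F_i\big(x_i^{(t)}; \xi_i^{(t)}\big),
+\qquad
+x_i^{(t+1)} = \sum_{j \in N(i) \cup \{i\}} W_{ij}\, x_j^{(t+\frac{1}{2})},
+\]
+where $\gamma$ is the learning rate, $\xi_i^{(t)}$ is a mini-batch drawn from
+the local dataset of node $i$, $F_i$ is the local objective of node $i$, and
+$W$ is a doubly-stochastic weight matrix supported on the edges of the
+topology. The sparser the topology, the fewer models each node has to send
+and receive in the averaging step.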
+
+
+%Indeed, as we show with the following contributions:
+In this paper, we make the following contributions:
+(1) We propose D-Cliques, a sparse topology in which nodes are organized in
+interconnected cliques, i.e. locally fully-connected sets of nodes, such that
+the joint data distribution of each clique is representative of the global
+(IID) distribution; (2) We propose Clique Averaging, a modified version of
+the standard D-SGD algorithm which decouples gradient averaging, used for
+optimizing local models, from distributed averaging, used to ensure all models
+converge, therefore reducing the bias introduced by inter-clique connections;
+(3) We show how Clique Averaging can be used to implement unbiased momentum
+that would otherwise be detrimental in a non-IID setting; (4) We demonstrate
+through an extensive experimental study that our approach removes the effect
+of the local class bias both for the MNIST~\cite{mnistWebsite} and
+CIFAR10~\cite{krizhevsky2009learning} datasets, for training a linear model and a deep
+convolutional network; (5) Finally, we demonstrate the scalability of our
+approach by considering networks of up to 1000 nodes, in contrast to most
+previous work on fully decentralized learning that considers only a few tens
+of participants~\cite{tang18a,neglia2020,momentum_noniid,cross_gradient,consensus_distance}.
+%we show that these results hold up to 1000 participants, in contrast to most previous work on fully decentralized algorithms that considers only a few tens of participants \cite{tang18a,more_refs}.
+
+For instance, our results show that using D-Cliques in a 1000-node network
+requires 98\% fewer edges ($18.9$ vs $999$ edges per participant on average), thus yielding a 96\% reduction in the total number of required messages (37.8 messages per round per node on average instead of 999) to obtain a convergence speed similar to that of a fully-connected topology. Furthermore, an additional 22\% improvement
+% (14.5 edges per node on average instead of 18.9)
+is possible when using a small-world inter-clique topology, with further potential gains at larger scales because of its linear-logarithmic scaling.
+
+The rest of this paper is organized as follows. We first present the problem
+statement and our methodology (Section~\ref{section:problem}). The D-Cliques
+design is presented in Section~\ref{section:d-cliques} along with an
+empirical illustration of its benefits. In
+Section~\ref{section:clique-averaging-momentum}, we
+show how to further reduce bias with Clique Averaging and how to use it to
+implement momentum. We present the results of our extensive experimental
+study in Section~\ref{section:non-clustered}. We review some related work in
+ Section~\ref{section:related-work}, and conclude with promising directions
+ for future work in Section~\ref{section:conclusion}.
+ %When % then explain how to construct D-Cliques and show their benefits (Section~\ref{section:d-cliques}). We show how to further reduce bias with Clique Averaging (Section~\ref{section:clique-averaging}). We then show how to use Clique Averaging to implement momentum (Section~\ref{section:momentum}). Having shown the effectiveness of D-Cliques, we evaluate the importance of clustering (Section~\ref{section:non-clustered}), and full intra-clique connections (Section~\ref{section:intra-clique-connectivity}). Having established the design, we then study how best to scale it (Section~\ref{section:interclique-topologies}). We conclude with a survey of related work (Section~\ref{section:related-work}) and a brief summary of the paper (Section~\ref{section:conclusion}).
+
 %\footnotetext{This is different from the accuracy of the average model across nodes that is sometimes used once training is completed.}
 \section{Problem Statement}
@@ -338,7 +386,7 @@ A network of 100 non-IID nodes with D-Cliques is illustrated in Figure~\ref{fig:
 \section{Optimizing with Clique Averaging}
-\label{section:clique-averaging}
+\label{section:clique-averaging-momentum}
 In this section we present Clique Averaging, a feature that further reduces the bias introduced by data non-IIDness. %AMK: check
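+Informally, and in anticipation of the precise algorithm given below, the
+idea can be sketched as follows (the notation here is only illustrative):
+each node $i$ computes the gradient for its local optimization step as the
+average of the gradients of the members of its clique $C(i)$,
+\[
+g_i^{(t)} = \frac{1}{|C(i)|} \sum_{j \in C(i)} \nabla F_j\big(x_j^{(t)}; \xi_j^{(t)}\big),
+\]
+while model averaging is still performed with all of its neighbors, including
+those reached through inter-clique edges. Decoupling the two averaging steps
+in this way prevents the skewed local distributions of inter-clique neighbors
+from biasing the gradient direction.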