% !TEX root = main.tex
\section{Introduction}

Machine learning is currently shifting from a \emph{centralized} paradigm, where training data is located on a single machine or in a data center, to a \emph{decentralized} one in which data is processed where it is naturally produced. This shift is illustrated by the rise of Federated Learning (FL). FL allows several parties (hospitals, companies, personal devices, etc.) to collaboratively train machine learning models on their joint data without centralizing it. Not only does FL avoid the costs of moving data, but it also mitigates privacy and confidentiality concerns~\cite{kairouz2019advances}. Yet, working with natural data distributions introduces new challenges for learning systems, as local datasets reflect the usage and production patterns specific to each participant: they are \emph{not} independent and identically distributed (non-IID). In the context of classification problems, the relative frequency of different classes of examples may significantly vary across local datasets, a situation known as \emph{label distribution skew}~\cite{kairouz2019advances,quagmire}. Therefore, one of the key challenges in FL is to design algorithms that can efficiently deal with such non-IID data distributions~\cite{kairouz2019advances,fedprox,scaffold,quagmire}.

Federated learning algorithms can be classified into two categories depending on the underlying network topology they run on. In server-based FL, the network is organized according to a star topology: a central server orchestrates the training process by iteratively aggregating model updates received from the participants (\emph{clients}) and sending back the aggregated model~\cite{mcmahan2016communication}. In contrast, fully decentralized FL algorithms operate over an arbitrary network topology where participants communicate only with their direct neighbors in the network.
A classic example of such algorithms is Decentralized SGD (D-SGD)~\cite{lian2017d-psgd}, in which participants alternate between local SGD updates and model averaging with neighboring nodes. In this paper, we focus on fully decentralized algorithms, as they can generally scale better to the large number of participants seen in ``cross-device'' applications~\cite{kairouz2019advances}. Indeed, while a central server may quickly become a bottleneck as the number of participants increases, the topology used in fully decentralized algorithms can remain sparse enough that all participants only need to communicate with a small number of other participants, i.e., nodes have a small (constant or logarithmic) degree~\cite{lian2017d-psgd}. For IID data, recent work has shown both empirically~\cite{lian2017d-psgd,Lian2018} and theoretically~\cite{neglia2020} that sparse topologies such as rings or grids do not significantly affect the convergence speed compared to denser topologies.

\begin{figure*}[ht]
\centering
% From directory results/mnist
% python ../../../../Software/non-iid-topology-simulator/tools/plot_convergence.py ring/iid/all/2021-03-30-16:07:06-CEST ring/non-iid/all/2021-03-30-16:07:03-CEST --add-min-max --legend 'lower right' --yaxis test-accuracy --labels '100 nodes IID' '100 nodes non-IID' --save-figure ../../figures/ring-IID-vs-non-IID.png --font-size 20 --linestyles 'solid' 'dashed'
\begin{subfigure}[b]{0.25\textwidth}
\centering
\includegraphics[width=\textwidth]{../figures/ring-IID-vs-non-IID}
\caption{\label{fig:ring-IID-vs-non-IID} Ring}
\end{subfigure}
\quad
% From directory results/mnist
% python ../../../../Software/non-iid-topology-simulator/tools/plot_convergence.py grid/iid/all/2021-03-30-16:07:01-CEST grid/non-iid/all/2021-03-30-16:06:59-CEST --add-min-max --legend 'lower right' --yaxis test-accuracy --labels '100 nodes IID' '100 nodes non-IID' --save-figure ../../figures/grid-IID-vs-non-IID.png --font-size 20 --linestyles 'solid' 'dashed'
\begin{subfigure}[b]{0.25\textwidth}
\centering
\includegraphics[width=\textwidth]{../figures/grid-IID-vs-non-IID}
\caption{\label{fig:grid-IID-vs-non-IID} Grid}
\end{subfigure}
\quad
% From directory results/mnist
% python ../../../../Software/non-iid-topology-simulator/tools/plot_convergence.py fully-connected/iid/all/2021-03-30-16:07:20-CEST fully-connected/all/2021-03-10-09:25:19-CET --add-min-max --legend 'lower right' --yaxis test-accuracy --labels '100 nodes IID' '100 nodes non-IID' --save-figure ../../figures/fully-connected-IID-vs-non-IID.png --font-size 20 --linestyles 'solid' 'dashed'
\begin{subfigure}[b]{0.25\textwidth}
\centering
\includegraphics[width=\textwidth]{../figures/fully-connected-IID-vs-non-IID}
\caption{\label{fig:fully-connected-IID-vs-non-IID} Fully-connected}
\end{subfigure}
\caption{IID vs non-IID convergence speed of decentralized SGD for logistic regression on MNIST for different topologies. Bold lines show the average test accuracy across nodes, while thin lines show the minimum and maximum accuracy of individual nodes. While the effect of topology is negligible for IID data, it is very significant in the non-IID case. When fully connected, both cases converge similarly. See Section~\ref{section:experimental-settings} for details on the experimental setup.}
\label{fig:iid-vs-non-iid-problem}
\end{figure*}

In contrast to the IID case, however, our experiments demonstrate that \emph{the impact of topology is extremely significant for non-IID data}. This phenomenon is illustrated in Figure~\ref{fig:iid-vs-non-iid-problem}: we observe that under label distribution skew, using a sparse topology (a ring or a grid) clearly jeopardizes the convergence speed of decentralized SGD. We stress that, unlike in centralized FL~\cite{mcmahan2016communication,scaffold,quagmire}, this happens even when nodes perform a single local update before averaging the model with their neighbors.
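To make the averaging step of D-SGD concrete, one round can be sketched as follows. This is a minimal NumPy sketch with uniform averaging weights; the function name is ours, and real implementations typically use a doubly-stochastic mixing matrix, of which uniform weights over closed neighborhoods are only one illustrative special case.

```python
import numpy as np

def d_sgd_step(models, grads, neighbors, lr=0.1):
    """One round of D-SGD: every node takes a local SGD step, then
    averages its model uniformly with those of its direct neighbors.
    `neighbors[i]` lists the indices of node i's neighbors."""
    # Local SGD update on every node
    updated = [m - lr * g for m, g in zip(models, grads)]
    # Average each node's model with its neighbors' models (self included)
    return [
        sum([updated[i]] + [updated[j] for j in neigh]) / (len(neigh) + 1)
        for i, neigh in enumerate(neighbors)
    ]
```

With a fully-connected topology a single averaging step already yields identical models on all nodes, whereas on a sparse topology (ring, grid) information propagates only one hop per round.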
In this paper, we thus address the following question: \textit{Can we design sparse topologies with convergence speed similar to that of a fully connected network for problems involving many participants with label distribution skew?}

Specifically, we make the following contributions:
(1) We propose D-Cliques, a sparse topology in which nodes are organized in interconnected cliques, i.e., locally fully-connected sets of nodes, such that the joint label distribution of each clique is close to that of the global (IID) distribution;
(2) We design a greedy algorithm for constructing such cliques efficiently;
(3) We introduce Clique Averaging, a modified version of the standard D-SGD algorithm which decouples gradient averaging, used for optimizing local models, from distributed averaging, used to ensure that all models converge, thereby reducing the bias introduced by inter-clique connections;
(4) We show how Clique Averaging can be used to implement unbiased momentum, which would otherwise be detrimental in the non-IID setting;
(5) We demonstrate through an extensive experimental study that our approach removes the effect of label distribution skew when training a linear model and a deep convolutional network on the MNIST~\cite{mnistWebsite} and CIFAR10~\cite{krizhevsky2009learning} datasets, respectively;
(6) Finally, we demonstrate the scalability of our approach by considering up to 1000-node networks, in contrast to most previous work on fully decentralized learning, which considers only a few tens of nodes~\cite{tang18a,neglia2020,momentum_noniid,cross_gradient,consensus_distance}.
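To illustrate the kind of greedy construction contribution (2) refers to, consider the following hypothetical sketch (our own illustration, not the paper's exact algorithm): cliques are grown one node at a time, always adding the remaining node whose labels are least represented in the clique so far, so that each clique's joint label distribution tends toward the global one.

```python
from collections import Counter

def greedy_cliques(node_labels, clique_size):
    """Greedily group nodes into cliques whose joint label distribution
    approximates the global one. `node_labels[i]` is the list of class
    labels held by node i. Illustrative sketch only."""
    remaining = list(range(len(node_labels)))
    cliques = []
    while remaining:
        clique = [remaining.pop(0)]           # seed a new clique
        counts = Counter(node_labels[clique[0]])
        while len(clique) < clique_size and remaining:
            # Prefer the node whose labels are least represented so far
            best = min(remaining,
                       key=lambda n: sum(counts[lab] for lab in node_labels[n]))
            clique.append(best)
            counts.update(node_labels[best])
            remaining.remove(best)
        cliques.append(clique)
    return cliques
```

For example, with four nodes holding labels 0, 1, 0, 1 and cliques of size 2, each clique ends up covering both labels.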
For instance, our results show that using D-Cliques in a 1000-node network requires 98\% fewer edges ($18.9$ vs $999$ edges per participant on average) to obtain a convergence speed similar to that of a fully-connected topology, thereby yielding a 96\% reduction in the total number of required messages (37.8 messages per round per node on average instead of 999). Furthermore, an additional 22\% improvement (14.5 edges per node on average instead of 18.9) is possible when using a small-world inter-clique topology, with further potential gains at larger scales through a quasilinear $O(n \log n)$ scaling in the number of nodes $n$.

The rest of this paper is organized as follows. We first present the problem statement and our methodology (Section~\ref{section:problem}). The D-Cliques design is presented in Section~\ref{section:d-cliques}, along with an empirical illustration of its benefits. In Section~\ref{section:clique-averaging-momentum}, we show how to further reduce bias with Clique Averaging and how to use it to implement momentum. We present the results of our extensive experimental study in Section~\ref{section:non-clustered}. We review related work in Section~\ref{section:related-work} and conclude with promising directions for future work in Section~\ref{section:conclusion}.
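The 18.9 edges-per-node figure can be checked with a back-of-envelope computation, assuming (as an illustration) 100 fully-connected cliques of 10 nodes with one edge between every pair of cliques:

```python
def edges_per_node(n, clique_size):
    """Average degree in a clique-based topology: fully-connected
    cliques of `clique_size` nodes, plus one edge between every pair
    of cliques (illustrative assumption)."""
    k = n // clique_size                   # number of cliques
    intra = n * (clique_size - 1) / 2      # edges inside cliques
    inter = k * (k - 1) / 2                # one edge per clique pair
    return 2 * (intra + inter) / n         # each edge touches two nodes
```

For `n = 1000` and cliques of size 10 this gives 18.9, matching the figure above, versus 999 for a fully-connected topology; the quadratic growth of the inter-clique term is what motivates sparser (e.g., small-world) inter-clique patterns at larger scales.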