% !TEX root = main.tex
\section{Introduction}

Machine learning is currently shifting from a \emph{centralized} paradigm, in which training data is located on a single machine or in a data center, to \emph{decentralized} ones in which data is processed where it was naturally produced. This shift is illustrated by the rise of Federated Learning (FL). FL allows several parties (hospitals, companies, personal devices...) to collaboratively train machine learning models on their joint data without centralizing it. Not only does FL avoid the costs of moving data, but it also mitigates privacy and confidentiality concerns~\cite{kairouz2019advances}. Yet, working with natural data distributions introduces new challenges for learning systems, as local datasets reflect the usage and production patterns specific to each participant: in other words, they are \emph{heterogeneous}. An important type of data heterogeneity encountered in classification problems, known as \emph{label distribution skew}~\cite{kairouz2019advances,quagmire}, occurs when the frequency of the different classes of examples varies significantly across local datasets. One of the key challenges in FL is to design algorithms that can efficiently deal with such heterogeneous data distributions~\cite{kairouz2019advances,fedprox,scaffold,quagmire}.

Federated learning algorithms can be classified into two categories depending on the underlying network topology they run on. In server-based FL, the network is organized according to a star topology: a central server orchestrates the training process by iteratively aggregating model updates received from the participants (\emph{clients}) and sending back the aggregated model~\cite{mcmahan2016communication}. In contrast, fully decentralized FL algorithms operate over an arbitrary network topology where participants communicate only with their direct neighbors in the network. A classic example of such algorithms is Decentralized SGD (D-SGD)~\cite{lian2017d-psgd}, in which participants alternate between local SGD updates and model averaging with neighboring nodes.

In this paper, we focus on fully decentralized algorithms as they can generally scale better to the large number of participants seen in ``cross-device'' applications~\cite{kairouz2019advances}. Indeed, while a central server may quickly become a bottleneck as the number of participants increases, the topology used in fully decentralized algorithms can remain sparse enough that each participant only needs to communicate with a small number of other participants, i.e., nodes have small (constant or logarithmic) degree~\cite{lian2017d-psgd}. In the homogeneous setting where data is independent and identically distributed (IID) across nodes, recent work has shown both empirically~\cite{lian2017d-psgd,Lian2018} and theoretically~\cite{neglia2020} that sparse topologies such as rings or grids do not significantly affect the convergence speed compared to denser topologies.
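To fix ideas, the following minimal sketch illustrates one synchronous round of a common D-SGD variant: a local SGD step followed by averaging with neighbors through a mixing matrix. The function names and the assumption of a doubly stochastic mixing matrix $W$ supported on the topology are illustrative choices, not the exact implementation evaluated in this paper.

\begin{verbatim}
# Minimal sketch of one synchronous D-SGD round (illustrative only,
# not the exact implementation used in our experiments).
import numpy as np

def dsgd_round(models, W, grad, lr):
    """models: (n, d) array of local models; W: (n, n) doubly stochastic
    mixing matrix with W[i, j] > 0 iff j is a neighbor of i (or j == i);
    grad(i, x): stochastic gradient of node i's local loss at x."""
    n = models.shape[0]
    # 1) Local step: each node updates its model on its own data.
    local = np.stack([models[i] - lr * grad(i, models[i])
                      for i in range(n)])
    # 2) Gossip step: each node averages with its direct neighbors.
    return W @ local
\end{verbatim}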
\begin{figure*}[t]
  \centering
  % From directory results/mnist
  % python ../../../../Software/non-iid-topology-simulator/tools/plot_convergence.py ring/iid/all/2021-03-30-16:07:06-CEST ring/non-iid/all/2021-03-30-16:07:03-CEST --add-min-max --legend 'lower right' --yaxis test-accuracy --labels '100 nodes IID' '100 nodes non-IID' --save-figure ../../figures/ring-IID-vs-non-IID.png --font-size 20 --linestyles 'solid' 'dashed'
  \begin{subfigure}[b]{0.25\textwidth}
    \centering
    \includegraphics[width=\textwidth]{../figures/ring-IID-vs-non-IID}
    \caption{\label{fig:ring-IID-vs-non-IID} Ring topology}
  \end{subfigure}
  \quad
  % From directory results/mnist
  % python ../../../../Software/non-iid-topology-simulator/tools/plot_convergence.py grid/iid/all/2021-03-30-16:07:01-CEST grid/non-iid/all/2021-03-30-16:06:59-CEST --add-min-max --legend 'lower right' --yaxis test-accuracy --labels '100 nodes IID' '100 nodes non-IID' --save-figure ../../figures/grid-IID-vs-non-IID.png --font-size 20 --linestyles 'solid' 'dashed'
  \begin{subfigure}[b]{0.25\textwidth}
    \centering
    \includegraphics[width=\textwidth]{../figures/grid-IID-vs-non-IID}
    \caption{\label{fig:grid-IID-vs-non-IID} Grid topology}
  \end{subfigure}
  \quad
  % From directory results/mnist
  % python ../../../../Software/non-iid-topology-simulator/tools/plot_convergence.py fully-connected/iid/all/2021-03-30-16:07:20-CEST fully-connected/all/2021-03-10-09:25:19-CET --add-min-max --legend 'lower right' --yaxis test-accuracy --labels '100 nodes IID' '100 nodes non-IID' --save-figure ../../figures/fully-connected-IID-vs-non-IID.png --font-size 20 --linestyles 'solid' 'dashed'
  \begin{subfigure}[b]{0.25\textwidth}
    \centering
    \includegraphics[width=\textwidth]{../figures/fully-connected-IID-vs-non-IID}
    \caption{\label{fig:fully-connected-IID-vs-non-IID} Fully-connected topology}
  \end{subfigure}
  \caption{Convergence speed of decentralized SGD with and without label distribution skew for different topologies. The task is logistic regression on MNIST (see Section~\ref{section:experimental-settings} for details on the experimental setup). Bold lines show the average test accuracy across nodes, while thin lines show the minimum and maximum accuracy of individual nodes. While the effect of topology is negligible for homogeneous data, it is very significant in the heterogeneous case. On a fully-connected network, both cases converge similarly.}
  \label{fig:iid-vs-non-iid-problem}
\end{figure*}

In contrast to the homogeneous case, however, our experiments demonstrate that \emph{the impact of topology is extremely significant for heterogeneous data}. This phenomenon is illustrated in Figure~\ref{fig:iid-vs-non-iid-problem}: we observe that under label distribution skew, using a sparse topology (a ring or a grid) clearly jeopardizes the convergence speed of decentralized SGD. We stress that, unlike in centralized FL~\cite{mcmahan2016communication,scaffold,quagmire}, this happens even when nodes perform a single local update before averaging the model with their neighbors. In this paper, we thus address the following question:

\textit{Can we design sparse topologies with convergence speed similar to that of a fully connected network for problems involving many participants with label distribution skew?}
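For concreteness, one common way to generate label-skewed partitions for experiments such as those of Figure~\ref{fig:iid-vs-non-iid-problem} is the shard-based scheme in the spirit of~\cite{mcmahan2016communication}: samples are sorted by label, split into shards, and each node receives a few shards. The sketch below illustrates this generic scheme only; the exact protocol used in our experiments is described in Section~\ref{section:experimental-settings}, and the function name is ours.

\begin{verbatim}
# Generic sketch of a shard-based label-skewed partition (illustration
# only; see the experimental settings section for our actual protocol).
import numpy as np

def skewed_partition(labels, n_nodes, shards_per_node=2, seed=0):
    """labels: (n_samples,) int array; returns one index array per node."""
    rng = np.random.default_rng(seed)
    order = np.argsort(labels)  # group sample indices by class
    shards = np.array_split(order, n_nodes * shards_per_node)
    perm = rng.permutation(n_nodes * shards_per_node)
    return [np.concatenate([shards[s] for s in
                            perm[i * shards_per_node:
                                 (i + 1) * shards_per_node]])
            for i in range(n_nodes)]
\end{verbatim}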
Specifically, we make the following contributions:
(1) We propose D-Cliques, a sparse topology in which nodes are organized in interconnected cliques, i.e., locally fully-connected sets of nodes, such that the joint label distribution of each clique is close to the global distribution;
(2) We design a greedy algorithm for constructing such cliques efficiently (an illustrative sketch is given at the end of this section);
% in the presence of heterogeneity previously studied
% in the context of Federated Learning~\cite{mcmahan2016communication};
(3) We introduce Clique Averaging, a modified version of the standard D-SGD algorithm that decouples gradient averaging, used for optimizing local models, from distributed averaging, used to ensure that all models converge, thereby reducing the bias introduced by inter-clique connections;
(4) We show how Clique Averaging can be used to implement unbiased momentum, which would otherwise be detrimental under label distribution skew;
(5) We demonstrate through an extensive experimental study that our approach removes the effect of label distribution skew when training a linear model and a deep convolutional network on the MNIST %~\cite{mnistWebsite}
and CIFAR10 %~\cite{krizhevsky2009learning}
datasets, respectively;
(6) Finally, we demonstrate the scalability of our approach by considering networks of up to 1000 nodes, in contrast to most previous work on fully decentralized learning, which considers only a few tens of nodes~\cite{tang18a,neglia2020,momentum_noniid,cross_gradient,consensus_distance}.

For instance, our results show that using D-Cliques in a 1000-node network requires 98\% fewer edges ($18.9$ vs $999$ edges per participant on average) to obtain a convergence speed similar to that of a fully-connected topology, thereby yielding a 96\% reduction in the total number of required messages ($37.8$ messages per round per node on average instead of $999$). Furthermore, an additional 22\% improvement
% (14.5 edges per node on average instead of 18.9)
is possible when using a small-world inter-clique topology, with further potential gains at larger scales thanks to its quasilinear $O(n \log n)$ scaling in the number of nodes $n$.

The rest of this paper is organized as follows. We first present the problem statement and our methodology (Section~\ref{section:problem}). The D-Cliques design is presented in Section~\ref{section:d-cliques}, along with an empirical illustration of its benefits. In Section~\ref{section:clique-averaging-momentum}, we show how to further reduce bias with Clique Averaging and how to use it to implement momentum. We present the results of our extensive experimental study in Section~\ref{section:non-clustered}. We review related work in Section~\ref{section:related-work}, and conclude with promising directions for future work in Section~\ref{section:conclusion}.
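As referenced in contribution (2) above, the following minimal sketch conveys the flavor of a greedy clique construction: nodes are added one by one to the current clique so as to keep its joint label distribution as close as possible to the global one. Names and the $L_1$-distance criterion are illustrative choices, and the actual algorithm presented in Section~\ref{section:d-cliques} may differ in its details.

\begin{verbatim}
# Illustrative greedy clique construction (a sketch of the idea behind
# contribution (2), not the exact algorithm presented later).
import numpy as np

def build_cliques(counts, clique_size):
    """counts: (n_nodes, n_classes) label counts per node;
    returns a list of cliques, each a list of node indices."""
    global_dist = counts.sum(axis=0) / counts.sum()
    unassigned = set(range(len(counts)))
    cliques = []
    while unassigned:
        clique, acc = [], np.zeros(counts.shape[1])
        while unassigned and len(clique) < clique_size:
            # Add the node that keeps the clique's joint label
            # distribution closest (in L1 distance) to the global one.
            best = min(unassigned, key=lambda i: np.abs(
                (acc + counts[i]) / (acc + counts[i]).sum()
                - global_dist).sum())
            clique.append(best)
            acc += counts[best]
            unassigned.remove(best)
        cliques.append(clique)
    return cliques
\end{verbatim}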