% !TEX root = main.tex

\section{Introduction}

Machine learning is currently shifting from a \emph{centralized}
paradigm, in which models are trained on data located on a single machine or
in a data center, to a \emph{decentralized} one.
Indeed, the latter paradigm closely matches the natural data distribution
in the numerous use cases where data is collected and processed by several
independent parties (hospitals, companies, personal devices, etc.).
Federated Learning (FL) allows a set
of participants to collaboratively train machine learning models
on their joint
data while keeping it where it was produced. Not only does this avoid
the costs of moving data, but it also mitigates privacy and confidentiality concerns~\cite{kairouz2019advances}.
Yet, working with natural data distributions introduces new challenges for
learning systems, as
local datasets
reflect the usage and production patterns specific to each participant: they are
\emph{not} independent and identically distributed
(non-IID). More specifically, the relative frequency of different classes of examples may significantly vary
across local datasets \cite{kairouz2019advances,quagmire}.
Therefore, one of the key challenges in FL is to design algorithms that
can efficiently deal with such non-IID data distributions
\cite{kairouz2019advances,fedprox,scaffold,quagmire}.

Federated learning algorithms can be classified into two categories depending
on the underlying network topology they run on. In server-based FL, the
network is organized according to a star topology: a central server orchestrates the training process by
iteratively aggregating model updates received from the participants
(\emph{clients}) and sending back the aggregated model \cite{mcmahan2016communication}. In contrast,
fully decentralized FL algorithms operate over an arbitrary network topology
where participants communicate only with their direct neighbors
in the network. A classic example of such algorithms is Decentralized
SGD (D-SGD) \cite{lian2017d-psgd}, in which participants alternate between
local SGD updates and model averaging with neighboring nodes.
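For concreteness, one D-SGD round can be sketched as follows. This is a minimal illustration in Python-style pseudocode, not the exact implementation evaluated later: the function name \texttt{d\_sgd\_round}, the step size, and the matrix interface are ours for exposition.

\begin{verbatim}
import numpy as np

def d_sgd_round(x, grads, W, lr=0.1):
    # x: (n, d) matrix of local models; grads: (n, d) local
    # mini-batch gradients; W: (n, n) doubly stochastic mixing
    # matrix with W[i, j] > 0 only if i == j or nodes i and j
    # are neighbors in the topology; lr: illustrative step size.
    x = x - lr * grads  # local SGD update at every node
    return W @ x        # model averaging with direct neighbors
\end{verbatim}

Here, row $i$ of $Wx$ is the weighted average of node $i$'s model with those of its neighbors; sparse topologies correspond to sparse mixing matrices $W$.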

In this paper, we focus on fully decentralized algorithms as they can
generally scale better to the large number of participants seen in ``cross-device''
applications \cite{kairouz2019advances}. Indeed, while a central
server may quickly become a bottleneck as the number of participants increases, the topology used in fully decentralized algorithms can remain sparse
enough that all participants need only communicate with a small number of other participants, i.e., nodes have a small (constant or logarithmic) degree
\cite{lian2017d-psgd}. For IID data, recent work has shown both empirically
\cite{lian2017d-psgd,Lian2018} and theoretically \cite{neglia2020} that sparse
topologies like rings or grids do not significantly affect the convergence
speed compared to using denser topologies.

\begin{figure*}[ht]
     \centering
     
     % From directory results/mnist
     % python ../../../../Software/non-iid-topology-simulator/tools/plot_convergence.py ring/iid/all/2021-03-30-16:07:06-CEST ring/non-iid/all/2021-03-30-16:07:03-CEST --add-min-max --legend 'lower right' --yaxis test-accuracy --labels '100 nodes IID' '100 nodes non-IID' --save-figure ../../figures/ring-IID-vs-non-IID.png --font-size 20 --linestyles 'solid' 'dashed'
     \begin{subfigure}[b]{0.25\textwidth}
         \centering
         \includegraphics[width=\textwidth]{../figures/ring-IID-vs-non-IID}
\caption{\label{fig:ring-IID-vs-non-IID} Ring}
     \end{subfigure}
     \quad
    % From directory results/mnist
     % python ../../../../Software/non-iid-topology-simulator/tools/plot_convergence.py grid/iid/all/2021-03-30-16:07:01-CEST grid/non-iid/all/2021-03-30-16:06:59-CEST --add-min-max --legend 'lower right' --yaxis test-accuracy --labels '100 nodes IID' '100 nodes non-IID' --save-figure ../../figures/grid-IID-vs-non-IID.png --font-size 20 --linestyles 'solid' 'dashed'
     \begin{subfigure}[b]{0.25\textwidth}
         \centering
         \includegraphics[width=\textwidth]{../figures/grid-IID-vs-non-IID}
\caption{\label{fig:grid-IID-vs-non-IID} Grid}
     \end{subfigure}
     \quad
         % From directory results/mnist
     % python ../../../../Software/non-iid-topology-simulator/tools/plot_convergence.py fully-connected/iid/all/2021-03-30-16:07:20-CEST fully-connected/all/2021-03-10-09:25:19-CET  --add-min-max --legend 'lower right' --yaxis test-accuracy --labels '100 nodes IID' '100 nodes non-IID' --save-figure ../../figures/fully-connected-IID-vs-non-IID.png --font-size 20 --linestyles 'solid' 'dashed'
     \begin{subfigure}[b]{0.25\textwidth}
         \centering
         \includegraphics[width=\textwidth]{../figures/fully-connected-IID-vs-non-IID}
\caption{\label{fig:fully-connected-IID-vs-non-IID} Fully-connected}
     \end{subfigure}
        \caption{IID vs non-IID convergence speed of decentralized SGD for
        logistic regression on
        MNIST for different topologies. Bold lines show the average test
        accuracy across nodes
        while thin lines show the minimum
        and maximum accuracy of individual nodes. While the effect of topology
        is negligible for IID data, it is very significant in the
        non-IID case. On a fully-connected topology, both cases converge similarly. See
        Section~\ref{section:experimental-settings} for details on
        the experimental setup.}
        \label{fig:iid-vs-non-iid-problem}
\end{figure*}



In contrast to the IID case, however, our experiments demonstrate that \emph{the impact of topology is extremely significant for non-IID data}. This phenomenon is illustrated
in Figure~\ref{fig:iid-vs-non-iid-problem}, where we observe that using a
sparse topology (such as a ring or
a grid) clearly jeopardizes the convergence speed when the relative class
frequencies of the local distributions differ from those of the global
distribution, i.e., when they exhibit \textit{label distribution skew} \cite{kairouz2019advances}.
We stress that, unlike in server-based FL
\cite{mcmahan2016communication,scaffold,quagmire}, this
happens even when nodes perform a single local update before averaging the
model with their neighbors. In this paper, we thus address the following
question:

\textit{Can we design sparse topologies with a convergence
  speed similar to that of a fully connected network for problems involving
  many participants with label distribution skew?}

Specifically, we make the following contributions:
(1) We propose D-Cliques, a sparse topology in which nodes are organized in
interconnected cliques, i.e., locally fully-connected sets of nodes, such that
the joint label distribution of each clique is close to that of the global
(IID) distribution; (2) We propose Greedy Swap, an algorithm for efficiently
constructing such cliques under the kind of data heterogeneity previously
studied in the context of Federated Learning~\cite{mcmahan2016communication};
 (3) We propose Clique Averaging, a modified version of
the standard D-SGD algorithm which decouples gradient averaging, used for
optimizing local models, from distributed averaging, used to ensure all models
converge, thereby reducing the bias introduced by inter-clique connections
(see the sketch following this paragraph);
(4) We show how Clique Averaging can be used to implement unbiased momentum
that would otherwise be detrimental in the non-IID setting; (5) We
demonstrate
through an extensive experimental study that our approach removes the effect
of label distribution skew on the MNIST~\cite{mnistWebsite} and
CIFAR10~\cite{krizhevsky2009learning} datasets, for training a linear model and a deep
convolutional network; (6) Finally, we demonstrate the scalability of our
approach by considering up to 1000-node networks, in contrast to most
previous work on fully decentralized learning that considers only a few tens
of nodes
\cite{tang18a,neglia2020,momentum_noniid,cross_gradient,consensus_distance}.
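To build intuition for contribution (3), the following minimal sketch, in the same Python-style pseudocode as above, averages gradients only within each clique, whose joint label distribution is close to IID by construction, while model averaging still uses all edges, including the inter-clique ones. The names and interface are ours for illustration; the precise algorithm is specified later in the paper.

\begin{verbatim}
import numpy as np

def clique_averaging_round(x, grads, W, cliques, lr=0.1):
    # x, grads: (n, d) local models and gradients; W: (n, n)
    # doubly stochastic mixing matrix over ALL edges (intra- and
    # inter-clique); cliques: list of index arrays partitioning
    # the nodes into cliques.
    g_hat = np.empty_like(grads)
    for members in cliques:
        # gradient averaging: only within the (near-IID) clique
        g_hat[members] = grads[members].mean(axis=0)
    x = x - lr * g_hat   # local update with the clique gradient
    return W @ x         # distributed averaging over all edges
\end{verbatim}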

For instance, our results show that using D-Cliques in a 1000-node network
requires 98\% fewer edges ($18.9$ vs $999$ edges per participant on average),
thereby yielding a 96\% reduction in the total number of required messages
(37.8 messages per round per node on average instead of 999), to obtain a similar convergence speed as a fully-connected topology. Furthermore, an additional 22\% improvement
% (14.5 edges per node on average instead of 18.9)
is possible when using a small-world inter-clique topology, with further potential gains at larger scales due to its quasilinear scaling ($O(n \log(n))$) in the number of nodes $n$.
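For concreteness, these percentages follow directly from the quoted per-node edge counts, assuming (for this back-of-the-envelope check) that each edge carries one message in each direction per round:
\[
1 - \frac{18.9}{999} \approx 98\% \ \text{(edges)},
\qquad
1 - \frac{2 \times 18.9}{999} = 1 - \frac{37.8}{999} \approx 96\% \ \text{(messages)}.
\]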

The rest of this paper is organized as follows \dots \todo{EL: Complete once structure stabilizes}
%We first present the problem
%statement and our methodology (Section~\ref{section:problem}). The D-Cliques
%design is presented in Section~\ref{section:d-cliques}) along with an
%empirical illustration of its benefits. In
%Section~\ref{section:clique-averaging-momentum}, we
%show how to further reduce bias with Clique Averaging and how to use it to
%implement momentum.  We present the results of our extensive experimental
%study in  Section~\ref{section:non-clustered}. We review some related work in
% Section~\ref{section:related-work}, and conclude with promising directions
% for future work in Section~\ref{section:conclusion}.