% !TEX root = main.tex
\section{Introduction}
Machine learning is currently shifting from a \emph{centralized}
paradigm, where training data is located on a single
machine or
in a data center, to \emph{decentralized} ones in which data is processed
where it was naturally produced.
This shift is illustrated by the rise of Federated
Learning
(FL). FL allows
several parties (hospitals, companies, personal
devices...) to collaboratively train machine learning models
on their data without centralizing it. Not only does FL
avoid the costs of moving data, but it also mitigates privacy and
confidentiality concerns~\cite{kairouz2019advances}.
Yet, working with natural data distributions introduces new challenges for
learning systems, as
local datasets
reflect the usage and production patterns specific to each participant: in
other words, they are
\emph{heterogeneous}. An important type of data heterogeneity encountered in
classification problems, known as \emph{label distribution skew}
\cite{kairouz2019advances,quagmire}, occurs when the frequency of the
different classes varies significantly across local datasets, for instance
when each participant holds examples of only a few of the classes.
One of the key challenges in FL is to design algorithms that
can efficiently deal with such heterogeneous data distributions
\cite{kairouz2019advances,fedprox,scaffold,quagmire}.
Federated learning algorithms can be classified into two categories depending
on the underlying network topology they run on. In server-based FL, the
network is organized according to a star topology: a central server orchestrates the training process by
iteratively aggregating model updates received from the participants
(\emph{clients}) and sending back the aggregated model \cite{mcmahan2016communication}. In contrast,
fully decentralized FL algorithms operate over an arbitrary network topology
where participants communicate only with their direct neighbors
in the network. A classic example of such algorithms is Decentralized
SGD (D-SGD) \cite{lian2017d-psgd}, in which participants alternate between
local SGD updates and model averaging with neighboring nodes.
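For concreteness, here is a sketch of one common form of the D-SGD update
(the notation is ours and is only used for this illustration): writing
$x_i^{(t)}$ for the local model of node $i$ at step $t$, $\eta$ for the
learning rate, $F_j(\cdot\,; \xi_j^{(t)})$ for the loss of node $j$ on a
mini-batch $\xi_j^{(t)}$ drawn from its local data, and $W$ for a mixing
matrix with $W_{ij} > 0$ only if $i$ and $j$ are neighbors, each step
performs
\begin{equation*}
x_i^{(t+1)} = \sum_{j=1}^{n} W_{ij} \Big( x_j^{(t)} - \eta \nabla F_j\big(x_j^{(t)}; \xi_j^{(t)}\big) \Big),
\end{equation*}
i.e., a local stochastic gradient step followed by a weighted average with
the models of neighboring nodes.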
In this paper, we focus on fully decentralized algorithms as they can
generally scale better to the large number of participants seen in ``cross-device''
applications \cite{kairouz2019advances}. Indeed, while a central
server may quickly become a bottleneck as the number of participants
increases, the topology used in fully decentralized algorithms can remain
sparse enough that each participant only needs to communicate with a small
number of other participants, i.e., nodes have a small (constant or
logarithmic) degree
\cite{lian2017d-psgd}. In the homogeneous setting where data is
independent and identically distributed (IID) across nodes, recent work
has shown both empirically
\cite{lian2017d-psgd,Lian2018} and theoretically \cite{neglia2020} that sparse
topologies like rings or grids
do not significantly affect the convergence
speed compared to using denser topologies.
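To make this notion of sparsity concrete, the following minimal Python
sketch (our own illustration, independent of the simulator used in our
experiments) computes the average node degree of 100-node versions of the
topologies used in Figure~\ref{fig:iid-vs-non-iid-problem}; note that a
torus variant of the grid would instead give every node degree 4.
\begin{verbatim}
import math

def ring_degrees(n):
    # On a ring, node i is connected to (i-1) mod n and (i+1) mod n.
    return [2] * n

def grid_degrees(n):
    # Nodes on a sqrt(n) x sqrt(n) grid (n is assumed to be a perfect
    # square), each connected to its horizontal and vertical neighbors.
    s = math.isqrt(n)
    return [(r > 0) + (r < s - 1) + (c > 0) + (c < s - 1)
            for r in range(s) for c in range(s)]

n = 100
for name, deg in [("ring", ring_degrees(n)),
                  ("grid", grid_degrees(n)),
                  ("fully connected", [n - 1] * n)]:
    print(name, "average degree:", sum(deg) / n)
# ring 2.0, grid 3.6, fully connected 99.0
\end{verbatim}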
\begin{figure*}[t]
\centering
% From directory results/mnist
% python ../../../../Software/non-iid-topology-simulator/tools/plot_convergence.py ring/iid/all/2021-03-30-16:07:06-CEST ring/non-iid/all/2021-03-30-16:07:03-CEST --add-min-max --legend 'lower right' --yaxis test-accuracy --labels '100 nodes IID' '100 nodes non-IID' --save-figure ../../figures/ring-IID-vs-non-IID.png --font-size 20 --linestyles 'solid' 'dashed'
\begin{subfigure}[b]{0.25\textwidth}
\centering
\includegraphics[width=\textwidth]{../figures/ring-IID-vs-non-IID}
\caption{\label{fig:ring-IID-vs-non-IID} Ring topology}
\end{subfigure}
\quad
% From directory results/mnist
% python ../../../../Software/non-iid-topology-simulator/tools/plot_convergence.py grid/iid/all/2021-03-30-16:07:01-CEST grid/non-iid/all/2021-03-30-16:06:59-CEST --add-min-max --legend 'lower right' --yaxis test-accuracy --labels '100 nodes IID' '100 nodes non-IID' --save-figure ../../figures/grid-IID-vs-non-IID.png --font-size 20 --linestyles 'solid' 'dashed'
\begin{subfigure}[b]{0.25\textwidth}
\centering
\includegraphics[width=\textwidth]{../figures/grid-IID-vs-non-IID}
\caption{\label{fig:grid-IID-vs-non-IID} Grid topology}
\end{subfigure}
\quad
% From directory results/mnist
% python ../../../../Software/non-iid-topology-simulator/tools/plot_convergence.py fully-connected/iid/all/2021-03-30-16:07:20-CEST fully-connected/all/2021-03-10-09:25:19-CET --add-min-max --legend 'lower right' --yaxis test-accuracy --labels '100 nodes IID' '100 nodes non-IID' --save-figure ../../figures/fully-connected-IID-vs-non-IID.png --font-size 20 --linestyles 'solid' 'dashed'
\begin{subfigure}[b]{0.25\textwidth}
\centering
\includegraphics[width=\textwidth]{../figures/fully-connected-IID-vs-non-IID}
\caption{\label{fig:fully-connected-IID-vs-non-IID} Fully-connected topology}
\end{subfigure}
\caption{Convergence speed of decentralized
SGD with and without label distribution skew for different topologies.
The task is logistic regression on MNIST (see
Section~\ref{section:experimental-settings} for details on
the experimental setup). Bold lines show the
average test
accuracy across nodes
while thin lines show the minimum
and maximum accuracy of individual nodes. While the effect of topology
is negligible for homogeneous data, it is very significant in the
heterogeneous case. On a fully-connected network, both cases converge
similarly.}
\label{fig:iid-vs-non-iid-problem}
\end{figure*}
In contrast to the homogeneous case, however, our experiments demonstrate that
\emph{the impact of topology is extremely significant for heterogeneous data}.
This phenomenon is illustrated in Figure~\ref{fig:iid-vs-non-iid-problem}: we observe that under
label distribution skew, using a
sparse topology (a ring or
a grid) clearly jeopardizes the convergence speed of decentralized SGD.
We stress the fact that, unlike the slowdowns commonly attributed to
multiple local update steps in federated learning
\cite{mcmahan2016communication,scaffold,quagmire}, this
happens even when nodes perform a single local update before averaging the
model with their neighbors. In this paper, we thus address the following
question:
\textit{Can we design sparse topologies with convergence
speed similar to that of a fully connected network for problems involving
many participants with label distribution skew?}
Specifically, we make the following contributions:
(1) We propose D-Cliques, a sparse topology in which nodes are organized in
interconnected cliques, i.e., locally fully-connected sets of nodes, such that
the joint label distribution of each clique is close to that of the global
distribution; (2) We design a greedy algorithm for
constructing such cliques efficiently;
(3) We introduce Clique Averaging, a modified version of
the standard D-SGD algorithm which decouples gradient averaging, used for
optimizing local models, from distributed averaging, used to ensure that all
models converge, thereby reducing the bias introduced by inter-clique
connections;
(4) We show how Clique Averaging can be used to implement momentum without
the bias that would otherwise make it detrimental under label distribution
skew; (5) We
demonstrate
through an extensive experimental study that our approach removes the effect
of label distribution skew when training a linear
model and a deep
convolutional network on the MNIST %~\cite{mnistWebsite}
and CIFAR10 % ~\cite{krizhevsky2009learning}
datasets, respectively; (6) Finally, we demonstrate the scalability of our
approach by considering up to 1000-node networks, in contrast to most
previous work on fully decentralized learning that considers only a few tens
of nodes
\cite{tang18a,neglia2020,momentum_noniid,cross_gradient,consensus_distance}.
For instance, our results show that using D-Cliques in a 1000-node network
requires 98\% fewer edges ($18.9$ vs $999$ edges per participant on average)
to obtain a convergence speed similar to that of a fully-connected topology,
thereby yielding a 96\% reduction in the total number of required messages
(37.8 messages per round per node on average instead of 999). Furthermore,
an additional 22\% improvement
% (14.5 edges per node on average instead of 18.9)
is possible when using a small-world inter-clique topology, with further
potential gains at larger scales since the total number of edges grows only
quasilinearly, in $O(n \log n)$, with the number of nodes $n$.
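As a back-of-the-envelope check of these figures, the following sketch
encodes our counting assumptions (one message per neighbor per round on the
fully-connected topology, and two messages per neighbor per round for
D-Cliques to account for the separate gradient and model averaging phases
of Clique Averaging) and recovers the reported reductions:
\begin{verbatim}
n = 1000                 # network size
fc_edges = n - 1         # edges per node, fully-connected topology
dc_edges = 18.9          # average edges per node, D-Cliques

# Edge reduction: 1 - 18.9/999, i.e. roughly 98%.
print(f"edge reduction: {1 - dc_edges / fc_edges:.1%}")

# Messages per round per node: one per neighbor when fully connected,
# two per neighbor for D-Cliques (gradients, then models).
fc_msgs = fc_edges
dc_msgs = 2 * dc_edges
print(f"message reduction: {1 - dc_msgs / fc_msgs:.1%}")
# edge reduction: 98.1%, message reduction: 96.2%
\end{verbatim}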
The rest of this paper is organized as follows. We first present the problem
statement and our methodology (Section~\ref{section:problem}). The D-Cliques
design is presented in Section~\ref{section:d-cliques} along with an
empirical illustration of its benefits. In
Section~\ref{section:clique-averaging-momentum}, we
show how to further reduce bias with Clique Averaging and how to use it to
implement momentum. We present the results of our extensive experimental
study in Section~\ref{section:non-clustered}. We review related work in
Section~\ref{section:related-work}, and conclude with promising directions
for future work in Section~\ref{section:conclusion}.