related_work.tex

% !TEX root = main.tex

\section{Related Work}
\label{section:related-work}

In this section, we review some related work on dealing with non-IID data in
federated learning, and on the role of topology in fully decentralized
algorithms.

\paragraph{Dealing with non-IID data in server-based FL.}
Non-IID data is not much of an issue in server-based FL if
clients send their parameters to the server after each gradient update.
Problems arise when one seeks to reduce
the number of communication rounds by allowing each participant to perform
multiple local updates, as in the popular FedAvg algorithm 
\cite{mcmahan2016communication}. Indeed, non-IID data can prevent
such algorithms from
converging to a good solution \cite{quagmire,scaffold}. This led to the design
of algorithms that are specifically designed to mitigate the impact
of non-IID data while performing
multiple local updates, using adaptive client sampling \cite{quagmire}, update
corrections \cite{scaffold} or regularization in the local objective 
\cite{fedprox}. Another direction is to embrace the non-IID scenario by
learning personalized models for each client 
\cite{smith2017federated,perso_fl_mean,maml,moreau}.
We note that recent work explores rings of server-based topologies 
\cite{tornado}, but the focus is not on dealing with non-IID data but
to make server-based FL more scalable to a large number of clients.

\paragraph{Dealing with non-IID data in fully decentralized FL.}
Non-IID data is known to negatively impact the convergence speed
of fully decentralized FL algorithms in practice \cite{jelasity}. Aside from approaches that aim to learn personalized models \cite{Vanhaesebrouck2017a,Zantedeschi2020a}, this
motivated the design of algorithms with modified updates based on variance
reduction \cite{tang18a}, momentum correction \cite{momentum_noniid},
cross-gradient
aggregation \cite{cross_gradient}, or multiple averaging steps
between updates (see \cite{consensus_distance} and references therein). These
algorithms
typically require significantly more communication and/or computation, and
have only been evaluated on small-scale networks with a few tens of
nodes.\footnote{We
also observed that \cite{tang18a} is subject to numerical
instabilities when run on topologies other than rings. When
the rows and columns of $W$ do not exactly
sum to $1$ (due to finite precision), these small differences get amplified by
the proposed updates and make the algorithm diverge.}
In contrast, D-Cliques focuses on the design of a sparse topology which is
able to compensate for the effect of non-IID data and scales to large
networks. We do not modify the simple
and efficient D-SGD
algorithm \cite{lian2017d-psgd} beyond removing some neighbor
contributions
that otherwise bias the gradient direction.

\paragraph{Impact of topology in fully decentralized FL.} It is well
known
that the choice of network topology can affect the
convergence of fully decentralized algorithms. In theoretical convergence
rates, this is typically accounted
for by a dependence on the spectral gap of
the network, see for instance 
\cite{Duchi2012a,Colin2016a,lian2017d-psgd,Nedic18}.
However, for IID data, practice contradicts these classic
results as fully decentralized algorithms have been observed to converge
essentially as fast
on sparse topologies like rings or grids as they do on a fully connected
network \cite{lian2017d-psgd,Lian2018}. Recent work 
\cite{neglia2020,consensus_distance} sheds light on this phenomenon with refined convergence analyses based on differences between gradients or parameters across nodes, which are typically
smaller in the IID case. However, these results do not give any clear insight
regarding the role of the topology in the non-IID case. We note that some work
has gone into designing efficient topologies to optimize the use of
network resources (see e.g., \cite{marfoq}), but the topology is chosen
independently of how data is distributed across nodes. In summary, the role
of topology in the non-IID data scenario is not well understood and we are not
aware of prior work focusing on this question. Our work is the first
to show that an
appropriate choice of data-dependent topology can effectively compensate for
non-IID data.