However, for IID data, practice contradicts these classic
% mention Neglia and empirical results for IID data, probably also Consensus
results: fully decentralized algorithms converge essentially as fast
% Control paper which does not allow to analyze the effect of topology.
on sparse topologies like rings or grids as they do on a fully connected
% can mention Marfoq paper on topology design but to optimize network
graph \cite{lian2017d-psgd,Lian2018}. Recent work
% resources, independent of data
\cite{neglia2020,consensus_distance} sheds light on this phenomenon with refined convergence analyses based on differences between gradients or parameters across nodes, which are typically
% conclusion: role of topology in non-IID is not understood / has not
smaller in the IID case. However, these results do not give any clear insight
% been much studied before our work.
regarding the role of the topology in the non-IID case. We note that some work
has gone into designing efficient topologies to optimize the use of
network resources (see, e.g., \cite{marfoq}), but this is done independently
of how data is distributed across nodes. In summary, the role
of topology in the
non-IID data scenario is
not well understood and we are not aware of prior work focusing on this
question.
\paragraph{Dealing with non-IID data in server-based FL.}
Dealing with non-IID data in server-based FL has
% scaffold, quagmire, fedprox, etc
recently attracted a lot of interest. While non-IID data is not an issue if
% also personalized models: Smith etc
clients send their parameters to the server after each gradient update,
problems arise when one seeks to reduce
the number of communication rounds by allowing each participant to perform
multiple local updates, as in the popular FedAvg algorithm
\cite{mcmahan2016communication}. This has led to extensions that are
specifically designed to mitigate the impact of non-IID data when performing
multiple local updates, using adaptive sampling \cite{quagmire}, update
corrections \cite{scaffold} or regularization in the local objective
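\cite{fedprox}. Another direction is to embrace the non-IID scenario by
learning personalized models. Other work mitigates the effect of non-IID data
on decentralized algorithms with modified updates, such as cross-gradient
aggregation \cite{cross_gradient}, or multiple averaging steps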
between updates (see \cite{consensus_distance} and references therein). These
algorithms
typically require additional communication and/or computation.\footnote{We
also observed that \cite{tang18a} is subject to numerical
instabilities when run on topologies other than rings and grids. When
the rows and columns of $W$ do not exactly
sum to $1$ (due to finite precision), these small differences get amplified by
the proposed updates and make the algorithm diverge.}
% non-IID known to be a problem for fully decentralized FL. cf Jelasity paper
% D2 and other recent papers on modifying updates: Quasi-Global Momentum,
% Cross-Gradient Aggregation
% papers using multiple averaging steps
% also our personalized papers
% D2 \cite{tang18a}: numerically unstable when $W_{ij}$ rows and columns do not exactly
% sum to $1$, as the small differences are amplified in a positive feedback loop. More work is therefore required on the algorithm to make it usable with a wider variety of topologies. In comparison, D-cliques do not modify the SGD algorithm and instead simply removes some neighbor contributions that would otherwise bias the direction of the gradient. D-Cliques with D-PSGD are therefore as tolerant to ill-conditioned $W_{ij}$ matrices as regular D-PSGD in an IID setting.
In contrast, D-Cliques focuses on the design of a sparse topology that is
able to compensate for the effect of non-IID data. We do not modify the simple
and efficient D-SGD algorithm \cite{lian2017d-psgd} beyond removing some
neighbor contributions that would otherwise bias the direction of the
gradient.
\aurelien{add personalized models - or merge all that in specific paragraph}
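
For concreteness, the setting discussed above can be sketched as follows.
The notation is assumed for illustration only and is not fixed by this
section: $n$ nodes, local parameters $x_i^{(t)}$ at node $i$ and step $t$, a
step size $\gamma$, a local stochastic gradient $g_i^{(t)}$, and a mixing
matrix $W \in \mathbb{R}^{n \times n}$ whose nonzero entries follow the edges
of the topology. One common way to write a D-SGD step, a local gradient step
followed by neighborhood averaging, is
\[
  x_i^{(t+1)} = \sum_{j=1}^{n} W_{ij} \left( x_j^{(t)} - \gamma\, g_j^{(t)} \right),
  \qquad
  \sum_{j=1}^{n} W_{ij} = 1,
  \qquad
  \sum_{i=1}^{n} W_{ij} = 1 .
\]
Under these assumptions, the column-sum condition implies that averaging
preserves the network-wide mean,
\[
  \frac{1}{n} \sum_{i=1}^{n} x_i^{(t+1)}
  = \frac{1}{n} \sum_{j=1}^{n} \left( x_j^{(t)} - \gamma\, g_j^{(t)} \right),
\]
an identity that holds only approximately when the rows and columns of $W$
sum to $1$ up to finite precision.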
% An originality of our approach is to focus on the effect of topology
% level without significantly changing the original simple and efficient D-SGD
% algorithm \cite{lian2017d-psgd}. Other work to mitigate the effect of non-IID
% data on decentralized algorithms are based on performing modified updates (eg
% with variance reduction) or multiple averaging steps.