We note that recent work explores rings of server-based topologies
\cite{tornado}, but its focus is on making server-based FL scale to a large
number of clients rather than on dealing with non-IID data.
\paragraph{Dealing with non-IID data in fully decentralized FL.}
Non-IID data is known to negatively impact the convergence speed
of fully decentralized FL algorithms in practice \cite{jelasity}. Aside from approaches that aim to learn personalized models \cite{Vanhaesebrouck2017a,Zantedeschi2020a}, this
motivated the design of algorithms with modified updates based on variance
aggregation \cite{cross_gradient}, or multiple averaging steps
between updates (see \cite{consensus_distance} and references therein). These
algorithms
typically require additional communication and/or computation, and have
only been evaluated in small-scale networks with a few tens of nodes.\footnote{We
also observed that \cite{tang18a} is subject to numerical
instabilities when run on topologies other than rings. When
the rows and columns of $W$ do not exactly
sum to $1$ (due to finite precision), these small differences get amplified by
the proposed updates and make the algorithm diverge.}
% non-IID known to be a problem for fully decentralized FL. cf Jelasity paper
% D2 and other recent papers on modifying updates: Quasi-Global Momentum,
% Cross-Gradient Aggregation
% D2 \cite{tang18a}: numerically unstable when $W_{ij}$ rows and columns do not exactly
% sum to $1$, as the small differences are amplified in a positive feedback loop. More work is therefore required on the algorithm to make it usable with a wider variety of topologies. In comparison, D-cliques do not modify the SGD algorithm and instead simply removes some neighbor contributions that would otherwise bias the direction of the gradient. D-Cliques with D-PSGD are therefore as tolerant to ill-conditioned $W_{ij}$ matrices as regular D-PSGD in an IID setting.
In contrast, D-Cliques focuses on the design of a sparse topology that is
able to compensate for the effect of non-IID data and scales to large
networks. We do not modify the simple
and efficient D-SGD
algorithm \cite{lian2017d-psgd} beyond removing some neighbor
contributions
that would otherwise bias the direction of the gradient.
% An originality of our approach is to focus on the effect of topology
% level without significantly changing the original simple and efficient D-SGD
% algorithm \cite{lian2017d-psgd}. Other work to mitigate the effect of non-IID