% !TEX root = main.tex \section{Related Work} \label{section:related-work} In this section, we review some related work on dealing with non-IID data in federated learning, and on the role of topology in fully decentralized algorithms. \paragraph{Dealing with non-IID data in server-based FL.} Non-IID data is not much of an issue in server-based FL if clients send their parameters to the server after each gradient update. Problems arise when one seeks to reduce the number of communication rounds by allowing each participant to perform multiple local updates, as in the popular FedAvg algorithm \cite{mcmahan2016communication}. Indeed, non-IID data can prevent such algorithms from converging to a good solution \cite{quagmire,scaffold}. This led to the design of algorithms that are specifically designed to mitigate the impact of non-IID data while performing multiple local updates, using adaptive client sampling \cite{quagmire}, update corrections \cite{scaffold} or regularization in the local objective \cite{fedprox}. Another direction is to embrace the non-IID scenario by learning personalized models for each client \cite{smith2017federated,perso_fl_mean,maml,moreau}. We note that recent work explores rings of server-based topologies \cite{tornado}, but the focus is not on dealing with non-IID data but to make server-based FL more scalable to a large number of clients. \paragraph{Dealing with non-IID data in fully decentralized FL.} Non-IID data is known to negatively impact the convergence speed of fully decentralized FL algorithms in practice \cite{jelasity}. Aside from approaches that aim to learn personalized models \cite{Vanhaesebrouck2017a,Zantedeschi2020a}, this motivated the design of algorithms with modified updates based on variance reduction \cite{tang18a}, momentum correction \cite{momentum_noniid}, cross-gradient aggregation \cite{cross_gradient}, or multiple averaging steps between updates (see \cite{consensus_distance} and references therein). These algorithms typically require significantly more communication and/or computation, and have only been evaluated on small-scale networks with a few tens of nodes.\footnote{We also observed that \cite{tang18a} is subject to numerical instabilities when run on topologies other than rings. When the rows and columns of $W$ do not exactly sum to $1$ (due to finite precision), these small differences get amplified by the proposed updates and make the algorithm diverge.} In contrast, D-Cliques focuses on the design of a sparse topology which is able to compensate for the effect of non-IID data and scales to large networks. We do not modify the simple and efficient D-SGD algorithm \cite{lian2017d-psgd} beyond removing some neighbor contributions that otherwise bias the gradient direction. \paragraph{Impact of topology in fully decentralized FL.} It is well known that the choice of network topology can affect the convergence of fully decentralized algorithms. In theoretical convergence rates, this is typically accounted for by a dependence on the spectral gap of the network, see for instance \cite{Duchi2012a,Colin2016a,lian2017d-psgd,Nedic18}. However, for IID data, practice contradicts these classic results as fully decentralized algorithms have been observed to converge essentially as fast on sparse topologies like rings or grids as they do on a fully connected network \cite{lian2017d-psgd,Lian2018}. Recent work \cite{neglia2020,consensus_distance} sheds light on this phenomenon with refined convergence analyses based on differences between gradients or parameters across nodes, which are typically smaller in the IID case. However, these results do not give any clear insight regarding the role of the topology in the non-IID case. We note that some work has gone into designing efficient topologies to optimize the use of network resources (see e.g., \cite{marfoq}), but the topology is chosen independently of how data is distributed across nodes. In summary, the role of topology in the non-IID data scenario is not well understood and we are not aware of prior work focusing on this question. Our work is the first to show that an appropriate choice of data-dependent topology can effectively compensate for non-IID data.