\textit{Can we design sparse topologies with convergence
speed similar to a fully connected network for problems involving
many participants with label distribution skew?}
Moreover, common analysis techniques to compare the asymptotic convergence behavior
of different topologies assume the impact on gradients is bounded by an unknown constant (e.g.,~\cite{ying2021exponential}), and therefore ignore it. For example, asymptotic analysis suggests that an expander graph, a sparse topology whose number of edges per node scales logarithmically with the number of nodes, achieves exact averaging~\cite{ying2021exponential}: if this were the case, it would remove the effect of the topology on distributed averaging and reduce D-SGD to FedSGD~\cite{mcmahan2016communication} regardless of how data is partitioned. However, as we will show, in practice not only is the effect of data heterogeneity sufficient to prevent exact averaging, but a different topology using as many or fewer edges can also converge faster. In the rest of this paper, we therefore rely on rigorous, repeatable experiments, rather than asymptotic convergence analysis, to accurately compare the convergence speed of a wide diversity of topologies. This provides a stronger basis for quantifying the effect of data heterogeneity and should motivate future work on adapting analysis techniques to correctly account for it.
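As a concrete illustration of such a sparse topology, the static exponential graph promoted in~\cite{ying2021exponential} connects node $i$ to nodes $(i + 2^k) \bmod n$, giving each node $\log_2 n$ out-neighbors. The following sketch is our own illustrative code, not the authors' implementation:

```python
import math

def exponential_graph(n):
    """Static exponential graph on n nodes (n a power of two):
    node i links to (i + 2**k) % n for k = 0..log2(n)-1, so the
    per-node degree grows logarithmically with the network size."""
    hops = int(math.log2(n))
    return {i: [(i + 2**k) % n for k in range(hops)] for i in range(n)}

g = exponential_graph(16)
# node 0's out-neighbors: [1, 2, 4, 8] -- degree log2(16) = 4
```

With $n = 1000$ nodes, such a topology uses only about 10 out-edges per node, which is why it is an attractive point of comparison for sparse decentralized learning.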
Specifically, we make the following contributions:
(1) We propose D-Cliques, a sparse topology in which nodes are organized in
...
...
optimizing local models, from distributed averaging, used to ensure that all
models converge, thereby reducing the bias introduced by inter-clique
connections;
(4) We show how Clique Averaging can be used to implement unbiased momentum
that would otherwise be detrimental in the heterogeneous setting; (5) Through
an extensive experimental study on decentralized learning of linear
models and deep
convolutional networks on MNIST %~\cite{mnistWebsite}
and CIFAR10 datasets, % ~\cite{krizhevsky2009learning}
we validate our various design choices and
demonstrate that our approach is able to remove the effect
of label distribution skew while maintaining a sparse topology;
(6) Finally, we demonstrate the scalability of our
approach by considering up to 1000-node networks, in contrast to most
previous work on fully decentralized learning, which performs empirical
...
...
For instance, our results show that under strong label distribution skew,
using D-Cliques in a 1000-node network
requires 98\% fewer edges ($18.9$ vs $999$ edges per participant on average) to obtain a convergence speed similar to that of a fully connected topology,
thereby yielding a 96\% reduction in the total number of required messages
(37.8 messages per round per node on average instead of 999). An additional 22\% improvement
% (14.5 edges per node on average instead of 18.9)
is possible when using a small-world inter-clique topology, with further
potential gains at larger scales through a quasilinear $O(n
\log n)$ scaling in the number of nodes $n$.
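The edge and message figures above can be verified with a few lines of arithmetic. We assume here, based on the reported numbers rather than an explicit statement, that a fully connected node sends one message per neighbor per round while each D-Cliques edge carries two messages per round:

```python
n = 1000
fc_edges = n - 1                       # fully connected: 999 edges per node
dc_edges = 18.9                        # D-Cliques: reported average per node

edge_saving = 1 - dc_edges / fc_edges  # ~0.981 -> "98% fewer edges"
dc_msgs = 2 * dc_edges                 # 37.8 messages per round per node
msg_saving = 1 - dc_msgs / fc_edges    # ~0.962 -> "96% fewer messages"

print(round(100 * edge_saving), round(100 * msg_saving))  # 98 96
```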
We also show that D-Cliques empirically provide faster and more
robust convergence than random graphs
(such as the exponential graphs recently promoted in
\cite{ying2021exponential}) with a similar number of edges.
The rest of this paper is organized as follows.
We first describe the problem setting in Section~\ref{section:problem}. We
...
...
on sparse topologies like rings or grids as they do on a fully connected
network \cite{lian2017d-psgd,Lian2018}. Recent work
\cite{neglia2020,consensus_distance} sheds light on this phenomenon with refined convergence analyses based on differences between gradients or parameters across nodes, which are typically
smaller in the homogeneous case. However, these results do not give any clear insight
regarding the role of the topology in the presence of heterogeneous data.
Indeed, in all of the above analyses, the impact of data
heterogeneity
is abstracted away through (unknown) constants
that bound the variance of local gradients.
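Concretely, these analyses typically assume a bound of the following form (notation ours), where $f_i$ is the local objective of node $i$ and $f = \frac{1}{n}\sum_{i=1}^{n} f_i$ is the global objective:
```latex
\frac{1}{n} \sum_{i=1}^{n} \big\| \nabla f_i(x) - \nabla f(x) \big\|^2 \;\le\; \zeta^2
\qquad \text{for all } x.
```
Because $\zeta^2$ enters the resulting convergence bounds as an opaque constant, such analyses cannot capture how a particular topology interacts with a particular data distribution.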
We note that some work
has gone into designing topologies to optimize the use of
network resources (see e.g., \cite{marfoq}), but the topology is chosen
independently of how data is distributed across nodes.
In summary, the interplay between the network topology and data heterogeneity
is not well understood and we are not aware of prior work focusing on this
question. Our work is the first
to show that an
appropriate choice of data-dependent topology can effectively compensate for