%AMK: add what is in there
In this section, we first compare D-Cliques to alternative topologies to
confirm the relevance of our main design choices. Then,
we evaluate some extensions of D-Cliques to further reduce the number of
inter-clique connections so as to gracefully scale with the number of
nodes.
\subsection{Comparing D-Cliques to Other Sparse Topologies} %Non-Clustered
% Topologies}
%We now show, in this section and the next, that the particular structure of D-Cliques is necessary. \label{section:non-clustered}
We demonstrate the advantages of D-Cliques over alternative sparse topologies
that have a similar number of edges. First, we consider topologies in which
the neighbors of each node are selected at random (hence without any clique
structure).
Specifically, for $n=100$ nodes, we
construct a random topology such that each node has exactly 10 edges, which is
similar to the average 9.9 edges of our D-Cliques topology
(Figure~\ref{fig:d-cliques-figure}). To better understand the role of
the clique structure beyond merely ensuring class representativity among
neighbors,
we also compare to a random topology similar to the one described above except
that edges are
chosen such that each node has neighbors of all possible classes. Finally, we
also implement an analog of Clique Averaging for these random topologies,
where all nodes de-bias their gradient based on the class distribution of
their neighbors. In the latter case, since nodes do not form a clique, each
node obtains a different average gradient.
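For illustration, the following Python sketch shows one way to construct the
random baseline and to implement the de-biasing step. It is a minimal sketch
assuming a \texttt{networkx} graph and integer class labels; the function
names are ours and it is not the exact experimental code.
\begin{verbatim}
import networkx as nx
import numpy as np

def random_topology(n=100, degree=10, seed=0):
    # Random baseline: every node gets exactly `degree` neighbors,
    # similar to the 9.9 average edges of the D-Cliques topology.
    return nx.random_regular_graph(degree, n, seed=seed)

def covers_all_classes(graph, labels, num_classes=10):
    # Property enforced by the class-representative baseline:
    # every node has at least one neighbor of each class.
    return all(set(labels[v] for v in graph.neighbors(u))
               == set(range(num_classes)) for u in graph.nodes)

def debiased_gradient(node, grads, graph):
    # Analog of Clique Averaging on a random topology: average the
    # local gradient with those of the neighbors. Since neighborhoods
    # differ, each node obtains a different average gradient.
    neigh = list(graph.neighbors(node)) + [node]
    return np.mean([grads[v] for v in neigh], axis=0)
\end{verbatim}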
The results for MNIST and CIFAR10 are shown in
Figure~\ref{fig:d-cliques-comparison-to-non-clustered-topologies}. For MNIST,
a purely random topology has higher variance and lower convergence speed than
D-Cliques (with or without Clique Averaging), while a random topology with
class representativity performs similarly to D-Cliques without Clique
Averaging. However, and perhaps surprisingly, a random topology with unbiased
gradient performs slightly worse than the same topology without it. In any
case, D-Cliques with
Clique Averaging outperforms all random topologies, showing that the clique
structure has a small but noticeable effect on the average accuracy and
significantly reduces the variance across nodes in this setup.
\begin{figure}[t]
\centering
\caption{\label{fig:d-cliques-comparison-to-non-clustered-topologies} Comparison to Non-Clustered Topologies}
\end{figure}
On the harder CIFAR10 dataset with a deep convolutional network, the
differences are much more dramatic:
D-Cliques with Clique Averaging and momentum turns out to be critical for fast
convergence.
Crucially, all random topologies fail to converge to a good solution. This
confirms that our clique structure is important to reduce variance
across nodes and improve convergence. The difference with the previous
experiment seems to be due both to the use of a higher capacity model and to
the intrinsic characteristics of the datasets. We refer
to the appendix for results on MNIST with LeNet.
% We have tried to use LeNet on
% MNIST to see if the difference between MNIST and CIFAR10 could be attributed to the capacity difference between the Linear and Convolutional networks, whose optimization may benefit from clustering (see Appendix). The difference is less dramatic than for CIFAR10, so it must be that the dataset also has an impact. The exact nature of it is still an open question.
instrumental in obtaining good performance, one may wonder whether
intra-clique full connectivity is actually necessary.
%AMK: check sentence above: justify
Figure~\ref{fig:d-cliques-intra-connectivity} shows the convergence speed of
a D-Cliques topology where cliques have been sparsified by randomly
removing 1 or 5 edges per clique (out of 45). Strikingly, both for MNIST and
CIFAR10, removing just a single edge from the cliques has a
significant effect on the
convergence speed. On CIFAR10, it even entirely negates the
benefits of D-Cliques.
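For reference, the sparsification used in this experiment can be sketched as
follows; this is a minimal illustration assuming a \texttt{networkx} graph and
cliques given as lists of node identifiers.
\begin{verbatim}
import random

def sparsify_cliques(graph, cliques, k, seed=0):
    # Randomly remove k intra-clique edges in each clique
    # (out of c(c-1)/2 = 45 for cliques of size c = 10).
    rng = random.Random(seed)
    for clique in cliques:
        edges = [(u, v) for i, u in enumerate(clique)
                 for v in clique[i + 1:]]
        for u, v in rng.sample(edges, k):
            graph.remove_edge(u, v)
    return graph
\end{verbatim}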
Overall, these results show that achieving fast convergence on non-IID
data with sparse topologies requires a very careful design, as we have
proposed with D-Cliques.
\end{figure}
%\section{Scaling with Different Inter-Clique Topologies}
\subsection{Scaling up D-Cliques with Sparser Inter-Clique Topologies}
%with Different Inter-Clique Topologies}
\label{section:interclique-topologies}
So far, we have used a fully-connected inter-clique topology for D-Cliques,
which has the advantage of bounding the
average shortest path to $2$ between any pair of nodes. This choice requires
$\frac{n}{c}(\frac{n}{c} - 1)$ inter-clique edges, which scales quadratically
in the number of nodes. This can become significant at larger scales when $n$ is
large compared to $c$.
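For concreteness, with $n = 1000$ nodes and cliques of size $c = 10$ as in the
experiments below, this amounts to
\begin{equation*}
\frac{n}{c}\Big(\frac{n}{c} - 1\Big) = 100 \times 99 = 9900
\end{equation*}
inter-clique edges, i.e., $9.9$ per node on average on top of the $c - 1 = 9$
intra-clique edges maintained by each node.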
In this last series of experiments, we evaluate the effect of choosing sparser
inter-clique topologies on the convergence speed for a larger network of 1000
nodes. We compare the scalability and convergence speed of several
D-Cliques variants, which all start from the same clique structure and thus
use $O(nc)$ intra-clique edges.
The inter-clique topology with (almost) the fewest possible edges is a
\textit{ring}, which uses $\frac{n}{c}$ inter-clique edges and therefore
scales linearly in $n$.
We also consider another topology that scales linearly and achieves a
logarithmic
bound on the
average shortest number of hops between two nodes. In this hierarchical scheme
that we call \textit{fractal}, cliques are
assembled in
larger groups of $c$ cliques that are connected internally with one edge per
pair of cliques, but with only one edge between pairs of larger groups. The
topology is built recursively such that $c$ groups will themselves form a
larger group at the next level up. This results in $O(nc)$ edges in total if
edges are evenly distributed, and therefore also scales linearly in the number
of nodes.
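A possible construction of this fractal scheme is sketched below. Since the
description above leaves the choice of edge endpoints open, this illustrative
Python sketch (with our own function names) draws them at random within each
group.
\begin{verbatim}
import itertools
import random

def fractal_interclique(graph, groups, c, rng=None):
    # `groups` initially contains the cliques (lists of node ids).
    # At each level, form super-groups of c groups, add one edge
    # between every pair of groups inside a super-group, then
    # recurse so that pairs of higher-level groups are joined by
    # a single edge.
    rng = rng or random.Random(0)
    if len(groups) <= 1:
        return graph
    supers = [groups[i:i + c] for i in range(0, len(groups), c)]
    for super_group in supers:
        for g1, g2 in itertools.combinations(super_group, 2):
            graph.add_edge(rng.choice(g1), rng.choice(g2))
    # Merge each super-group into a single pool for the next level.
    return fractal_interclique(graph, [sum(s, []) for s in supers],
                               c, rng)
\end{verbatim}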
Finally, we propose to connect cliques according to a
small-world-like topology~\cite{watts2000small} applied on top of a
ring~\cite{stoica2003chord}. In this scheme, cliques are first arranged in a
ring. Then each clique adds symmetric edges, both clockwise and
counter-clockwise on the ring, with the $ns$ closest cliques in sets of
cliques that are exponentially bigger the further they are on the ring (see
Algorithm~\ref{Algorithm:Smallworld} in the appendix for
details on the construction). This ensures good connectivity with other
cliques that are close on the ring, while still keeping the average shortest
path small. This scheme uses $2(ns)\log(\frac{n}{c})$ inter-clique edges and
therefore grows in the order of $O(n + \log(n))$ with the number of nodes.
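While the precise construction is given by
Algorithm~\ref{Algorithm:Smallworld}, the following illustrative Python sketch
(again with our own function names, and under our interpretation of the
distance bands) conveys the idea.
\begin{verbatim}
import math
import random

def smallworld_interclique(graph, cliques, ns, rng=None):
    # Cliques are arranged in a ring; each clique connects to ns
    # cliques in every distance band [2^k, 2^(k+1)), clockwise and
    # counter-clockwise, so that bands farther on the ring are
    # exponentially bigger but receive the same number of edges.
    rng = rng or random.Random(0)
    m = len(cliques)
    for i in range(m):
        for k in range(max(1, int(math.log2(m)))):
            band = list(range(2 ** k, min(2 ** (k + 1), m)))
            for direction in (1, -1):
                for d in rng.sample(band, min(ns, len(band))):
                    j = (i + direction * d) % m
                    graph.add_edge(rng.choice(cliques[i]),
                                   rng.choice(cliques[j]))
    return graph
\end{verbatim}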
Figure~\ref{fig:d-cliques-cifar10-convolutional} shows the convergence speed
of all the above schemes on MNIST and CIFAR10, compared to the ideal baseline
of a single IID node performing the same number of updates per epoch
(representing the fastest convergence speed achievable if topology had no
impact). The ring
topology converges but is much slower, while our fractal scheme helps
significantly. The sweet spot appears to be the small-world
topology, as the convergence speed is almost the same as with a
fully-connected inter-clique topology but with 22\% fewer edges
(14.5 edges on average instead of 18.9). Note that we can expect bigger
gains at larger scales. Nonetheless, we stress that even the
fully-connected topology offers
significant benefits with 1000 nodes, as it represents a 98\% reduction in the
number of edges compared to fully connecting individual nodes (18.9 edges on
average instead of 999) and a 96\% reduction in the number of messages (37.8
messages per round per node on average instead of 999). Overall, these results
show that D-Cliques can gracefully scale with the number of nodes.
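As a quick sanity check on these figures, $18.9 / 999 \approx 0.019$ and
$37.8 / 999 \approx 0.038$, which indeed correspond to reductions of roughly
$98\%$ in the number of edges and $96\%$ in the number of messages.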
\begin{figure}[t]
\centering
\label{section:related-work}
In this section, we review some related work on dealing with non-IID data in
federated learning, and on the role of topology in fully decentralized
algorithms.
\paragraph{Dealing with non-IID data in server-based FL.}
Non-IID data is not much of an issue in server-based FL if
clients send their parameters to the server after each gradient update.
Problems arise when one seeks to reduce
the number of communication rounds by allowing each participant to perform
multiple local updates, as in the popular FedAvg algorithm
\cite{mcmahan2016communication}. Indeed, non-IID data can prevent
such algorithms from
converging to a good solution \cite{quagmire,scaffold}. This led to the
development of algorithms specifically designed to mitigate the impact
of non-IID data while performing
multiple local updates, using adaptive client sampling \cite{quagmire}, update
corrections \cite{scaffold} or regularization in the local objective
\cite{fedprox}. Another direction is to embrace the non-IID scenario by
cross-gradient
aggregation \cite{cross_gradient}, or multiple averaging steps
between updates (see \cite{consensus_distance} and references therein). These
algorithms
typically require significantly more communication and/or computation, and
have only been evaluated on small-scale networks with a few tens of
nodes.\footnote{We
also observed that \cite{tang18a} is subject to numerical
instabilities when run on topologies other than rings. When
the rows and columns of $W$ do not exactly
\paragraph{Impact of topology in fully decentralized FL.} It is well
known
that the choice of network topology can affect the
convergence of fully decentralized algorithms. In theoretical convergence
rates, this is typically accounted for by a dependence on the spectral gap of
the network; see for instance
\cite{Duchi2012a,Colin2016a,lian2017d-psgd,Nedic18}.
However, for IID data, practice contradicts these classic
results, as fully decentralized algorithms have been observed to converge
essentially as fast
on sparse topologies like rings or grids as they do on a fully connected
network \cite{lian2017d-psgd,Lian2018}. Recent work
\cite{neglia2020,consensus_distance} sheds light on this phenomenon with refined convergence analyses based on differences between gradients or parameters across nodes, which are typically
smaller in the IID case. However, these results do not give any clear insight
regarding the role of the topology in the non-IID case. We note that some work
has gone into designing efficient topologies to optimize the use of
network resources (see e.g., \cite{marfoq}), but the topology is chosen
independently of how data is distributed across nodes. In summary, the role
of topology in the non-IID data scenario is not well understood and we are not
aware of prior work focusing on this question. Our work is the first to show
that an
appropriate choice of data-dependent topology can effectively compensate for
non-IID data.