... model with their neighbors. In this paper, we address the following question:
\textit{Can we design sparse topologies with convergence
speed similar to the one obtained in a fully connected network under
a large number of participants with local class bias?}
... with the centralized setting.
\label{section:non-clustered}
In this section, we first compare D-Cliques to alternative topologies to
confirm our main design choices. We then evaluate several extensions of
D-Cliques that further reduce the number of inter-clique connections, so as
to scale even better with the number of nodes.
\subsection{Comparing D-Cliques to Other Sparse Topologies}
We demonstrate the advantages of D-Cliques over alternative sparse topologies
with a similar number of edges. First, we consider topologies where the
neighbors of each node are selected at random without any clique structure.
Specifically, for $n=100$ nodes, we construct a random topology such that
each node has exactly 10 edges, slightly more than the average of 9.9 edges
in our previous D-Cliques example (Fig.~\ref{fig:d-cliques-figure}). To
better understand the importance of the clique structure independently of
the class representativity among neighbors, we also compare to a similar
random topology where edges are chosen such that each node has neighbors of
all possible classes. Finally, we implement an analog of Clique Averaging
for these random topologies, in which all nodes de-bias their gradient with
that of their neighbors. In the latter case, since nodes do not form
cliques, no two nodes actually compute the same resulting average gradient.
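
To make these baselines concrete, the following Python sketch shows one
possible way to generate them; the function names and the use of the
\texttt{networkx} library are illustrative choices, not the code used in our
experiments.

\begin{verbatim}
import random
import networkx as nx

def random_10_regular(n=100, seed=1):
    # Random topology where every node has exactly 10 edges.
    return nx.random_regular_graph(10, n, seed=seed)

def class_representative(node_class, seed=1):
    # Random topology where each node draws one neighbor per
    # class, so that all classes are represented among its
    # neighbors, without forming cliques. Degrees end up close
    # to, but not exactly, 10.
    rng = random.Random(seed)
    classes = sorted(set(node_class))
    by_class = {c: [v for v, k in enumerate(node_class) if k == c]
                for c in classes}
    edges = set()
    for v in range(len(node_class)):
        for c in classes:
            u = rng.choice([u for u in by_class[c] if u != v])
            edges.add((min(u, v), max(u, v)))
    return edges
\end{verbatim}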
The results for MNIST and CIFAR10 are shown in
Figure~\ref{fig:d-cliques-comparison-to-non-clustered-topologies}. For MNIST,
a purely random topology has higher variance and slower convergence than
D-Cliques, with or without Clique Averaging, while a random topology with
class representativity performs similarly to D-Cliques without Clique
Averaging. Perhaps surprisingly, a random topology with unbiased gradients
performs slightly worse than without them. In all cases, D-Cliques with
Clique Averaging performs better than any random topology, showing that the
clique structure has a small but significant effect in this setup.
\begin{figure}[htbp]
\centering
\begin{subfigure}[b]{0.48\textwidth}
% To regenerate the figure, from directory results/mnist
...
\caption{\label{fig:d-cliques-comparison-to-non-clustered-topologies} Comparison to Non-Clustered Topologies}
\end{figure}
On the harder CIFAR10 dataset, the differences are much more dramatic:
D-Cliques with Clique Averaging and momentum is critical for good
convergence. Crucially, all random topologies fail to converge to a good
solution. This confirms that our clique structure is important to reduce
variance across nodes and improve convergence. The difference with the
previous experiment seems to be due both to the use of a higher-capacity
model with local optima and to the intrinsic characteristics of the
datasets. We refer to the appendix for results on MNIST with LeNet.

\subsection{Importance of Intra-Clique Full Connectivity}
\label{section:intra-clique-connectivity}

While the previous experiments suggest that our clique structure is
instrumental in obtaining good performance, one may wonder whether
intra-clique full connectivity is actually necessary.
Figure~\ref{fig:d-cliques-intra-connectivity} shows the convergence speed of
D-Cliques when cliques have been sparsified by randomly removing 1 or 5 of
their 45 edges. Strikingly, for both MNIST and CIFAR10, sparsifying the
cliques even slightly has a significant effect on the convergence speed. In
the case of CIFAR10, it even entirely negates the benefits of D-Cliques.
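
For reference, the sparsification procedure can be sketched as follows; the
function name is an illustrative choice and the snippet only approximates
our experimental setup.

\begin{verbatim}
import random

def sparsify_clique(intra_edges, k, seed=1):
    # Remove k randomly chosen edges among the 45 intra-clique
    # edges of a clique of 10 nodes (here, k = 1 or 5).
    edges = list(intra_edges)
    random.Random(seed).shuffle(edges)
    return edges[k:]
\end{verbatim}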
Overall, these experiments show that achieving fast convergence on non-IID
data with sparse topologies requires a very careful design, as we have
proposed with D-Cliques.
\begin{figure}[t]
\centering
\begin{subfigure}[b]{0.48\textwidth}
...
\end{figure}
\subsection{Scaling up with D-Cliques Extensions}
\label{section:interclique-topologies}
So far, we have used a fully-connected inter-clique topology for D-Cliques,
which bounds the average shortest path between any pair of nodes to $2$.
This uses $\frac{n}{c}(\frac{n}{c}-1)$ inter-clique edges, which scales
quadratically in the number of nodes and can become significant at larger
scales, when $n$ is large compared to $c$.
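
For instance, with $n=1000$ nodes and cliques of size $c=10$, as in the
experiments below, there are $\frac{n}{c}=100$ cliques and thus
$100 \times 99 = 9900$ (directed) inter-clique edges, i.e., $9.9$ per node
on average, on top of the $9$ intra-clique edges of each node.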
In this last series of experiments, we evaluate the effect of the choice of
inter-clique topology on the convergence speed for a larger network of 1000
nodes. We compare the scalability and convergence speed of several D-Cliques
variants, which all use $O(nc)$ edges to create the cliques as a starting
point.
The inter-clique topology with the (almost\footnote{A path uses one edge
less, but converges significantly slower, so we do not consider it.}) fewest
edges is a \textit{ring}, which uses $\frac{n}{c}$ inter-clique edges and
therefore scales linearly in $O(n)$.
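
As a simple illustration, with \texttt{m} denoting the number of cliques
$\frac{n}{c}$:

\begin{verbatim}
def ring(m):
    # Connect clique i to clique (i + 1) mod m:
    # exactly m inter-clique edges in total.
    return [(i, (i + 1) % m) for i in range(m)]
\end{verbatim}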
Another topology scales linearly with a logarithmic bound on the average
shortest number of hops between two nodes: we call it \textit{fractal}. In
this hierarchical scheme, cliques are assembled in larger groups of $c$
cliques that are connected internally with one edge per pair of cliques, but
with only one edge between pairs of larger groups. The scheme is recursive
such that $c$ groups will themselves form a larger group at the next level
up. This results in at most $nc$ edges in total if edges are evenly
distributed, and therefore also scales linearly in the number of nodes.
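
The following Python sketch illustrates the recursion; connecting groups
through a single representative clique is a simplification of the
evenly-distributed edge assignment described above.

\begin{verbatim}
def fractal(num_cliques, c):
    # Level 0: each group is a single clique.
    edges, level = [], [[i] for i in range(num_cliques)]
    while len(level) > 1:
        merged = []
        for g in range(0, len(level), c):
            group = level[g:g + c]
            # One edge per pair of subgroups within a group
            # (here, between representative cliques).
            for a in range(len(group)):
                for b in range(a + 1, len(group)):
                    edges.append((group[a][0], group[b][0]))
            merged.append([x for sub in group for x in sub])
        level = merged
    return edges
\end{verbatim}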
Finally, we propose to connect cliques according to a
small-world-like~\cite{watts2000small} topology applied on top of a
ring~\cite{stoica2003chord}. In this scheme, cliques are first arranged in a
ring. Then each clique adds symmetric edges, both clockwise and
counter-clockwise on the ring, with the $n_s$ closest cliques in sets of
cliques that are exponentially bigger the further they are on the ring (see
Algorithm~\ref{Algorithm:Smallworld} in the appendix for details on the
construction). This ensures good clustering with cliques that are close on
the ring, while keeping the average shortest path small. This scheme uses
$2 n_s \log(\frac{n}{c})$ inter-clique edges per clique, and the total
number of edges therefore grows in the order of $O(n \log n)$ with the
number of nodes.
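
A possible construction is sketched below; the exact procedure is given in
Algorithm~\ref{Algorithm:Smallworld} in the appendix, and the window
handling here is a simplification.

\begin{verbatim}
import random

def smallworld(m, ns, seed=1):
    # Base ring over the m cliques, plus, in both directions,
    # ns edges into each exponentially growing window of ring
    # distances [d, 2d) for d = 2, 4, 8, ...
    rng = random.Random(seed)
    edges = {(i, (i + 1) % m) for i in range(m)}
    for i in range(m):
        d = 2
        while d < m // 2:
            window = list(range(d, min(2 * d, m // 2)))
            for sign in (1, -1):
                for off in rng.sample(window, min(ns, len(window))):
                    j = (i + sign * off) % m
                    edges.add((min(i, j), max(i, j)))
            d *= 2
    return edges
\end{verbatim}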
Figure~\ref{fig:d-cliques-cifar10-convolutional} shows the convergence speed
of all schemes on MNIST and CIFAR10, compared to the ideal baseline of a
single IID node performing the same number of updates per epoch (i.e., the
fastest convergence speed achievable if the topology had no impact). The
ring topology converges but is much slower, while our fractal scheme helps
significantly. The sweet spot appears to be the small-world topology: its
convergence speed is almost the same as with a fully-connected inter-clique
topology, but it uses 22\% fewer edges (14.5 edges per node on average
instead of 18.9), and we can expect even bigger gains at larger scales.
Nonetheless, even the fully-connected inter-clique topology offers
significant benefits with 1000 nodes, as it represents a 98\% reduction in
the number of edges compared to fully connecting individual nodes (18.9
edges per node on average instead of 999) and a 96\% reduction in the number
of messages (37.8 messages per round per node on average instead of 999).
Overall, these results show that D-Cliques can scale nicely with the number
of nodes.
\begin{figure}[htbp]
\centering
% To regenerate the figure, from directory results/mnist
...
However, for IID data, practice contradicts these classic
results: fully decentralized algorithms converge essentially as fast
on sparse topologies like rings or grids as they do on a fully connected
network~\cite{lian2017d-psgd,Lian2018}. Recent work
\cite{neglia2020,consensus_distance} sheds light on this phenomenon with
refined convergence analyses based on differences between gradients or
parameters across nodes, which are typically smaller in the IID case.
However, these results do not give any clear insight regarding the role of
the topology in the non-IID case. We note that some work