\subsection{Implementing Momentum with Clique Averaging}
\label{section:momentum}
Efficiently training higher-capacity models, such as a deep convolutional network, on harder datasets, such as CIFAR10, usually requires additional optimization techniques. We show here how Clique Averaging (Section~\ref{section:clique-averaging}) easily enables, in the presence of local class bias, the implementation of optimization techniques that otherwise require IID mini-batches.
In particular, momentum~\cite{pmlr-v28-sutskever13} increases the magnitude of the components of the gradient that are shared between several consecutive steps, and is critical for deep convolutional networks like LeNet~\cite{lecun1998gradient,quagmire} to converge quickly. However, a direct application of momentum in a non-IID setting can actually be very detrimental. As illustrated in Figure~\ref{fig:d-cliques-cifar10-momentum-non-iid-effect} for the case of LeNet on CIFAR10 with 100 nodes, D-Cliques with momentum fails to converge. Not using momentum gives faster convergence, but there remains a significant gap compared to a single IID node using momentum.
\caption{\label{fig:cifar10-momentum} Non-IID Effect of Momentum on CIFAR10 with LeNet}
\end{figure}
Using Clique Averaging (Section~\ref{section:clique-averaging}), an unbiased momentum can be computed from the unbiased average gradient $g_i^{(k)}$ of Algorithm~\ref{Algorithm:Clique-Unbiased-D-PSGD}:
\begin{equation}
v_i^{(k)}\leftarrow m v_i^{(k-1)} + g_i^{(k)}
\end{equation}
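For illustration, a minimal sketch of the resulting update at node $i$ is given below (in Python with NumPy; the helper names and the values of the momentum factor and learning rate are illustrative, not those used in our experiments). The gradient step then simply uses $v_i^{(k)}$ in place of $g_i^{(k)}$:
\begin{verbatim}
import numpy as np

# Sketch of one step at node i with Clique Averaging and momentum.
# `clique_grads` holds the local gradients received from every node
# of i's clique (including i's own); m and gamma are illustrative
# values for the momentum factor and learning rate.
def momentum_step(x_i, v_i, clique_grads, m=0.9, gamma=0.002):
    g_i = np.mean(clique_grads, axis=0)  # unbiased average gradient
    v_i = m * v_i + g_i                  # momentum update above
    x_i = x_i - gamma * v_i              # step uses v_i, not g_i
    return x_i, v_i
\end{verbatim}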
...
...
As shown in Figure~\ref{fig:d-cliques-cifar10-momentum-non-iid-clique-avg-effect}, this simple modification restores the benefits of momentum and closes the gap with the centralized setting, at the cost of a slightly lower convergence speed in the first 20 epochs.
\section{Comparative Evaluation and Extensions}
\label{section:non-clustered}
...
...
We compare D-Cliques against alternative topologies to demonstrate its advantages.
First, we show that similar results are not necessarily obtained with a comparable number of edges chosen at random. We therefore compare D-Cliques, with and without Clique Averaging, to a random topology on 100 nodes chosen such that each node has exactly 10 edges, which is slightly higher than the average of 9.9 edges in the previous D-Cliques example (Fig.~\ref{fig:d-cliques-figure}). To better understand the effect of clustering, we also compare to a similar random topology where edges are chosen such that each node has neighbors of all possible classes, but without these neighbors forming a clique. Finally, we compare with an analog of Clique Averaging, in which all nodes de-bias their gradient with that of their neighbors. In the latter case, since nodes do not form cliques, no two nodes actually compute the same average gradient.
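For illustration, such a random 10-regular topology may be sampled as follows (a sketch using the networkx library; the exact procedure used in our experiments may differ):
\begin{verbatim}
import networkx as nx

# Random baseline topology: 100 nodes, each with exactly 10 edges
# chosen at random, with no clique structure enforced.
G = nx.random_regular_graph(d=10, n=100, seed=1)
assert all(deg == 10 for _, deg in G.degree)  # 500 edges in total
\end{verbatim}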
Results for MNIST and CIFAR10 are shown in Figure~\ref{fig:d-cliques-comparison-to-non-clustered-topologies}. For MNIST, a random topology has higher variance and lower convergence speed than D-Cliques, with or without Clique Averaging. However, a random topology with enforced diversity performs as well as, and even slightly better than, D-Cliques without Clique Averaging. Surprisingly, a random topology with unbiased gradients performs marginally worse than without, but the difference does not appear significant. Nonetheless, D-Cliques with Clique Averaging outperforms every random topology, suggesting that clustering has a small but significant effect in this case.
...
...
\section{Conclusion}
\label{section:conclusion}
We have proposed D-Cliques, a sparse topology that recovers the convergence speed and non-IID compensating behavior of a fully-connected topology in the presence of local class bias. D-Cliques are based on assembling cliques of diverse nodes such that their joint local distribution is representative of the global distribution, essentially recovering IID-ness locally. Cliques are joined in a sparse inter-clique topology such that they quickly converge to the same model. Within cliques, Clique Averaging can be used to remove the non-IID bias in gradient computation by averaging gradients only with the other nodes of the clique. Clique Averaging can in turn be used to implement unbiased momentum and recover the convergence speed usually only possible with IID mini-batches. We have shown that the clustering of D-Cliques and the full connectivity within cliques are critical in obtaining these results. Finally, we have evaluated different inter-clique topologies with 1000 nodes. While they all provide a significant reduction in the number of edges compared to fully connecting all nodes, a small-world approach that scales as $O(n + \log(n))$ in the number of nodes seems to be the most advantageous compromise between scalability and convergence speed. The D-Cliques approach therefore seems promising both to reduce bandwidth usage on FL servers and to implement fully decentralized alternatives in a wider range of applications. For instance, the presence and relative frequency of global classes could be computed using PushSum~\cite{kempe2003gossip}, and neighbors could be selected with PeerSampling~\cite{jelasity2007gossip}. This is part of our future work.
...
...
\caption{$\textit{smallworld}(DC)$: adds $O(\# N + \log(\# N))$ edges}
\label{Algorithm:Smallworld}
\begin{algorithmic}[1]
\State\textbf{Require} Set of cliques $DC$ (set of set of nodes), size of neighborhood $ns$ (default 2), function $\textit{least\_edges}(S, E)$ that returns one of the nodes in $S$ with the least number of edges in $E$
\State$E \leftarrow\emptyset$\Comment{Set of Edges}
\State$L \leftarrow[ C~\text{for}~C \in DC ]$\Comment{Arrange cliques in a list}
\For{$i \in\{1,\dots,\#DC\}$}\Comment{For every clique}