% In the rest of this paper, we assume these services are available and show that the approach provides a useful convergence speed after the cliques have been formed.
\section{Optimizing with Clique Averaging and Momentum}
\label{section:clique-averaging-momentum}
In this section we present Clique Averaging, a simple modification of D-SGD
which removes the bias caused by the inter-clique edges of
D-Cliques, and show how this can be used to successfully implement momentum
for non-IID data.
\subsection{Clique Averaging: Debiasing Gradients from Inter-Clique Edges}
\label{section:clique-averaging}
While limiting the number of inter-clique connections reduces the
amount of data traveling on the network, it can also introduce some bias.
Figure~\ref{fig:connected-cliques-bias} illustrates the problem with the
simplest case of two cliques connected by one inter-clique edge (here,
between the green node of the left clique and the purple node of the right
clique). Let us focus on node A. With Metropolis-Hastings weights, node A's
self-weight is $\frac{12}{110}$, the weight between A and the green node
connected to B is $\frac{10}{110}$, and all other neighbors of A have a
weight of $\frac{11}{110}$. Therefore, the gradient at A is biased towards
its own class (purple) and against the green class. The same holds for all
other nodes without inter-clique edges with respect to their respective
classes. For node B, all its edge weights (including its self-weight) are
equal to $\frac{1}{11}$. However, the green class is represented twice (once
as a clique
neighbor and once from the inter-clique edge), while all other classes are
represented only once. This biases the gradient toward the green class. The
combined effect of these two sources of bias is to increase the variance
of the local models across nodes.
\caption{\label{fig:connected-cliques-bias} Illustrating the bias induced by
inter-clique connections (see main text).}
\end{figure}
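To make these numbers concrete, the following Python sketch (an illustration
only, assuming the common Metropolis-Hastings rule
$W_{ij} = \frac{1}{1+\max(\text{deg}(i),\,\text{deg}(j))}$ for $j \neq i$,
with the self-weight absorbing the remainder of each row) reproduces the
weights discussed above:
\begin{verbatim}
# Sketch: Metropolis-Hastings weights for two 10-node cliques joined
# by a single inter-clique edge between nodes 9 and 10.
from fractions import Fraction
from itertools import combinations

n = 10
nodes = range(2 * n)
edges = set(combinations(range(n), 2))          # left clique: nodes 0-9
edges |= set(combinations(range(n, 2 * n), 2))  # right clique: nodes 10-19
edges.add((n - 1, n))                           # inter-clique edge: 9 -- 10

def neighbors(i):
    return [b if a == i else a for (a, b) in edges if i in (a, b)]

deg = {i: len(neighbors(i)) for i in nodes}

def mh_row(i):
    row = {j: Fraction(1, 1 + max(deg[i], deg[j])) for j in neighbors(i)}
    row[i] = 1 - sum(row.values())  # self-weight takes the remaining mass
    return row

# Node A (a left-clique node without the inter-clique edge, e.g. node 0):
# self-weight 12/110, weight 10/110 towards node 9, 11/110 towards the
# rest (Fraction prints these in lowest terms: 6/55, 1/11, 1/10).
print(mh_row(0)[0], mh_row(0)[n - 1], mh_row(0)[1])
# Node B (node 9, carrying the inter-clique edge): all weights equal 1/11.
print(set(mh_row(n - 1).values()))
\end{verbatim}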
We address this problem by adding \emph{Clique Averaging} to D-PSGD
(Algorithm~\ref{Algorithm:Clique-Unbiased-D-PSGD}), which essentially
decouples gradient averaging from model averaging. Only the gradients of
neighbors within the same clique are used to compute the average gradient,
providing an equal representation to all classes. In contrast, all neighbors'
models, including those across inter-clique edges, participate in the model
averaging step as in the original version.
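For concreteness, the following Python sketch simulates one synchronous round
of this procedure; all names (\texttt{models}, \texttt{clique\_of},
\texttt{neighbors\_of}, \texttt{grad}) are illustrative assumptions, and an
actual deployment would exchange the gradients and intermediate models over
the network rather than reading shared state:
\begin{verbatim}
# One synchronous round of D-PSGD with Clique Averaging (illustrative
# sketch; gradients and intermediate models travel in two separate
# rounds of messages in the real protocol).
def clique_averaging_round(models, W, clique_of, neighbors_of, grad, lr):
    # Round 1: gradients are averaged over intra-clique neighbors only,
    # so that every class of the clique is equally represented.
    g = {i: grad(i, models[i]) for i in models}
    x_half = {}
    for i in models:
        members = clique_of[i] | {i}
        g_avg = sum(g[j] for j in members) / len(members)
        x_half[i] = models[i] - lr * g_avg  # step on debiased gradient
    # Round 2: models are averaged over ALL neighbors, including those
    # across inter-clique edges, exactly as in vanilla D-PSGD.
    return {i: sum(W[i][j] * x_half[j] for j in neighbors_of[i] | {i})
            for i in models}
\end{verbatim}
Separating the two rounds lets the gradient average stay balanced across
classes while model averaging still exploits every edge of the topology.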
\begin{algorithm}[t]
\caption{D-PSGD with Clique Averaging, Node $i$}
\label{Algorithm:Clique-Unbiased-D-PSGD}
\begin{algorithmic}[1]
...
...
\caption{\label{fig:d-clique-mnist-clique-avg} Effect of Clique Averaging on MNIST. Y-axis starts at 89.}
\end{figure}
As illustrated in Figure~\ref{fig:d-clique-mnist-clique-avg}, this
significantly reduces the variance of models across nodes and accelerates
convergence to match the speed obtained with a fully-connected topology.
There is a small additional cost, as gradients and models must be sent in
two separate rounds of messages, which doubles the message cost and adds
latency to each training step. Nonetheless, compared to fully connecting all
nodes, the total number of messages is reduced by $\approx80\%$.
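As an illustration of where this figure can come from, assume for instance
100 nodes grouped into 10 cliques of 10 nodes, with one inter-clique edge
between every pair of cliques. Each edge then carries two messages per round,
so a training step with Clique Averaging costs
$4 \times \big(10\binom{10}{2} + \binom{10}{2}\big) = 4 \times 495 = 1980$
messages, against $2\binom{100}{2} = 9900$ for a fully-connected topology,
i.e. a reduction of $1 - \frac{1980}{9900} = 80\%$.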
\subsection{Implementing Momentum with Clique Averaging}