Commit dc012f43 authored by aurelien.bellet
sec 4

parent 29319c39
topology, namely:
\begin{equation}
W_{ij} = \begin{cases}
\frac{1}{\max(\text{deg}(i), \text{deg}(j)) + 1} & \text{if}~i \neq j~\text{and}~\{i,j\} \in E, \\
1 - \sum_{j \neq i} W_{ij} & \text{if}~i = j, \\
0 & \text{otherwise}.
\end{cases}
\label{eq:metro}
\end{equation}
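The weight rule of \eqref{eq:metro} is straightforward to compute from node degrees. The following illustrative Python sketch (function and variable names are ours, not from the text) builds the weights for two 10-node cliques joined by a single inter-clique edge, the configuration analyzed below:

```python
def metropolis_hastings_weights(neighbors):
    """For each edge {i, j}, set W_ij = 1 / (max(deg(i), deg(j)) + 1);
    the self-weight W_ii is 1 minus the sum of i's edge weights."""
    deg = {i: len(ns) for i, ns in neighbors.items()}
    W = {}
    for i, ns in neighbors.items():
        for j in ns:
            W[(i, j)] = 1.0 / (max(deg[i], deg[j]) + 1)
        W[(i, i)] = 1.0 - sum(W[(i, j)] for j in ns)
    return W

# Two 10-node cliques {0..9} and {10..19}, one inter-clique edge between 9 and 10
neighbors = {i: {j for j in range(10) if j != i} for i in range(10)}
neighbors.update({i: {j for j in range(10, 20) if j != i} for i in range(10, 20)})
neighbors[9].add(10)
neighbors[10].add(9)
W = metropolis_hastings_weights(neighbors)
# Node 0 plays the role of node A: self-weight 12/110, weight 10/110 toward
# the node carrying the inter-clique edge, 11/110 toward every other neighbor.
```

With these weights, node 9 (the role of node B) has all its weights, including its self-weight, equal to $\frac{1}{11}$, matching the discussion below.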
for non-IID data.
\label{section:clique-averaging}
While limiting the number of inter-clique connections reduces the
amount of messages traveling on the network, it also introduces its own
bias.
Figure~\ref{fig:connected-cliques-bias} illustrates the problem on the
simple case of two cliques connected by one inter-clique edge (here,
between the green node of the left clique and the purple node of the right
clique). Let us focus on node A. With weights computed as in \eqref{eq:metro},
node A's self-weight is $\frac{12}
{110}$, the weight between A and the green node connected to B is
$\frac{10}{110}$, and
all other neighbors of A have a weight of $\frac{11}{110}$. Therefore, the
gradient at A is biased towards its own class (purple) and against the green
class. A similar bias holds for all other nodes
without inter-clique edges with respect to their respective classes. For node
B, all its edge weights (including its self-weight) are equal to $\frac{1}
{11}$. However, the green class is represented twice (once as a clique
inter-clique connections (see main text).}
We address this problem by adding \emph{Clique Averaging} to D-SGD
(Algorithm~\ref{Algorithm:Clique-Unbiased-D-PSGD}), which essentially
decouples gradient averaging from model averaging. The idea is to use only the
gradients of neighbors within the same clique to compute the average gradient,
providing an equal representation to all classes. In contrast, all neighbors'
models, including those across inter-clique edges, participate in the model
averaging step as in the original version.
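The decoupling can be sketched in a few lines. The Python fragment below is illustrative only (the names and the uniform intra-clique average are our assumptions, not the exact pseudocode of Algorithm~\ref{Algorithm:Clique-Unbiased-D-PSGD}):

```python
import numpy as np

def clique_averaged_step(x_i, g_i, clique_grads, lr):
    """Local half-step: average the local gradient with intra-clique
    neighbors' gradients only, so all classes are equally represented."""
    g_hat = np.mean([g_i] + clique_grads, axis=0)
    return x_i - lr * g_hat

def model_average(half_models, weights):
    """Model averaging uses ALL neighbors (including those across
    inter-clique edges) plus the node itself; weights sum to 1."""
    return sum(w * x for w, x in zip(weights, half_models))
```

For example, with a local gradient of 2 and one clique neighbor's gradient of 4, the averaged gradient is 3, so a step from $x_i = 1$ with learning rate $0.1$ yields $0.7$ before model averaging.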
As illustrated in Figure~\ref{fig:d-clique-mnist-clique-avg}, this
significantly reduces the variance of models across nodes and accelerates
convergence to reach the same level as the one obtained with a
fully-connected topology. Note that Clique Averaging induces a small
additional cost, as gradients
and models need to be sent in two separate rounds of messages. Nonetheless, compared to fully connecting all nodes, the total number of messages is reduced by $\approx 80\%$.
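The $\approx 80\%$ figure can be checked with back-of-the-envelope arithmetic. The sketch below assumes 100 nodes in 10 pairwise-connected cliques of 10 (our assumed configuration), with gradients exchanged only inside cliques and models exchanged over every edge:

```python
n_cliques, clique_size = 10, 10
n = n_cliques * clique_size

intra_edges = n_cliques * clique_size * (clique_size - 1) // 2  # 450 edges inside cliques
inter_edges = n_cliques * (n_cliques - 1) // 2                  # 45: one edge per clique pair

# Messages per iteration, counting one message per direction per edge.
fully_connected = n * (n - 1)                  # every node sends to every other node
grad_round = 2 * intra_edges                   # gradients stay within cliques
model_round = 2 * (intra_edges + inter_edges)  # models travel over all edges
reduction = 1 - (grad_round + model_round) / fully_connected  # about 0.81
```

Under these assumptions, 1890 messages per iteration replace 9900, a reduction of roughly 81%, consistent with the figure above.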
\subsection{Implementing Momentum with Clique Averaging}
\label{section:momentum}
Efficiently training high-capacity models usually requires additional
optimization techniques. In particular, momentum~\cite{pmlr-v28-sutskever13}
increases the magnitude of the components of the gradient that are shared
between several consecutive steps, and is critical for deep convolutional networks like
LeNet~\cite{lecun1998gradient,quagmire} to converge quickly. However, a direct
application of momentum in a non-IID setting can actually be very detrimental.
As illustrated in Figure~\ref{fig:d-cliques-cifar10-momentum-non-iid-effect}
for the case of LeNet on CIFAR10 with 100 nodes, D-Cliques with momentum
even fails to converge. Not using momentum actually gives faster
convergence, but there is a significant gap compared to the case of a single
IID node with momentum.
It then suffices to modify the original gradient step to use momentum:
\begin{equation}
x_i^{(k-\frac{1}{2})} \leftarrow x_i^{(k-1)} - \gamma v_i^{(k)}
\end{equation}
As shown in
Figure~\ref{fig:d-cliques-cifar10-momentum-non-iid-clique-avg-effect}, this
simple modification restores the benefits of momentum and closes the gap
with the centralized setting.
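The modified update can be sketched as follows. This is a minimal illustrative fragment (variable names and default hyperparameters are ours): the velocity $v_i$ starts at zero, and the gradient passed in is the clique-averaged one, so momentum accumulates an unbiased direction:

```python
import numpy as np

def momentum_step(x, v, clique_avg_grad, lr=0.1, m=0.9):
    """Momentum over the clique-averaged gradient:
    v <- m * v + g_hat, then x <- x - lr * v, as in the step above."""
    v = m * v + clique_avg_grad
    return x - lr * v, v
```

For instance, two successive steps with a constant clique-averaged gradient of 2 move $x$ from 1 to 0.8 and then to 0.42, as the velocity grows from 2 to 3.8.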