Commit dc012f43 authored by aurelien.bellet
sec 4

parent 29319c39
topology, namely:
\begin{equation}
W_{ij} = \begin{cases}
\frac{1}{\max(\text{deg}(i), \text{deg}(j)) + 1} & \text{if}~i \neq j~\text{and}~\{i,j\} \in E, \\
1 - \sum_{j \neq i} W_{ij} & \text{if}~i = j, \\
0 & \text{otherwise}.
\end{cases}
\label{eq:metro}
\end{equation}
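The weight rule of \eqref{eq:metro} is straightforward to compute from node degrees. The following illustrative Python sketch (function and variable names are ours, not from the text) builds the weights for two 10-node cliques joined by a single inter-clique edge, the configuration analyzed below:

```python
def metropolis_hastings_weights(neighbors):
    """For each edge {i, j}, set W_ij = 1 / (max(deg(i), deg(j)) + 1);
    the self-weight W_ii is 1 minus the sum of i's edge weights."""
    deg = {i: len(ns) for i, ns in neighbors.items()}
    W = {}
    for i, ns in neighbors.items():
        for j in ns:
            W[(i, j)] = 1.0 / (max(deg[i], deg[j]) + 1)
        W[(i, i)] = 1.0 - sum(W[(i, j)] for j in ns)
    return W

# Two 10-node cliques {0..9} and {10..19}, one inter-clique edge between 9 and 10
neighbors = {i: {j for j in range(10) if j != i} for i in range(10)}
neighbors.update({i: {j for j in range(10, 20) if j != i} for i in range(10, 20)})
neighbors[9].add(10)
neighbors[10].add(9)
W = metropolis_hastings_weights(neighbors)
# Node 0 plays the role of node A: self-weight 12/110, weight 10/110 toward
# the node carrying the inter-clique edge, 11/110 toward every other neighbor.
```

With these weights, node 9 (the role of node B) has all its weights, including its self-weight, equal to $\frac{1}{11}$, matching the discussion below.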
for non-IID data.
\label{section:clique-averaging}
While limiting the number of inter-clique connections reduces the
amount of messages traveling on the network, it also introduces its own
bias.
Figure~\ref{fig:connected-cliques-bias} illustrates the problem on the
simple case of two cliques connected by one inter-clique edge (here,
between the green node of the left clique and the purple node of the right
clique). Let us focus on node A. With weights computed as in \eqref{eq:metro},
node A's self-weight is $\frac{12}
{110}$, the weight between A and the green node connected to B is
$\frac{10}{110}$, and
all other neighbors of A have a weight of $\frac{11}{110}$. Therefore, the
gradient at A is biased towards its own class (purple) and against the green
class. A similar bias holds for all other nodes
without inter-clique edges with respect to their respective classes. For node
B, all its edge weights (including its self-weight) are equal to $\frac{1}
{11}$. However, the green class is represented twice (once as a clique
inter-clique connections (see main text).}
We address this problem by adding \emph{Clique Averaging} to D-SGD
(Algorithm~\ref{Algorithm:Clique-Unbiased-D-PSGD}), which essentially
decouples gradient averaging from model averaging. The idea is to use only the
gradients of neighbors within the same clique to compute the average gradient,
providing an equal representation to all classes. In contrast, all neighbors'
models, including those across inter-clique edges, participate in the model
averaging step as in the original version.
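The decoupling can be sketched in a few lines. The Python fragment below is illustrative only (the names and the uniform intra-clique average are our assumptions, not the exact pseudocode of Algorithm~\ref{Algorithm:Clique-Unbiased-D-PSGD}):

```python
import numpy as np

def clique_averaged_step(x_i, g_i, clique_grads, lr):
    """Local half-step: average the local gradient with intra-clique
    neighbors' gradients only, so all classes are equally represented."""
    g_hat = np.mean([g_i] + clique_grads, axis=0)
    return x_i - lr * g_hat

def model_average(half_models, weights):
    """Model averaging uses ALL neighbors (including those across
    inter-clique edges) plus the node itself; weights sum to 1."""
    return sum(w * x for w, x in zip(weights, half_models))
```

For example, with a local gradient of 2 and one clique neighbor's gradient of 4, the averaged gradient is 3, so a step from $x_i = 1$ with learning rate $0.1$ yields $0.7$ before model averaging.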
As illustrated in Figure~\ref{fig:d-clique-mnist-clique-avg}, this
significantly reduces the variance of models across nodes and accelerates
convergence to reach the same level as the one obtained with a
fully-connected topology. Note that Clique Averaging induces a small
additional cost, as gradients
and models need to be sent in two separate rounds of messages. Nonetheless, compared to fully connecting all nodes, the total number of messages is reduced by $\approx 80\%$.
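The $\approx 80\%$ figure can be checked with back-of-the-envelope arithmetic. The sketch below assumes 100 nodes in 10 pairwise-connected cliques of 10 (our assumed configuration), with gradients exchanged only inside cliques and models exchanged over every edge:

```python
n_cliques, clique_size = 10, 10
n = n_cliques * clique_size

intra_edges = n_cliques * clique_size * (clique_size - 1) // 2  # 450 edges inside cliques
inter_edges = n_cliques * (n_cliques - 1) // 2                  # 45: one edge per clique pair

# Messages per iteration, counting one message per direction per edge.
fully_connected = n * (n - 1)                  # every node sends to every other node
grad_round = 2 * intra_edges                   # gradients stay within cliques
model_round = 2 * (intra_edges + inter_edges)  # models travel over all edges
reduction = 1 - (grad_round + model_round) / fully_connected  # about 0.81
```

Under these assumptions, 1890 messages per iteration replace 9900, a reduction of roughly 81%, consistent with the figure above.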
\subsection{Implementing Momentum with Clique Averaging}
\label{section:momentum}
Efficiently training high-capacity models usually requires additional
optimization techniques. In particular, momentum~\cite{pmlr-v28-sutskever13}
increases the magnitude of the components of the gradient that are shared
between several consecutive steps, and is critical for deep convolutional networks like
LeNet~\cite{lecun1998gradient,quagmire} to converge quickly. However, a direct
application of momentum in a non-IID setting can actually be very detrimental.
As illustrated in Figure~\ref{fig:d-cliques-cifar10-momentum-non-iid-effect}
for the case of LeNet on CIFAR10 with 100 nodes, D-Cliques with momentum
even fails to converge. Not using momentum actually gives faster
convergence, but there is a significant gap compared to the case of a single
IID node with momentum.
It then suffices to modify the original gradient step to use momentum:
\begin{equation}
x_i^{(k-\frac{1}{2})} \leftarrow x_i^{(k-1)} - \gamma v_i^{(k)}
\end{equation}
As shown in
Figure~\ref{fig:d-cliques-cifar10-momentum-non-iid-clique-avg-effect}, this
simple modification restores the benefits of momentum and closes the gap
with the centralized setting.
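The modified update can be sketched as follows. This is a minimal illustrative fragment (variable names and default hyperparameters are ours): the velocity $v_i$ starts at zero, and the gradient passed in is the clique-averaged one, so momentum accumulates an unbiased direction:

```python
import numpy as np

def momentum_step(x, v, clique_avg_grad, lr=0.1, m=0.9):
    """Momentum over the clique-averaged gradient:
    v <- m * v + g_hat, then x <- x - lr * v, as in the step above."""
    v = m * v + clique_avg_grad
    return x - lr * v, v
```

For instance, two successive steps with a constant clique-averaged gradient of 2 move $x$ from 1 to 0.8 and then to 0.42, as the velocity grows from 2 to 3.8.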