% In the rest of this paper, we assume these services are available and show that the approach provides a useful convergence speed after the cliques have been formed.
\section{Optimizing with Clique Averaging and Momentum}
\label{section:clique-averaging-momentum}
In this section we present Clique Averaging, a simple modification of D-SGD
which removes the bias caused by the inter-clique edges of
D-Cliques, and show how this can be used to successfully implement momentum
for non-IID data.
\subsection{Clique Averaging: Debiasing Gradients from Inter-Clique Edges}
\label{section:clique-averaging}
While limiting the number of inter-clique connections reduces the
amount of data traveling on the network, it can also introduce some bias.
Figure~\ref{fig:connected-cliques-bias} illustrates the problem with the
simplest case of two cliques connected by one inter-clique edge (here,
between the green node of the left clique and the purple node of the right
clique). Let us focus on node A. With Metropolis-Hastings weights, node A's
self-weight is $\frac{12}{110}$, the weight between A and the green node
connected to B is $\frac{10}{110}$, and all other neighbors of A have a
weight of $\frac{11}{110}$. Therefore, the gradient at A is biased towards
its own class (purple) and against the green class. The same holds for all
other nodes without inter-clique edges with respect to their respective
classes. For node B, all its edge weights (including its self-weight) are
equal to $\frac{1}{11}$. However, the green class is represented twice (once
as a clique
neighbor and once from the inter-clique edge), while all other classes are
represented only once. This biases the gradient toward the green class. The
combined effect of these two sources of bias is to increase the variance
of the local models across nodes.
\caption{\label{fig:connected-cliques-bias} Illustrating the bias induced by
inter-clique connections (see main text).}
\end{figure}
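To make these numbers concrete, the following Python sketch (an illustration
only, assuming the common Metropolis-Hastings rule
$W_{ij} = \frac{1}{1+\max(\text{deg}(i),\,\text{deg}(j))}$ for $j \neq i$,
with the self-weight absorbing the remainder of each row) reproduces the
weights discussed above:
\begin{verbatim}
# Sketch: Metropolis-Hastings weights for two 10-node cliques joined
# by a single inter-clique edge between nodes 9 and 10.
from fractions import Fraction
from itertools import combinations

n = 10
nodes = range(2 * n)
edges = set(combinations(range(n), 2))          # left clique: nodes 0-9
edges |= set(combinations(range(n, 2 * n), 2))  # right clique: nodes 10-19
edges.add((n - 1, n))                           # inter-clique edge: 9 -- 10

def neighbors(i):
    return [b if a == i else a for (a, b) in edges if i in (a, b)]

deg = {i: len(neighbors(i)) for i in nodes}

def mh_row(i):
    row = {j: Fraction(1, 1 + max(deg[i], deg[j])) for j in neighbors(i)}
    row[i] = 1 - sum(row.values())  # self-weight takes the remaining mass
    return row

# Node A (a left-clique node without the inter-clique edge, e.g. node 0):
# self-weight 12/110, weight 10/110 towards node 9, 11/110 towards the
# rest (Fraction prints these in lowest terms: 6/55, 1/11, 1/10).
print(mh_row(0)[0], mh_row(0)[n - 1], mh_row(0)[1])
# Node B (node 9, carrying the inter-clique edge): all weights equal 1/11.
print(set(mh_row(n - 1).values()))
\end{verbatim}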
We address this problem by adding \emph{Clique Averaging} to D-PSGD
(Algorithm~\ref{Algorithm:Clique-Unbiased-D-PSGD}), which essentially
decouples gradient averaging from model averaging. Only the gradients of
neighbors within the same clique are used to compute the average gradient,
providing an equal representation to all classes. In contrast, all neighbors'
models, including those across inter-clique edges, participate in the model
averaging step as in the original version.
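For concreteness, the following Python sketch simulates one synchronous round
of this procedure; all names (\texttt{models}, \texttt{clique\_of},
\texttt{neighbors\_of}, \texttt{grad}) are illustrative assumptions, and an
actual deployment would exchange the gradients and intermediate models over
the network rather than reading shared state:
\begin{verbatim}
# One synchronous round of D-PSGD with Clique Averaging (illustrative
# sketch; gradients and intermediate models travel in two separate
# rounds of messages in the real protocol).
def clique_averaging_round(models, W, clique_of, neighbors_of, grad, lr):
    # Round 1: gradients are averaged over intra-clique neighbors only,
    # so that every class of the clique is equally represented.
    g = {i: grad(i, models[i]) for i in models}
    x_half = {}
    for i in models:
        members = clique_of[i] | {i}
        g_avg = sum(g[j] for j in members) / len(members)
        x_half[i] = models[i] - lr * g_avg  # step on debiased gradient
    # Round 2: models are averaged over ALL neighbors, including those
    # across inter-clique edges, exactly as in vanilla D-PSGD.
    return {i: sum(W[i][j] * x_half[j] for j in neighbors_of[i] | {i})
            for i in models}
\end{verbatim}
Separating the two rounds lets the gradient average stay balanced across
classes while model averaging still exploits every edge of the topology.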
\begin{algorithm}[t]
\caption{D-PSGD with Clique Averaging, Node $i$}
\label{Algorithm:Clique-Unbiased-D-PSGD}
\begin{algorithmic}[1]
...
...
\caption{\label{fig:d-clique-mnist-clique-avg} Effect of Clique Averaging on MNIST. Y-axis starts at 89.}
\end{figure}
As illustrated in Figure~\ref{fig:d-clique-mnist-clique-avg}, this
significantly reduces the variance of models across nodes and accelerates
convergence to match the speed obtained with a fully-connected topology.
There is a small additional cost, as gradients and models must be sent in
two separate rounds of messages, which doubles the message cost and adds
latency to each training step. Nonetheless, compared to fully connecting all
nodes, the total number of messages is reduced by $\approx80\%$.
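As an illustration of where this figure can come from, assume for instance
100 nodes grouped into 10 cliques of 10 nodes, with one inter-clique edge
between every pair of cliques. Each edge then carries two messages per round,
so a training step with Clique Averaging costs
$4 \times \big(10\binom{10}{2} + \binom{10}{2}\big) = 4 \times 495 = 1980$
messages, against $2\binom{100}{2} = 9900$ for a fully-connected topology,
i.e. a reduction of $1 - \frac{1980}{9900} = 80\%$.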
\subsection{Implementing Momentum with Clique Averaging}