Commit 29319c39 authored by aurelien.bellet

sec 3

parent 64ffffd5
......@@ -408,16 +408,19 @@ mini-batch size, both approaches are equivalent. %ensure a single
In this section, we present the design of D-Cliques. To give an intuition of our approach, let us consider the neighborhood of a single node in a grid similar to that of Figure~\ref{fig:grid-IID-vs-non-IID}, represented on Figure~\ref{fig:grid-iid-vs-non-iid-neighbourhood}.
% where each color represents a class of data.
The colors of a node represent the different classes it holds
locally. In the IID setting (Figure~\ref{fig:grid-iid-neighbourhood}), each
The colors of a node represent the different classes present in its local
dataset. In the IID setting (Figure~\ref{fig:grid-iid-neighbourhood}), each
node has examples of all classes in equal proportions. In the non-IID setting
(Figure~\ref{fig:grid-non-iid-neighbourhood}), each node has examples of only
a
single class and nodes are distributed randomly in the grid. A single training step, from the point of view of the center node, is equivalent to sampling a mini-batch five times larger from the union of the local distributions of all illustrated nodes.
single class and nodes are distributed randomly in the grid.
A single training step, from the point of view of the center node, is equivalent to sampling a mini-batch five times larger from the union of the local distributions of all illustrated nodes.
In the IID case, since gradients are computed from examples of all classes,
the resulting average gradient points in a direction that reduces the
loss across all classes. In contrast, in the non-IID case, only a subset of classes are
represented in the immediate neighborhood of the node and the gradients will
the resulting averaged gradient points in a direction that tends to reduce
the loss across all classes. In contrast, in the non-IID case, only a subset
of classes are
represented in the immediate neighborhood of the node, thus the gradients will
be biased towards these classes. % more than in the IID case.
Importantly, as the distributed averaging algorithm takes several steps to
converge, this variance persists across iterations as the locally computed
......@@ -442,7 +445,7 @@ impractical.
\includegraphics[width=\textwidth]{figures/grid-non-iid-neighbourhood}
\caption{\label{fig:grid-non-iid-neighbourhood} Non-IID}
\end{subfigure}
\caption{Neighborhood in an IID and non-IID Grid.}
\caption{Neighborhood in an IID and non-IID grid.}
\label{fig:grid-iid-vs-non-iid-neighbourhood}
\end{figure}
......@@ -463,9 +466,9 @@ edge with all other cliques, see Figure~\ref{fig:d-cliques-figure} for the
corresponding D-Cliques network in the case of $n=100$ nodes and $c=10$
classes. We will explore sparser inter-clique topologies in Section~\ref{section:interclique-topologies}.
The mixing matrix $W$ required by D-SGD is obtained from the above
topology using standard
Metropolis-Hastings weights~\cite{xiao2004fast}:
The mixing matrix $W$ required by D-SGD is obtained from standard
Metropolis-Hastings weights~\cite{xiao2004fast} computed from the above
topology, namely:
\begin{equation}
W_{ij} = \begin{cases}
\frac{1}{\max(\text{degree}(i), \text{degree}(j)) + 1} & \text{if}~i \neq
......@@ -519,15 +522,15 @@ speed on MNIST.}
\end{figure}
Figure~\ref{fig:d-cliques-example-convergence-speed} illustrates the
performance of D-Cliques on MNIST with $n=100$ nodes. The convergence speed is
performance of D-Cliques on MNIST with $n=100$ nodes. Observe that the
convergence speed is
very close
to that of a fully-connected topology, and significantly better than with
a ring or a grid (see Figure~\ref{fig:iid-vs-non-iid-problem}). With
100 nodes, it offers a reduction of $\approx90\%$ in the number of edges
compared to a fully-connected topology. Nonetheless, there is still
significant variance in the accuracy across nodes, which we address in
the next section by removing the bias introduced by inter-clique edges.
significant variance in the accuracy across nodes, which is due to the bias
introduced by inter-clique edges. We address this issue in the next section.
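
As a rough check of the $\approx90\%$ figure above, one can count edges directly. This is only a sketch under the assumption, not spelled out in this passage, that the $n=100$ nodes form $c=10$ fully-connected cliques of $10$ nodes each and that every pair of cliques shares exactly one edge. The symbols $E_{\text{intra}}$, $E_{\text{inter}}$ and $E_{\text{full}}$ are introduced here for illustration only.
\begin{equation*}
E_{\text{intra}} = 10 \cdot \binom{10}{2} = 450, \qquad
E_{\text{inter}} = \binom{10}{2} = 45, \qquad
E_{\text{full}} = \binom{100}{2} = 4950,
\end{equation*}
so that, under these assumptions,
\begin{equation*}
1 - \frac{E_{\text{intra}} + E_{\text{inter}}}{E_{\text{full}}} = 1 - \frac{495}{4950} = 0.9 .
\end{equation*}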
%The degree of \textit{skew} of local distributions $D_i$, i.e. how much the local distribution deviates from the global distribution on each node, influences the minimal size of cliques.
%
......@@ -539,7 +542,7 @@ the next section by removing the bias introduced by inter-clique edges.
\section{Optimizing with Clique Averaging and Momentum}
\label{section:clique-averaging-momentum}
In this section we present Clique Averaging, a simple modification of D-SGD
In this section, we present Clique Averaging, a simple modification of D-SGD
which removes the bias caused by the inter-clique edges of
D-Cliques, and show how this can be used to successfully implement momentum
for non-IID data.
......