Commit 29319c39 authored by aurelien.bellet

sec 3

parent 64ffffd5
@@ -408,16 +408,19 @@ mini-batch size, both approaches are equivalent. %ensure a single
In this section, we present the design of D-Cliques. To give an intuition of our approach, let us consider the neighborhood of a single node in a grid similar to that of Figure~\ref{fig:grid-IID-vs-non-IID}, represented in Figure~\ref{fig:grid-iid-vs-non-iid-neighbourhood}.
% where each color represents a class of data.
The colors of a node represent the different classes present in its local
dataset. In the IID setting (Figure~\ref{fig:grid-iid-neighbourhood}), each
node has examples of all classes in equal proportions. In the non-IID setting
(Figure~\ref{fig:grid-non-iid-neighbourhood}), each node has examples of only
a single class and nodes are distributed randomly in the grid.
A single training step, from the point of view of the center node, is equivalent to sampling a mini-batch five times larger from the union of the local distributions of all illustrated nodes.
In the IID case, since gradients are computed from examples of all classes,
the resulting averaged gradient points in a direction that tends to reduce
the loss across all classes. In contrast, in the non-IID case, only a subset
of classes is
represented in the immediate neighborhood of the node, so the gradients will
be biased towards these classes. % more than in the IID case.
Importantly, as the distributed averaging algorithm takes several steps to
converge, this variance persists across iterations as the locally computed
@@ -442,7 +445,7 @@ impractical.
\includegraphics[width=\textwidth]{figures/grid-non-iid-neighbourhood}
\caption{\label{fig:grid-non-iid-neighbourhood} Non-IID}
\end{subfigure}
\caption{Neighborhood in an IID and non-IID grid.}
\label{fig:grid-iid-vs-non-iid-neighbourhood}
\end{figure}
@@ -463,9 +466,9 @@ edge with all other cliques, see Figure~\ref{fig:d-cliques-figure} for the
corresponding D-Cliques network in the case of $n=100$ nodes and $c=10$
classes. We will explore sparser inter-clique topologies in Section~\ref{section:interclique-topologies}.
The mixing matrix $W$ required by D-SGD is obtained by computing standard
Metropolis-Hastings weights~\cite{xiao2004fast} on the above
topology, namely:
\begin{equation}
W_{ij} = \begin{cases}
\frac{1}{\max(\text{degree}(i), \text{degree}(j)) + 1} & \text{if}~i \neq
@@ -519,15 +522,15 @@ speed on MNIST.}
\end{figure}
Figure~\ref{fig:d-cliques-example-convergence-speed} illustrates the
performance of D-Cliques on MNIST with $n=100$ nodes. Observe that the
convergence speed is very close
to that of a fully-connected topology, and significantly better than with
a ring or a grid (see Figure~\ref{fig:iid-vs-non-iid-problem}). With
100 nodes, D-Cliques offers a reduction of $\approx90\%$ in the number of edges
compared to a fully-connected topology. Nonetheless, there is still
significant variance in the accuracy across nodes, which is due to the bias
introduced by inter-clique edges. We address this issue in the next section.
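As a sanity check on the $\approx90\%$ figure, the edge counts can be worked out directly. This is only a sketch under two assumptions that are not stated in this passage: that the $n=100$ nodes form $10$ cliques of $c=10$ nodes each, and that every pair of cliques is joined by exactly one edge. Under these assumptions,
\begin{align*}
\text{intra-clique edges} &= 10 \cdot \tfrac{10 \cdot 9}{2} = 450, \\
\text{inter-clique edges} &= \tfrac{10 \cdot 9}{2} = 45, \\
\text{fully-connected edges} &= \tfrac{100 \cdot 99}{2} = 4950,
\end{align*}
so D-Cliques would use $450 + 45 = 495$ edges, i.e., a reduction of $1 - 495/4950 = 90\%$.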
%The degree of \textit{skew} of local distributions $D_i$, i.e. how much the local distribution deviates from the global distribution on each node, influences the minimal size of cliques.
%
@@ -539,7 +542,7 @@ the next section by removing the bias introduced by inter-clique edges.
\section{Optimizing with Clique Averaging and Momentum}
\label{section:clique-averaging-momentum}
In this section, we present Clique Averaging, a simple modification of D-SGD
which removes the bias caused by the inter-clique edges of
D-Cliques, and show how this can be used to successfully implement momentum
for non-IID data.