@@ -130,7 +130,7 @@ speed compared to using denser topologies.
% privacy protection \cite{amp_dec}.
In contrast to the IID case, however, our experiments demonstrate that \emph{the impact of topology is extremely significant for non-IID data}. This phenomenon is illustrated
in Figure~\ref{fig:iid-vs-non-iid-problem}: We observe that a ring or
a grid topology clearly jeopardizes the convergence speed as local
distributions do not have relative class frequencies similar to the global
distribution, i.e. they exhibit \textit{local class bias}. We stress the fact
...
@@ -380,8 +380,7 @@ been sampled by a node, i.e. an \textit{epoch}. This is equivalent to the classi
To further make results comparable across different numbers of nodes, we
scale the batch size inversely with the number of nodes,
e.g. on MNIST, 128 with 100 nodes vs. 13 with 1000 nodes. This
ensures the same number of model updates and averaging steps per epoch, which is
important for a fair comparison.\footnote{Updating and averaging models
after every example can eliminate the impact of local class bias. However, the
resulting communication overhead is impractical.}
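For concreteness, this scaling rule amounts to keeping the total number of
examples processed per round constant. A minimal sketch (our own illustrative
snippet, not the experimental code; the reference values are those of the
MNIST example above):
\begin{verbatim}
import math

def local_batch_size(n_nodes, ref_nodes=100, ref_batch=128):
    # Shrink the per-node batch size inversely with the number of
    # nodes so the examples processed per round stay constant.
    return max(1, math.ceil(ref_batch * ref_nodes / n_nodes))

assert local_batch_size(100) == 128
assert local_batch_size(1000) == 13  # the MNIST example above
\end{verbatim}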
...
@@ -483,7 +482,7 @@ topology, namely:
We refer to Algorithm~\ref{Algorithm:D-Clique-Construction} in the appendix
for a formal account of D-Cliques construction. We note that it only requires
knowledge of the local class distribution at each node. For the sake of
simplicity, we assume that the D-Cliques topology is constructed from global
knowledge of these distributions, which can easily be obtained by
decentralized averaging in a pre-processing step.
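To give intuition, the following toy sketch (ours, not
Algorithm~\ref{Algorithm:D-Clique-Construction} itself) covers the simplest
non-IID setting, where each node is assumed to hold examples of a single
class; cliques then take one node per class, so that each clique's joint
distribution approximates the global one:
\begin{verbatim}
from collections import defaultdict

def build_cliques_single_class(node_class):
    # node_class[i] is the (single) class held by node i -- a
    # simplifying assumption; the algorithm in the appendix only
    # needs each node's local class distribution.
    by_class = defaultdict(list)
    for node, c in node_class.items():
        by_class[c].append(node)
    cliques = []
    while any(by_class.values()):
        # One node per class (when available) per clique.
        cliques.append([ns.pop() for ns in by_class.values() if ns])
    return cliques

# 100 nodes, 10 classes: yields 10 balanced cliques of size 10.
cliques = build_cliques_single_class({i: i % 10 for i in range(100)})
assert all(len(c) == 10 for c in cliques)
\end{verbatim}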
...
@@ -543,12 +542,10 @@ introduced by inter-clique edges. We address this issue in the next section.
\section{Optimizing with Clique Averaging and Momentum}
\label{section:clique-averaging-momentum}
In this section, we present Clique Averaging. This feature, when added to
D-SGD, removes the bias caused by the inter-clique edges of
D-Cliques. We also show how it can be used to successfully implement momentum
for non-IID data.
\subsection{Clique Averaging: Debiasing Gradients from Inter-Clique Edges}
\label{section:clique-averaging}
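To fix ideas before the formal description, one round of D-SGD with Clique
Averaging can be sketched as follows (our own simplified rendering, not the
paper's reference implementation): gradients are averaged within the clique
only, while models are averaged over all edges.
\begin{verbatim}
import numpy as np

def dsgd_round_clique_averaging(models, grads, clique_of,
                                neighbours_of, W, lr):
    # models[i]: parameters of node i; grads[i]: its stochastic
    # gradient; clique_of[i]: i's clique, including i;
    # neighbours_of[i]: all neighbours of i, including i and
    # inter-clique edges; W: doubly stochastic mixing matrix.
    n = len(models)
    # Debiased gradients: averaged within the clique only, whose
    # combined local distributions approximate the global one.
    g = [np.mean([grads[j] for j in clique_of[i]], axis=0)
         for i in range(n)]
    # Local update using the debiased gradient.
    half = [models[i] - lr * g[i] for i in range(n)]
    # Model averaging over *all* edges, inter-clique included.
    return [sum(W[i][j] * half[j] for j in neighbours_of[i])
            for i in range(n)]
\end{verbatim}
Momentum can then be applied to the debiased gradient \texttt{g[i]} as in the
centralized setting, since it no longer reflects the local class bias.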
...
@@ -745,7 +742,7 @@ convergence.
Crucially, all random topologies fail to converge to a good solution. This
confirms that our clique structure is important to reduce variance
across nodes and improve convergence. The difference with the previous
experiment seems to be due both to the use of a higher-capacity model and to
the intrinsic characteristics of the datasets.
% We refer
% to the appendix for results on MNIST with LeNet.
...
@@ -855,8 +852,7 @@ number of edges compared to fully connecting individual nodes (18.9 edges on
average instead of 999) and a 96\% reduction in the number of messages (37.8
messages per round per node on average instead of 999). We refer to
Appendix~\ref{app:scaling} for additional results comparing the convergence
speed across different numbers of nodes. Overall, these results
show that D-Cliques can nicely scale with the number of nodes.
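These percentages follow directly from the quoted averages, as a quick check
shows (our snippet, using only the numbers above; fully connecting 1000 nodes
gives each node degree 999):
\begin{verbatim}
edges_saved    = 1 - 18.9 / 999   # ~98% fewer edges
messages_saved = 1 - 37.8 / 999   # the ~96% quoted above
print(f"{edges_saved:.1%} {messages_saved:.1%}")  # 98.1% 96.2%
\end{verbatim}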
\begin{figure}[t]
...
@@ -937,7 +933,7 @@ networks. We do not modify the simple
and efficient D-SGD
algorithm \cite{lian2017d-psgd} beyond removing some neighbor
contributions that otherwise bias the gradient direction.
% An originality of our approach is to focus on the effect of topology
% level without significantly changing the original simple and efficient D-SGD