Commit cce0a9dc authored by Erick Lavoie

Added AMK feedback

parent e1059e5d
@@ -130,7 +130,7 @@ speed compared to using denser topologies.
 % privacy protection \cite{amp_dec}.
 In contrast to the IID case however, our experiments demonstrate that \emph{the impact of topology is extremely significant for non-IID data}. This phenomenon is illustrated
-in Figure~\ref{fig:iid-vs-non-iid-problem}: we observe that a ring or
+in Figure~\ref{fig:iid-vs-non-iid-problem}: We observe that a ring or
 a grid topology clearly jeopardizes the convergence speed as local
 distributions do not have relative frequency of classes similar to the global
 distribution, i.e. they exhibit \textit{local class bias}. We stress the fact
@@ -380,8 +380,7 @@ been sampled by a node, i.e. an \textit{epoch}. This is equivalent to the classi
 To further make results comparable across different number of nodes, we lower
 the batch size proportionally to the number of nodes added, and inversely,
 e.g. on MNIST, 128 with 100 nodes vs. 13 with 1000 nodes. This
-ensures that the number of model updates and averaging per epoch remains the
-same, which is
+ensures the same number of model updates and averaging per epoch, which is
 important to have a fair comparison.\footnote{Updating and averaging models
 after every example can eliminate the impact of local class bias. However, the
 resulting communication overhead is impractical.}
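
The batch-size scaling in this hunk is plain proportionality; a minimal Python sketch (the helper name is illustrative, not from the paper) that reproduces the MNIST figures quoted above:

```python
def per_node_batch_size(ref_batch: int, ref_nodes: int, n_nodes: int) -> int:
    """Scale the per-node batch size inversely with the node count so that
    the number of model updates and averagings per epoch stays constant
    (each step still consumes n_nodes * batch_size examples in total)."""
    return max(1, round(ref_batch * ref_nodes / n_nodes))

# Matches the example in the text: 128 with 100 nodes vs. 13 with 1000 nodes.
assert per_node_batch_size(128, 100, 100) == 128
assert per_node_batch_size(128, 100, 1000) == 13  # round(12.8) == 13
```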
@@ -483,7 +482,7 @@ topology, namely:
 We refer to Algorithm~\ref{Algorithm:D-Clique-Construction} in the appendix
 for a formal account of D-Cliques construction. We note that it only requires
 the knowledge of the local class distribution at each node. For the sake of
-simplicity, we assume that D-Cliques are constructed from the global
+simplicity, we assume that D-Cliques is constructed from the global
 knowledge of these distributions, which can easily be obtained by
 decentralized averaging in a pre-processing step.
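
Algorithm~\ref{Algorithm:D-Clique-Construction} itself lives in the appendix and is not part of this diff; purely as an illustration, here is one hypothetical greedy construction consistent with the description above, assuming the per-node class distributions are already known globally (e.g. via the decentralized-averaging pre-processing step):

```python
import numpy as np

def build_cliques(class_dist: np.ndarray, clique_size: int) -> list[list[int]]:
    """Greedily group nodes into cliques whose combined class distribution
    approximates the global one. class_dist[i] is node i's normalized local
    class distribution. Illustrative sketch only, not the paper's algorithm."""
    global_dist = class_dist.mean(axis=0)
    remaining = set(range(len(class_dist)))
    cliques = []
    while remaining:
        clique, acc = [], np.zeros_like(global_dist)
        while remaining and len(clique) < clique_size:
            # Add the node that brings the clique's average distribution
            # closest (in L2 distance) to the global distribution.
            best = min(remaining, key=lambda i: np.linalg.norm(
                (acc + class_dist[i]) / (len(clique) + 1) - global_dist))
            remaining.remove(best)
            clique.append(best)
            acc += class_dist[best]
        cliques.append(clique)
    return cliques
```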
@@ -543,12 +542,10 @@ introduced by inter-clique edges. We address this issue in the next section.
 \section{Optimizing with Clique Averaging and Momentum}
 \label{section:clique-averaging-momentum}
-In this section, we present Clique Averaging, a feature that we add to
-D-SGD to remove the bias caused by the inter-cliques edges of
-D-Cliques. We then show how this can be used to successfully implement
-momentum
+In this section, we present Clique Averaging. This feature, when added to D-SGD,
+removes the bias caused by the inter-cliques edges of
+D-Cliques. We also show how it can be used to successfully implement momentum
 for non-IID data.
-%AMK: check
 \subsection{Clique Averaging: Debiasing Gradients from Inter-Clique Edges}
 \label{section:clique-averaging}
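
The hunk only names the idea, so the following is a sketch under one plausible reading of Clique Averaging suggested by the subsection title: gradients are averaged over intra-clique neighbors only (whose combined class distribution is balanced by construction), while model averaging keeps using every edge. All identifiers are illustrative:

```python
import numpy as np

def clique_averaging_step(i, params, grads, clique, neighbors, w, lr):
    """One D-SGD step at node i with Clique Averaging (illustrative sketch).
    params, grads: dicts mapping node -> np.ndarray; clique and neighbors
    both include i; w maps node -> mixing weight with weights summing to 1."""
    # Debiased gradient: intra-clique contributions only, so inter-clique
    # edges cannot skew the update direction.
    g = np.mean([grads[j] for j in clique], axis=0)
    # Model averaging still uses all neighbors, inter-clique edges included.
    mixed = sum(w[j] * params[j] for j in neighbors)
    return mixed - lr * g
```

Momentum, which the section goes on to discuss, would presumably then be applied to the debiased gradient g rather than to the raw local gradient.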
@@ -745,7 +742,7 @@ convergence.
 Crucially, all random topologies fail to converge to a good solution. This
 confirms that our clique structure is important to reduce variance
 across nodes and improve the convergence. The difference with the previous
-experiment appears to be due to both the use of a higher capacity model and to
+experiment seems to be due to both the use of a higher capacity model and to
 the intrinsic characteristics of the datasets.
 % We refer
 % to the appendix for results on MNIST with LeNet.
@@ -855,8 +852,7 @@ number of edges compared to fully connecting individual nodes (18.9 edges on
 average instead of 999) and a 96\% reduction in the number of messages (37.8
 messages per round per node on average instead of 999). We refer to
 Appendix~\ref{app:scaling} for additional results comparing the convergence
-speed across
-different number of nodes. Overall, our results
+speed across different number of nodes. Overall, these results
 show that D-Cliques can nicely scale with the number of nodes.
 \begin{figure}[t]
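
A quick sanity check on the figures in this hunk, assuming each edge carries one message per direction per round (so messages are twice the edge count):

```python
edges_per_node, messages_per_node, fully_connected = 18.9, 37.8, 999
assert messages_per_node == 2 * edges_per_node        # one message each way
print(f"reduction: {1 - messages_per_node / fully_connected:.1%}")  # 96.2%
```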
@@ -937,7 +933,7 @@ networks. We do not modify the simple
 and efficient D-SGD
 algorithm \cite{lian2017d-psgd} beyond removing some neighbor
 contributions
-that would otherwise bias the direction of the gradient.
+that otherwise bias the gradient direction.
 % An originality of our approach is to focus on the effect of topology
 % level without significantly changing the original simple and efficient D-SGD