Commit cce0a9dc authored by Erick Lavoie

Added AMK feedback

parent e1059e5d
@@ -130,7 +130,7 @@ speed compared to using denser topologies.
 % privacy protection \cite{amp_dec}.
 In contrast to the IID case however, our experiments demonstrate that \emph{the impact of topology is extremely significant for non-IID data}. This phenomenon is illustrated
-in Figure~\ref{fig:iid-vs-non-iid-problem}: we observe that a ring or
+in Figure~\ref{fig:iid-vs-non-iid-problem}: We observe that a ring or
 a grid topology clearly jeopardizes the convergence speed as local
 distributions do not have relative frequency of classes similar to the global
 distribution, i.e. they exhibit \textit{local class bias}. We stress the fact
@@ -380,8 +380,7 @@ been sampled by a node, i.e. an \textit{epoch}. This is equivalent to the classi
 To further make results comparable across different number of nodes, we lower
 the batch size proportionally to the number of nodes added, and inversely,
 e.g. on MNIST, 128 with 100 nodes vs. 13 with 1000 nodes. This
-ensures that the number of model updates and averaging per epoch remains the
-same, which is
+ensures the same number of model updates and averaging per epoch, which is
 important to have a fair comparison.\footnote{Updating and averaging models
 after every example can eliminate the impact of local class bias. However, the
 resulting communication overhead is impractical.}
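
The batch-size scaling in this hunk is plain proportionality; a minimal Python sketch (the helper name is illustrative, not from the paper) that reproduces the MNIST figures quoted above:

```python
def per_node_batch_size(ref_batch: int, ref_nodes: int, n_nodes: int) -> int:
    """Scale the per-node batch size inversely with the node count so that
    the number of model updates and averagings per epoch stays constant
    (each step still consumes n_nodes * batch_size examples in total)."""
    return max(1, round(ref_batch * ref_nodes / n_nodes))

# Matches the example in the text: 128 with 100 nodes vs. 13 with 1000 nodes.
assert per_node_batch_size(128, 100, 100) == 128
assert per_node_batch_size(128, 100, 1000) == 13  # round(12.8) == 13
```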
@@ -483,7 +482,7 @@ topology, namely:
 We refer to Algorithm~\ref{Algorithm:D-Clique-Construction} in the appendix
 for a formal account of D-Cliques construction. We note that it only requires
 the knowledge of the local class distribution at each node. For the sake of
-simplicity, we assume that D-Cliques are constructed from the global
+simplicity, we assume that D-Cliques is constructed from the global
 knowledge of these distributions, which can easily be obtained by
 decentralized averaging in a pre-processing step.
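
Algorithm~\ref{Algorithm:D-Clique-Construction} itself lives in the appendix and is not part of this diff; purely as an illustration, here is one hypothetical greedy construction consistent with the description above, assuming the per-node class distributions are already known globally (e.g. via the decentralized-averaging pre-processing step):

```python
import numpy as np

def build_cliques(class_dist: np.ndarray, clique_size: int) -> list[list[int]]:
    """Greedily group nodes into cliques whose combined class distribution
    approximates the global one. class_dist[i] is node i's normalized local
    class distribution. Illustrative sketch only, not the paper's algorithm."""
    global_dist = class_dist.mean(axis=0)
    remaining = set(range(len(class_dist)))
    cliques = []
    while remaining:
        clique, acc = [], np.zeros_like(global_dist)
        while remaining and len(clique) < clique_size:
            # Add the node that brings the clique's average distribution
            # closest (in L2 distance) to the global distribution.
            best = min(remaining, key=lambda i: np.linalg.norm(
                (acc + class_dist[i]) / (len(clique) + 1) - global_dist))
            remaining.remove(best)
            clique.append(best)
            acc += class_dist[best]
        cliques.append(clique)
    return cliques
```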
@@ -543,12 +542,10 @@ introduced by inter-clique edges. We address this issue in the next section.
 \section{Optimizing with Clique Averaging and Momentum}
 \label{section:clique-averaging-momentum}
-In this section, we present Clique Averaging, a feature that we add to
-D-SGD to remove the bias caused by the inter-cliques edges of
-D-Cliques. We then show how this can be used to successfully implement
-momentum
+In this section, we present Clique Averaging. This feature, when added to D-SGD,
+removes the bias caused by the inter-cliques edges of
+D-Cliques. We also show how it can be used to successfully implement momentum
 for non-IID data.
-%AMK: check
 \subsection{Clique Averaging: Debiasing Gradients from Inter-Clique Edges}
 \label{section:clique-averaging}
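
The hunk only names the idea, so the following is a sketch under one plausible reading of Clique Averaging suggested by the subsection title: gradients are averaged over intra-clique neighbors only (whose combined class distribution is balanced by construction), while model averaging keeps using every edge. All identifiers are illustrative:

```python
import numpy as np

def clique_averaging_step(i, params, grads, clique, neighbors, w, lr):
    """One D-SGD step at node i with Clique Averaging (illustrative sketch).
    params, grads: dicts mapping node -> np.ndarray; clique and neighbors
    both include i; w maps node -> mixing weight with weights summing to 1."""
    # Debiased gradient: intra-clique contributions only, so inter-clique
    # edges cannot skew the update direction.
    g = np.mean([grads[j] for j in clique], axis=0)
    # Model averaging still uses all neighbors, inter-clique edges included.
    mixed = sum(w[j] * params[j] for j in neighbors)
    return mixed - lr * g
```

Momentum, which the section goes on to discuss, would presumably then be applied to the debiased gradient g rather than to the raw local gradient.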
@@ -745,7 +742,7 @@ convergence.
 Crucially, all random topologies fail to converge to a good solution. This
 confirms that our clique structure is important to reduce variance
 across nodes and improve the convergence. The difference with the previous
-experiment appears to be due to both the use of a higher capacity model and to
+experiment seems to be due to both the use of a higher capacity model and to
 the intrinsic characteristics of the datasets.
 % We refer
 % to the appendix for results on MNIST with LeNet.
@@ -855,8 +852,7 @@ number of edges compared to fully connecting individual nodes (18.9 edges on
 average instead of 999) and a 96\% reduction in the number of messages (37.8
 messages per round per node on average instead of 999). We refer to
 Appendix~\ref{app:scaling} for additional results comparing the convergence
-speed across
-different number of nodes. Overall, our results
+speed across different number of nodes. Overall, these results
 show that D-Cliques can nicely scale with the number of nodes.
 \begin{figure}[t]
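
A quick sanity check on the figures in this hunk, assuming each edge carries one message per direction per round (so messages are twice the edge count):

```python
edges_per_node, messages_per_node, fully_connected = 18.9, 37.8, 999
assert messages_per_node == 2 * edges_per_node        # one message each way
print(f"reduction: {1 - messages_per_node / fully_connected:.1%}")  # 96.2%
```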
@@ -937,7 +933,7 @@ networks. We do not modify the simple
 and efficient D-SGD
 algorithm \cite{lian2017d-psgd} beyond removing some neighbor
 contributions
-that would otherwise bias the direction of the gradient.
+that otherwise bias the gradient direction.
 % An originality of our approach is to focus on the effect of topology
 % level without significantly changing the original simple and efficient D-SGD