From cce0a9dc7963a41b4ecde31b8750e99821e7ce7b Mon Sep 17 00:00:00 2001
From: Erick Lavoie <erick.lavoie@epfl.ch>
Date: Fri, 2 Apr 2021 22:29:17 +0200
Subject: [PATCH] Added AMK feedback

---
 main.tex | 22 +++++++++-------------
 1 file changed, 9 insertions(+), 13 deletions(-)

diff --git a/main.tex b/main.tex
index 987f5b7..56dac9e 100644
--- a/main.tex
+++ b/main.tex
@@ -130,7 +130,7 @@ speed compared to using denser topologies.
 % privacy protection \cite{amp_dec}.
 
 In contrast to the IID case however, our experiments demonstrate that \emph{the impact of topology is extremely significant for non-IID data}. This phenomenon is illustrated
-in Figure~\ref{fig:iid-vs-non-iid-problem}: we observe that  a ring or
+in Figure~\ref{fig:iid-vs-non-iid-problem}: We observe that a ring or
 a grid topology clearly jeopardizes the convergence speed as local
 distributions do not have relative frequency of classes similar to the global
 distribution, i.e. they exhibit \textit{local class bias}. We stress the fact
@@ -380,8 +380,7 @@ been sampled by a node, i.e. an \textit{epoch}. This is equivalent to the classi
 To further make results comparable across different number of nodes, we lower
 the batch size proportionally to the number of nodes added, and inversely,
 e.g. on MNIST, 128 with 100 nodes vs. 13 with 1000 nodes. This
-ensures that the number of model updates and averaging per epoch remains the
-same, which is
+ensures the same number of model updates and averaging steps per epoch, which is
 important to have a fair comparison.\footnote{Updating and averaging models
 after every example can eliminate the impact of local class bias. However, the
 resulting communication overhead is impractical.}
@@ -483,7 +482,7 @@ topology, namely:
 We refer to Algorithm~\ref{Algorithm:D-Clique-Construction} in the appendix
 for a formal account of D-Cliques construction. We note that it only requires
 the knowledge of the local class distribution at each node. For the sake of
-simplicity, we assume that D-Cliques are constructed from the global
+simplicity, we assume that the D-Cliques topology is constructed from the global
 knowledge of these distributions, which can easily be obtained by
 decentralized averaging in a pre-processing step. 
 
@@ -543,12 +542,10 @@ introduced by inter-clique edges. We address this issue in the next section.
 \section{Optimizing with Clique Averaging and Momentum}
 \label{section:clique-averaging-momentum}
 
-In this section, we present Clique Averaging, a feature that we add to
-D-SGD to remove the bias caused by the inter-cliques edges of
-D-Cliques. We then show how this can be used to successfully implement
-momentum
+In this section, we present Clique Averaging. This feature, when added to D-SGD,
+removes the bias caused by the inter-clique edges of
+D-Cliques. We also show how it can be used to successfully implement momentum
 for non-IID data.
-%AMK: check
 
 \subsection{Clique Averaging: Debiasing Gradients from Inter-Clique Edges}
 \label{section:clique-averaging}
@@ -745,7 +742,7 @@ convergence.
 Crucially, all random topologies fail to converge to a good solution. This
 confirms that our clique structure is important to reduce variance
 across nodes and improve the convergence. The difference with the previous
-experiment appears to be due to both the use of a higher capacity model and to
+experiment seems to be due to both the use of a higher-capacity model and to
 the intrinsic characteristics of the datasets.
 % We refer
 % to the appendix for results on MNIST with LeNet.
@@ -855,8 +852,7 @@ number of edges compared to fully connecting individual nodes (18.9 edges on
 average instead of 999) and a 96\% reduction in the number of messages (37.8
 messages per round per node on average instead of 999). We refer to
 Appendix~\ref{app:scaling} for additional results comparing the convergence
-speed across
-different number of nodes. Overall, our results
+speed across different numbers of nodes. Overall, these results
 show that D-Cliques can nicely scale with the number of nodes.
 
 \begin{figure}[t]
@@ -937,7 +933,7 @@ networks. We do not modify the simple
 and efficient D-SGD
 algorithm \cite{lian2017d-psgd} beyond removing some neighbor
 contributions
-that would otherwise bias the direction of the gradient.
+that would otherwise bias the gradient direction.
 
 % An originality of our approach is to focus on the effect of topology
 % level without significantly changing the original simple and efficient D-SGD
-- 
GitLab
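
The batch-size scaling described in the second hunk (keeping the number of model updates and averaging steps per epoch constant as nodes are added) can be sketched as follows. This is a minimal illustration, not code from the paper; the scaling rule is inferred from the stated MNIST example of 128 with 100 nodes vs. 13 with 1000 nodes, and the function name and reference values are assumptions.

```python
def per_node_batch_size(n_nodes, ref_batch=128, ref_nodes=100):
    """Scale the per-node batch size inversely with the number of nodes,
    so that batch_size * n_nodes stays roughly constant. This keeps the
    number of model updates and averaging steps per epoch the same across
    configurations (reference point inferred from the MNIST example)."""
    return max(1, round(ref_batch * ref_nodes / n_nodes))

print(per_node_batch_size(100))   # 128
print(per_node_batch_size(1000))  # 13
```

With 1000 nodes the budget 128 * 100 = 12800 examples per round divides into 12.8 per node, rounded to 13, matching the figure quoted in the patch.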