@@ -235,7 +235,7 @@ study in Section~\ref{section:non-clustered}. We review some related work in
\label{section:problem}
We consider a set $N = \{1, \dots, n\}$ of $n$ nodes seeking to
collaboratively solve a classification task with $c$ classes.
% where each node can communicate with its neighbours according to the mixing matrix $W$ in which $W_{ij}$ defines the \textit{weight} of the outgoing connection from node $i$ to $j$. $W_{ij} = 0$ means that there is no connection from node $i$ to $j$ and $W_{ij} > 0$ means there is a connection.
%AMK:explain the weight
...
@@ -276,12 +276,13 @@ This weighted average is defined by a
mixing matrix $W$, in which $W_{ij}$ corresponds to the weight of
the outgoing connection from node $i$ to $j$, with $W_{ij}=0$ for
$\{i,j\}\notin E$. To ensure that the local models converge on average
to a stationary point of Problem~\eqref{eq:dist-optimization-problem},
$W$ must be doubly stochastic ($\sum_{j \in N} W_{ij}=1$ and
$\sum_{j \in N} W_{ji}=1$) and symmetric, i.e.\ $W_{ij}=
W_{ji}$~\cite{lian2017d-psgd}.
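The text does not specify how $W$ is built from the graph; one standard construction satisfying both properties (an illustrative assumption, not necessarily the one used here) is the Metropolis-Hastings rule, sketched below on a small ring topology:

```python
import numpy as np

def metropolis_weights(adj):
    """Symmetric, doubly stochastic mixing matrix from an adjacency
    matrix via the Metropolis-Hastings rule (illustrative choice)."""
    n = adj.shape[0]
    deg = adj.sum(axis=1)
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j and adj[i, j]:
                W[i, j] = 1.0 / (1 + max(deg[i], deg[j]))
        W[i, i] = 1.0 - W[i].sum()  # self-weight absorbs the remainder
    return W

# 6-node ring: each node connected to its two neighbours
n = 6
adj = np.zeros((n, n), dtype=int)
for i in range(n):
    adj[i, (i + 1) % n] = adj[(i + 1) % n, i] = 1

W = metropolis_weights(adj)
assert np.allclose(W.sum(axis=0), 1)  # column sums: doubly stochastic
assert np.allclose(W.sum(axis=1), 1)  # row sums
assert np.allclose(W, W.T)            # symmetric
```

Any connected topology works the same way: only the adjacency matrix changes, and the resulting $W$ keeps the row/column-sum and symmetry properties required for convergence.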
\begin{algorithm}[t]
\caption{D-SGD, Node $i$}
...
@@ -349,7 +350,7 @@ training set for training and validation respectively. The remaining 5k
training examples were randomly removed to ensure all 10 classes are
balanced and the dataset is evenly divisible across 100 and 1000 nodes.
We use all 10k examples of the test set to measure prediction accuracy.
For CIFAR10, classes are evenly balanced: we use 45k/50k images of the
original training set for training, 5k/50k for validation, and all 10k
examples of the test set for measuring prediction accuracy.
...
@@ -365,8 +366,8 @@ centralized setting.
% compared to the 99\% achieved by state-of-the-art.
These models are thus reasonably accurate (which is sufficient to
study the effect of the topology) while being sufficiently fast to
train in a fully decentralized setting and simple enough to configure
and analyze. Regarding hyper-parameters, we jointly optimize the
learning rate and mini-batch size on the validation set for 100 nodes,
obtaining respectively $0.1$ and $128$ for MNIST and $0.002$ and $20$
for CIFAR10.
...
@@ -379,15 +380,15 @@ average) as a function of the number of times each example of the dataset has
been sampled by a node, i.e.\ an \textit{epoch}. This is equivalent to
the classic case of a single node sampling the full distribution.
To further make results comparable across different numbers of nodes,
we lower the batch size proportionally to the number of nodes added,
and inversely, e.g.\ on MNIST, 128 with 100 nodes vs.\ 13 with 1000
nodes. This ensures the same number of model updates and averaging
steps per epoch, which is important for a fair
comparison.\footnote{Updating and averaging models after every example
can eliminate the impact of local class bias. However, the resulting
communication overhead is impractical.}
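The scaling rule can be sketched as follows; the single-node global batch of 12800 is implied by proportional scaling from 128 at 100 nodes, and the rounding-up to a whole number of examples per node is an assumption consistent with the reported 13 at 1000 nodes:

```python
import math

def per_node_batch_size(global_batch, n_nodes):
    # Split the single-node batch across nodes, rounding up so each
    # node processes a whole number of examples per step
    # (assumed rounding rule, consistent with the reported values).
    return math.ceil(global_batch / n_nodes)

# MNIST setup: 128 per node with 100 nodes, 13 per node with 1000 nodes
assert per_node_batch_size(12800, 100) == 128
assert per_node_batch_size(12800, 1000) == 13
```

With the batch size scaled this way, every configuration performs the same number of update-and-average rounds per epoch, so differences in accuracy can be attributed to the topology rather than to the optimization schedule.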
Finally, we compare our results against an ideal baseline: either a
fully-connected network topology with the same number of nodes or a
single IID node. In both cases, the topology has no effect on the
optimization. For a certain choice of number of nodes and mini-batch
size, both approaches are equivalent. %ensure a single
% model is optimized, which therefore removes the effect of the topology. While, both approaches compute an equivalent gradient with the same expectation, we favored using a single IID node for CIFAR10 for the sake of training speed.
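The equivalence of the two baselines can be seen with a toy check: averaging the mini-batch gradients of $n$ fully-connected nodes, each holding $b$ examples, gives exactly the gradient a single node would compute over one batch of $n \cdot b$ examples (the random per-example "gradients" below are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy per-example "gradients": 1000 examples, 5 parameters each.
grads = rng.normal(size=(1000, 5))

n_nodes, per_node = 100, 10  # 100 nodes, mini-batch of 10 each
shards = grads[:n_nodes * per_node].reshape(n_nodes, per_node, -1)

# Fully-connected topology: average the per-node mini-batch gradients.
fc_update = shards.mean(axis=1).mean(axis=0)
# Single node computing one large batch of n_nodes * per_node examples.
single_update = grads[:n_nodes * per_node].mean(axis=0)

assert np.allclose(fc_update, single_update)
```

Because the mean of equal-sized per-node means equals the overall mean, the two baselines produce identical updates for matching node counts and mini-batch sizes, which is why either can serve as the topology-free reference.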