From 29319c399953400229abdd9a578430a8f5060e2b Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Aur=C3=A9lien?= <aurelien.bellet@inria.fr>
Date: Fri, 2 Apr 2021 18:16:01 +0200
Subject: [PATCH] sec 3

---
 main.tex | 33 ++++++++++++++++++---------------
 1 file changed, 18 insertions(+), 15 deletions(-)

diff --git a/main.tex b/main.tex
index 620356d..3910478 100644
--- a/main.tex
+++ b/main.tex
@@ -408,16 +408,19 @@ mini-batch size, both approaches are equivalent. %ensure a single
 In this section, we present the design of D-Cliques. To give an intuition of our approach, let us consider the neighborhood of a single node in a grid
 similar to that of Figure~\ref{fig:grid-IID-vs-non-IID}, represented on Figure~\ref{fig:grid-iid-vs-non-iid-neighbourhood}.
 % where each color represents a class of data.
-The colors of a node represent the different classes it holds
-locally. In the IID setting (Figure~\ref{fig:grid-iid-neighbourhood}), each
+The colors of a node represent the different classes present in its local
+dataset. In the IID setting (Figure~\ref{fig:grid-iid-neighbourhood}), each
 node has examples of all classes in equal proportions. In the non-IID setting
 (Figure~\ref{fig:grid-non-iid-neighbourhood}), each node has examples of only a
-single class and nodes are distributed randomly in the grid. A single training step, from the point of view of the center node, is equivalent to sampling a mini-batch five times larger from the union of the local distributions of all illustrated nodes.
+single class and nodes are distributed randomly in the grid.
+
+A single training step, from the point of view of the center node, is equivalent to sampling a mini-batch five times larger from the union of the local distributions of all illustrated nodes.
 In the IID case, since gradients are computed from examples of all classes,
-the resulting average gradient points in a direction that reduces the
-loss across all classes. In contrast, in the non-IID case, only a subset of classes are
-represented in the immediate neighborhood of the node and the gradients will
+the resulting averaged gradient points in a direction that tends to reduce
+the loss across all classes. In contrast, in the non-IID case, only a subset
+of classes are
+represented in the immediate neighborhood of the node, thus the gradients will
 be biased towards these classes.
 % more than in the IID case.
 Importantly, as the distributed averaging algorithm takes several steps to
 converge, this variance persists across iterations as the locally computed
@@ -442,7 +445,7 @@ impractical.
     \includegraphics[width=\textwidth]{figures/grid-non-iid-neighbourhood}
     \caption{\label{fig:grid-non-iid-neighbourhood} Non-IID}
   \end{subfigure}
-  \caption{Neighborhood in an IID and non-IID Grid.}
+  \caption{Neighborhood in an IID and non-IID grid.}
   \label{fig:grid-iid-vs-non-iid-neighbourhood}
 \end{figure}

@@ -463,9 +466,9 @@ edge with all other cliques, see Figure~\ref{fig:d-cliques-figure} for
 the corresponding D-Cliques network in the case of $n=100$ nodes and $c=10$
 classes. We will explore sparser inter-clique topologies in Section~\ref{section:interclique-topologies}.
-The mixing matrix $W$ required by D-SGD is obtained from the above
-topology using standard
-Metropolis-Hasting weights~\cite{xiao2004fast}:
+The mixing matrix $W$ required by D-SGD is obtained from standard
+Metropolis-Hastings weights~\cite{xiao2004fast} computed from the above
+topology, namely:
 \begin{equation}
   W_{ij} = \begin{cases}
     \frac{1}{\max(\text{degree}(i), \text{degree}(j)) + 1} & \text{if}~i \neq
@@ -519,15 +522,15 @@ speed on MNIST.}
 \end{figure}

 Figure~\ref{fig:d-cliques-example-convergence-speed} illustrates the
-performance D-Cliques on MNIST with $n=100$ nodes. The convergence speed is
+performance of D-Cliques on MNIST with $n=100$ nodes. Observe that the
+convergence speed is
 very close to that of a fully-connected topology, and significantly better
 than with a ring or a grid (see Figure~\ref{fig:iid-vs-non-iid-problem}). With
 100 nodes, it offers a reduction of $\approx90\%$ in the number of edges
 compared to a fully-connected topology. Nonetheless, there is still
-significant variance in the accuracy across nodes, which we address in
-the next section by removing the bias introduced by inter-clique edges.
-
+significant variance in the accuracy across nodes, which is due to the bias
+introduced by inter-clique edges. We address this issue in the next section.
 %The degree of \textit{skew} of local distributions $D_i$, i.e. how much the local distribution deviates from the global distribution on each node, influences the minimal size of cliques.
 %

@@ -539,7 +542,7 @@ the next section by removing the bias introduced by inter-clique edges.
 \section{Optimizing with Clique Averaging and Momentum}
 \label{section:clique-averaging-momentum}

-In this section we present Clique Averaging, a simple modification of D-SGD
+In this section, we present Clique Averaging, a simple modification of D-SGD
 which removes the bias caused by the inter-cliques edges of D-Cliques, and
 show how this can be used to successfully implement momentum for non-IID
 data.
--
GitLab
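For reference, the construction described in the patched section (fully connected cliques, one inter-clique edge between every pair of cliques, and Metropolis-Hastings mixing weights for D-SGD) can be sketched as below. This is a hypothetical illustration, not the authors' implementation: the equal clique sizes, the particular nodes chosen to carry inter-clique edges, and all names are assumptions; only the weight formula $W_{ij} = 1/(\max(\text{degree}(i), \text{degree}(j)) + 1)$ and the overall topology come from the text.

# Hypothetical sketch (not the paper's code) of a D-Cliques-style topology
# and its Metropolis-Hastings mixing matrix, assuming equal-size cliques.
import numpy as np

def d_cliques_adjacency(n_nodes=100, n_cliques=10):
    """Adjacency matrix: nodes split into equal cliques, each clique fully
    connected internally, plus one edge between every pair of cliques."""
    assert n_nodes % n_cliques == 0
    size = n_nodes // n_cliques
    A = np.zeros((n_nodes, n_nodes), dtype=int)
    cliques = [list(range(c * size, (c + 1) * size)) for c in range(n_cliques)]
    for clique in cliques:                      # intra-clique edges
        for i in clique:
            for j in clique:
                if i != j:
                    A[i, j] = 1
    for a in range(n_cliques):                  # one edge per pair of cliques,
        for b in range(a + 1, n_cliques):       # spread over different nodes
            i, j = cliques[a][b % size], cliques[b][a % size]
            A[i, j] = A[j, i] = 1
    return A

def metropolis_hastings_weights(A):
    """Mixing matrix W from the formula in the section:
    W_ij = 1 / (max(degree(i), degree(j)) + 1) for neighbors i != j,
    W_ii = 1 - sum_{j != i} W_ij, and 0 otherwise."""
    degree = A.sum(axis=1)
    n = A.shape[0]
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j and A[i, j]:
                W[i, j] = 1.0 / (max(degree[i], degree[j]) + 1)
    W[np.diag_indices(n)] = 1.0 - W.sum(axis=1)  # diagonal completes each row
    return W

A = d_cliques_adjacency()
W = metropolis_hastings_weights(A)
print(W.sum(axis=1))   # every row sums to 1; W is symmetric by construction

With $n=100$ nodes and $c=10$ classes as in the section, each node keeps its 9 intra-clique edges plus at most one inter-clique edge, and the resulting $W$ is symmetric with rows summing to 1, i.e. the doubly stochastic mixing matrix D-SGD expects.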