sec 3

Commit 29319c39 authored 3 years ago by aurelien.bellet
--- a/main.tex
+++ b/main.tex
@@ -408,16 +408,19 @@ mini-batch size, both approaches are equivalent.  %ensure a single
 In this section, we present the design of D-Cliques. To give an intuition of our approach, let us consider the neighborhood of a single node in a grid similar to that of Figure~\ref{fig:grid-IID-vs-non-IID}, represented on Figure~\ref{fig:grid-iid-vs-non-iid-neighbourhood}.
 % where  each color represents a class of data.
-The colors of a node represent the different classes it holds
+The colors of a node represent the different classes present in its local
-locally. In the IID setting (Figure~\ref{fig:grid-iid-neighbourhood}), each
+dataset. In the IID setting (Figure~\ref{fig:grid-iid-neighbourhood}), each
 node has examples of all classes in equal proportions. In the non-IID setting 
 (Figure~\ref{fig:grid-non-iid-neighbourhood}), each node has examples of only
 a
-single class and nodes are distributed randomly in the grid. A single training step, from the point of view of the center node, is equivalent to sampling a mini-batch five times larger from the union of the local distributions of all illustrated nodes.
+single class and nodes are distributed randomly in the grid.
+A single training step, from the point of view of the center node, is equivalent to sampling a mini-batch five times larger from the union of the local distributions of all illustrated nodes.
 In the IID case, since gradients are computed from examples of all classes,
-the resulting average gradient  points in a direction that reduces the
+the resulting averaged gradient  points in a direction that tends to reduce
-loss across all classes. In contrast, in the non-IID case, only a subset of classes are
+the loss across all classes. In contrast, in the non-IID case, only a subset
-represented in the immediate neighborhood of the node and the gradients will
+of classes are
+represented in the immediate neighborhood of the node, thus the gradients will
 be biased towards these classes. % more than in the IID case.
 Importantly, as the distributed averaging algorithm takes several steps to
 converge, this variance persists across iterations as the locally computed
@@ -442,7 +445,7 @@ impractical.
         \includegraphics[width=\textwidth]{figures/grid-non-iid-neighbourhood}
 \caption{\label{fig:grid-non-iid-neighbourhood}  Non-IID}
     \end{subfigure}
-        \caption{Neighborhood in an IID and non-IID Grid.}
+        \caption{Neighborhood in an IID and non-IID grid.}
        \label{fig:grid-iid-vs-non-iid-neighbourhood}
 \end{figure}
@@ -463,9 +466,9 @@ edge with all other cliques, see Figure~\ref{fig:d-cliques-figure} for the
 corresponding D-Cliques network in the case of $n=100$ nodes and $c=10$
 classes. We will explore sparser inter-clique topologies in Section~\ref{section:interclique-topologies}.
-The mixing matrix $W$ required by D-SGD is obtained from the above
+The mixing matrix $W$ required by D-SGD is obtained from standard
-topology using standard
+Metropolis-Hastings weights~\cite{xiao2004fast} computed from the above
-Metropolis-Hasting weights~\cite{xiao2004fast}:
+topology, namely:
 \begin{equation}
  W_{ij} = \begin{cases}
    \frac{1}{\max(\text{degree}(i), \text{degree}(j)) + 1} & \text{if}~i \neq
@@ -519,15 +522,15 @@ speed on MNIST.}
 \end{figure}
 Figure~\ref{fig:d-cliques-example-convergence-speed} illustrates the
-performance D-Cliques on MNIST with $n=100$ nodes. The convergence speed is
+performance of D-Cliques on MNIST with $n=100$ nodes. Observe that the
+convergence speed is
 very close
 to that of a fully-connected topology, and significantly better than with
 a ring or a grid (see Figure~\ref{fig:iid-vs-non-iid-problem}). With 
 100 nodes, it offers a reduction of $\approx90\%$ in the number of edges
 compared to a fully-connected topology. Nonetheless, there is still
-significant variance in the accuracy across nodes, which we address in
+significant variance in the accuracy across nodes, which is due to the bias
-the next section by removing the bias introduced by inter-clique edges.
+introduced by inter-clique edges. We address this issue in the next section.
 %The degree of \textit{skew} of local distributions $D_i$, i.e. how much the local distribution deviates from the global distribution on each node, influences the minimal size of cliques. 
 %
@@ -539,7 +542,7 @@ the next section by removing the bias introduced by inter-clique edges.
 \section{Optimizing with Clique Averaging and Momentum}
 \label{section:clique-averaging-momentum}
-In this section we present Clique Averaging, a simple modification of D-SGD
+In this section, we present Clique Averaging, a simple modification of D-SGD
 which removes the bias caused by the inter-cliques edges of
 D-Cliques, and show how this can be used to successfully implement momentum
 for non-IID data.
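
Reviewer note: the $\approx90\%$ edge reduction and the mixing matrix $W$ touched by this commit can be sanity-checked with a short sketch. It assumes the topology described in the surrounding text for $n=100$ and $c=10$: ten fully connected cliques of ten nodes each, with exactly one edge between every pair of cliques. The function names, the node numbering, and the choice of which clique members carry the inter-clique edges are hypothetical. The diagonal entry $W_{ii} = 1 - \sum_{j \neq i} W_{ij}$ is the usual completion of the weight rule and is not visible in the hunk above, so treat it as an assumption too.

```python
import itertools
import numpy as np


def d_cliques_edges(n_cliques=10, clique_size=10):
    """Edge set of an assumed D-Cliques topology: fully connected cliques
    plus exactly one edge between every pair of cliques."""
    edges = set()
    # Intra-clique edges: every pair of nodes inside a clique.
    for c in range(n_cliques):
        members = range(c * clique_size, (c + 1) * clique_size)
        edges.update(itertools.combinations(members, 2))
    # Inter-clique edges: one per pair of cliques. The endpoints are an
    # arbitrary (hypothetical) choice that spreads edges across members.
    for a, b in itertools.combinations(range(n_cliques), 2):
        u = a * clique_size + (b % clique_size)
        v = b * clique_size + (a % clique_size)
        edges.add((u, v))
    return edges


def metropolis_hastings_weights(n, edges):
    """Mixing matrix with W_ij = 1 / (max(deg(i), deg(j)) + 1) on edges,
    and the remaining mass placed on the diagonal (assumed completion)."""
    degree = np.zeros(n, dtype=int)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    W = np.zeros((n, n))
    for u, v in edges:
        W[u, v] = W[v, u] = 1.0 / (max(degree[u], degree[v]) + 1)
    # Diagonal is still zero here, so the row sums cover only j != i.
    W[np.diag_indices(n)] = 1.0 - W.sum(axis=1)
    return W


n = 100
edges = d_cliques_edges()
full_edges = n * (n - 1) // 2
reduction = 1.0 - len(edges) / full_edges
W = metropolis_hastings_weights(n, edges)
print(len(edges), full_edges, round(reduction, 3))
```

Under these assumptions the count is 450 intra-clique edges plus 45 inter-clique edges, against 4950 edges for the fully connected graph, which matches the reduction quoted in the paragraph. The resulting $W$ is symmetric with rows summing to one.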