From 29319c399953400229abdd9a578430a8f5060e2b Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Aur=C3=A9lien?= <aurelien.bellet@inria.fr>
Date: Fri, 2 Apr 2021 18:16:01 +0200
Subject: [PATCH] sec 3

---
 main.tex | 33 ++++++++++++++++++---------------
 1 file changed, 18 insertions(+), 15 deletions(-)

diff --git a/main.tex b/main.tex
index 620356d..3910478 100644
--- a/main.tex
+++ b/main.tex
@@ -408,16 +408,19 @@ mini-batch size, both approaches are equivalent. %ensure a single
 In this section, we present the design of D-Cliques. To give an intuition of our approach, let us consider the neighborhood of a single node in a grid
 similar to that of Figure~\ref{fig:grid-IID-vs-non-IID}, represented on Figure~\ref{fig:grid-iid-vs-non-iid-neighbourhood}.
 % where each color represents a class of data.
-The colors of a node represent the different classes it holds
-locally. In the IID setting (Figure~\ref{fig:grid-iid-neighbourhood}), each
+The colors of a node represent the different classes present in its local
+dataset. In the IID setting (Figure~\ref{fig:grid-iid-neighbourhood}), each
 node has examples of all classes in equal proportions. In the non-IID setting
 (Figure~\ref{fig:grid-non-iid-neighbourhood}), each node has examples of only a
-single class and nodes are distributed randomly in the grid. A single training step, from the point of view of the center node, is equivalent to sampling a mini-batch five times larger from the union of the local distributions of all illustrated nodes.
+single class and nodes are distributed randomly in the grid.
+
+A single training step, from the point of view of the center node, is equivalent to sampling a mini-batch five times larger from the union of the local distributions of all illustrated nodes.
 In the IID case, since gradients are computed from examples of all classes,
-the resulting average gradient points in a direction that reduces the
-loss across all classes. In contrast, in the non-IID case, only a subset of classes are
-represented in the immediate neighborhood of the node and the gradients will
+the resulting averaged gradient points in a direction that tends to reduce
+the loss across all classes. In contrast, in the non-IID case, only a subset
+of classes are
+represented in the immediate neighborhood of the node, thus the gradients will
 be biased towards these classes.
 % more than in the IID case.
 Importantly, as the distributed averaging algorithm takes several steps to
 converge, this variance persists across iterations as the locally computed
@@ -442,7 +445,7 @@ impractical.
     \includegraphics[width=\textwidth]{figures/grid-non-iid-neighbourhood}
     \caption{\label{fig:grid-non-iid-neighbourhood} Non-IID}
   \end{subfigure}
-  \caption{Neighborhood in an IID and non-IID Grid.}
+  \caption{Neighborhood in an IID and non-IID grid.}
   \label{fig:grid-iid-vs-non-iid-neighbourhood}
 \end{figure}

@@ -463,9 +466,9 @@ edge with all other cliques, see Figure~\ref{fig:d-cliques-figure} for
 the corresponding D-Cliques network in the case of $n=100$ nodes and $c=10$
 classes. We will explore sparser inter-clique topologies in Section~\ref{section:interclique-topologies}.
-The mixing matrix $W$ required by D-SGD is obtained from the above
-topology using standard
-Metropolis-Hasting weights~\cite{xiao2004fast}:
+The mixing matrix $W$ required by D-SGD is obtained from standard
+Metropolis-Hastings weights~\cite{xiao2004fast} computed from the above
+topology, namely:
 \begin{equation}
   W_{ij} = \begin{cases}
     \frac{1}{\max(\text{degree}(i), \text{degree}(j)) + 1} & \text{if}~i \neq
@@ -519,15 +522,15 @@ speed on MNIST.}
 \end{figure}

 Figure~\ref{fig:d-cliques-example-convergence-speed} illustrates the
-performance D-Cliques on MNIST with $n=100$ nodes. The convergence speed is
+performance of D-Cliques on MNIST with $n=100$ nodes. Observe that the
+convergence speed is
 very close to that of a fully-connected topology, and significantly better
 than with a ring or a grid (see Figure~\ref{fig:iid-vs-non-iid-problem}). With
 100 nodes, it offers a reduction of $\approx90\%$ in the number of edges
 compared to a fully-connected topology. Nonetheless, there is still
-significant variance in the accuracy across nodes, which we address in
-the next section by removing the bias introduced by inter-clique edges.
-
+significant variance in the accuracy across nodes, which is due to the bias
+introduced by inter-clique edges. We address this issue in the next section.
 %The degree of \textit{skew} of local distributions $D_i$, i.e. how much the local distribution deviates from the global distribution on each node, influences the minimal size of cliques.
 %

@@ -539,7 +542,7 @@ the next section by removing the bias introduced by inter-clique edges.
 \section{Optimizing with Clique Averaging and Momentum}
 \label{section:clique-averaging-momentum}

-In this section we present Clique Averaging, a simple modification of D-SGD
+In this section, we present Clique Averaging, a simple modification of D-SGD
 which removes the bias caused by the inter-cliques edges of D-Cliques, and
 show how this can be used to successfully implement momentum for non-IID
 data.
--
GitLab
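For reference, the construction described in the patched section (fully connected cliques, one inter-clique edge between every pair of cliques, and Metropolis-Hastings mixing weights for D-SGD) can be sketched as below. This is a hypothetical illustration, not the authors' implementation: the equal clique sizes, the particular nodes chosen to carry inter-clique edges, and all names are assumptions; only the weight formula $W_{ij} = 1/(\max(\text{degree}(i), \text{degree}(j)) + 1)$ and the overall topology come from the text.

# Hypothetical sketch (not the paper's code) of a D-Cliques-style topology
# and its Metropolis-Hastings mixing matrix, assuming equal-size cliques.
import numpy as np

def d_cliques_adjacency(n_nodes=100, n_cliques=10):
    """Adjacency matrix: nodes split into equal cliques, each clique fully
    connected internally, plus one edge between every pair of cliques."""
    assert n_nodes % n_cliques == 0
    size = n_nodes // n_cliques
    A = np.zeros((n_nodes, n_nodes), dtype=int)
    cliques = [list(range(c * size, (c + 1) * size)) for c in range(n_cliques)]
    for clique in cliques:                      # intra-clique edges
        for i in clique:
            for j in clique:
                if i != j:
                    A[i, j] = 1
    for a in range(n_cliques):                  # one edge per pair of cliques,
        for b in range(a + 1, n_cliques):       # spread over different nodes
            i, j = cliques[a][b % size], cliques[b][a % size]
            A[i, j] = A[j, i] = 1
    return A

def metropolis_hastings_weights(A):
    """Mixing matrix W from the formula in the section:
    W_ij = 1 / (max(degree(i), degree(j)) + 1) for neighbors i != j,
    W_ii = 1 - sum_{j != i} W_ij, and 0 otherwise."""
    degree = A.sum(axis=1)
    n = A.shape[0]
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j and A[i, j]:
                W[i, j] = 1.0 / (max(degree[i], degree[j]) + 1)
    W[np.diag_indices(n)] = 1.0 - W.sum(axis=1)  # diagonal completes each row
    return W

A = d_cliques_adjacency()
W = metropolis_hastings_weights(A)
print(W.sum(axis=1))   # every row sums to 1; W is symmetric by construction

With $n=100$ nodes and $c=10$ classes as in the section, each node keeps its 9 intra-clique edges plus at most one inter-clique edge, and the resulting $W$ is symmetric with rows summing to 1, i.e. the doubly stochastic mixing matrix D-SGD expects.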