Commit 014a9f4d authored by aurelien.bellet

sec 3

parent 4882470f
@@ -262,16 +262,22 @@ In this work, we use the popular Decentralized Stochastic
Gradient Descent algorithm, aka D-SGD~\cite{lian2017d-psgd}. As
shown in Algorithm~\ref{Algorithm:D-PSGD},
%AMK: can we say why: most popular, most efficient ?
a single iteration of D-SGD at node $i$ consists of sampling a mini-batch
from its local distribution $D_i$, updating its local model $x_i$ by taking
a stochastic gradient descent (SGD) step according to this sample, and
performing a weighted average of its local model with those of its
neighbors. This weighted average is defined by a mixing matrix $W$, in
which $W_{ij}$ corresponds to the weight of the outgoing connection from
node $i$ to $j$, and $W_{ij} = 0$ for $\{i,j\} \notin E$. To ensure that
the local models converge on average to a (local) optimum of Problem
\eqref{eq:dist-optimization-problem}, $W$ must be doubly stochastic
($\sum_{j \in N} W_{ij} = 1$ and $\sum_{j \in N} W_{ji} = 1$) and
symmetric, i.e., $W_{ij} = W_{ji}$; see~\cite{lian2017d-psgd}.
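For concreteness, the following Python sketch simulates these two phases on
a toy problem (the ring overlay, least-squares objective, and synthetic
data are illustrative placeholders, not the setting evaluated in this
work):

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
n, d, batch, lr = 16, 10, 32, 0.1   # nodes, features, mini-batch, step size

# Toy local datasets (X_i, y_i) and local models x_i (placeholders).
data = [(rng.normal(size=(200, d)), rng.normal(size=200)) for _ in range(n)]
models = [np.zeros(d) for _ in range(n)]

# Doubly stochastic, symmetric mixing matrix for a simple ring topology.
W = np.zeros((n, n))
for i in range(n):
    W[i, [i, (i - 1) % n, (i + 1) % n]] = 1 / 3

for step in range(100):
    # (1) each node takes a local SGD step on a mini-batch drawn from D_i
    half = []
    for i in range(n):
        X, y = data[i]
        idx = rng.choice(len(y), size=batch, replace=False)
        grad = X[idx].T @ (X[idx] @ models[i] - y[idx]) / batch
        half.append(models[i] - lr * grad)
    # (2) each node averages its model with its neighbors' models using W
    models = [sum(W[i, j] * half[j] for j in range(n) if W[i, j] > 0)
              for i in range(n)]
\end{verbatim}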
\begin{algorithm}[t]
   \caption{D-SGD, Node $i$}
@@ -324,8 +330,9 @@ Our results can be extended to support additional compounding factors in future
\label{section:experimental-settings}
%AMK: I would have put this in the eval section, as I would not have mixed design and eval.
Our main goal is to provide a fair comparison of the convergence speed
across different topologies and algorithmic variations, in order to show
that our approach can remove much of the effect of local class bias.
We experiment with two datasets: MNIST~\cite{mnistWebsite} and
@@ -391,13 +398,30 @@ mini-batch size, both approaches are equivalent. %ensure a single
\section{D-Cliques: Creating Locally Representative Cliques}
\label{section:d-cliques}
In this section, we present the design of D-Cliques. To give an intuition
of our approach, let us consider the neighborhood of a single node in a
grid similar to that of Figure~\ref{fig:grid-IID-vs-non-IID}, represented
in Figure~\ref{fig:grid-iid-vs-non-iid-neighbourhood}.
% where each color represents a class of data.
The colors of a node represent the different classes it holds locally. In
the IID setting (Figure~\ref{fig:grid-iid-neighbourhood}), each node has
samples of all classes in equal proportions. In the non-IID setting
(Figure~\ref{fig:grid-non-iid-neighbourhood}), each node has samples of
only a single class and nodes are distributed randomly in the grid. A
single training step, from the point of view of the center node, is
equivalent to sampling a mini-batch five times larger from the union of
the local distributions of all illustrated nodes.
In the IID case, since gradients are computed from examples of all classes,
the resulting average gradient points in a direction that reduces the
loss across all classes. In contrast, in the non-IID case, only a subset
of classes is represented in the immediate neighborhood of the node, and
the gradients will be biased towards these classes. % more than in the IID case.
Importantly, as the distributed averaging algorithm takes several steps to
converge, this variance persists across iterations as the locally computed
gradients are far from the global average.\footnote{It is possible, but
very costly, to mitigate this by performing a sufficiently large number of
averaging steps between each gradient step.} This can significantly slow
down convergence, to the point of making decentralized optimization
impractical.
%For an intuition on the effect of local class bias, examine the neighbourhood of a single node in a grid similar to that of Figure~\ref{fig:grid-IID-vs-non-IID}. As illustrated in Figure~\ref{fig:grid-iid-vs-non-iid-neighbourhood}, the color of a node, represented as a circle, corresponds to a different class. In the IID setting (Figure~\ref{fig:grid-iid-neighbourhood}), each node has examples of all classes in equal proportions. In the non-IID setting (Figure~\ref{fig:grid-non-iid-neighbourhood}), each node has examples of only a single class and nodes are distributed randomly in the grid. A single training step, from the point of view of the middle node, is equivalent to sampling a mini-batch five times larger from the union of the local distributions of all illustrated nodes.
\begin{figure}[t]
     \centering
     \begin{subfigure}[b]{0.25\textwidth}
         \centering
@@ -409,57 +433,91 @@ The colors of a node, represented as a circle, correspond to the different class
\includegraphics[width=\textwidth]{figures/grid-non-iid-neighbourhood}
         \caption{\label{fig:grid-non-iid-neighbourhood} Non-IID}
     \end{subfigure}
    \caption{Neighborhood in an IID and non-IID Grid.}
    \label{fig:grid-iid-vs-non-iid-neighbourhood}
\end{figure}
In D-Cliques, we address the issue of non-IIDness by carefully designing a
network topology composed of \textit{cliques} and \textit{inter-clique
connections}:
\begin{itemize}
\item D-Cliques recovers a balanced representation of classes, similar to
that of the IID case, by constructing a topology such that each node is
part of a \textit{clique} with neighbors representing all classes.
\item To ensure a global consensus and convergence, \textit{inter-clique
connections} are introduced by connecting a small number of node pairs
that are part of different cliques.
\end{itemize}
In the following, we introduce one inter-clique connection per node, such
that each clique has exactly one edge with each of the other cliques; see
Figure~\ref{fig:d-cliques-figure} for the corresponding D-Cliques network
in the case of $n=100$ nodes and $c=10$ classes. We will explore sparser
inter-clique topologies in Section~\ref{section:interclique-topologies}.
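As an illustration, the following Python sketch builds this
fully-connected-cliques topology; it assumes the simplistic case where
node $i$ holds exactly the class $i \bmod c$, so that consecutive groups
of $c$ nodes form cliques covering all classes (our toy assumption, not a
requirement of Algorithm~\ref{Algorithm:D-Clique-Construction}):

\begin{verbatim}
import itertools

def d_cliques_edges(n, c):
    # Toy assumption: node i holds class i % c, so consecutive groups
    # of c nodes form cliques covering all c classes.
    cliques = [list(range(k, k + c)) for k in range(0, n, c)]
    edges = set()
    # Intra-clique edges: fully connect each clique.
    for clique in cliques:
        edges |= {frozenset(p) for p in itertools.combinations(clique, 2)}
    # Inter-clique edges: one edge between every pair of cliques, spread
    # over distinct nodes so each node gains at most one such edge.
    for (a, ca), (b, cb) in itertools.combinations(enumerate(cliques), 2):
        edges.add(frozenset((ca[b % c], cb[a % c])))
    return edges

print(len(d_cliques_edges(100, 10)))  # 450 intra + 45 inter = 495 edges
\end{verbatim}

With $n=100$ and $c=10$, this yields 495 edges instead of the 4950 of a
fully-connected topology, consistent with the $\approx 90\%$ reduction in
the number of edges reported below.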
The mixing matrix $W$ required by D-SGD is obtained from the above
topology using standard Metropolis-Hastings weights~\cite{xiao2004fast}:
\begin{equation}
  W_{ij} = \begin{cases}
    \frac{1}{\max(\text{degree}(i), \text{degree}(j)) + 1} & \text{if}~i
    \neq j \text{ and } \{i,j\} \in E,\\
    1 - \sum_{j \neq i} W_{ij} & \text{if}~i = j, \\
    0 & \text{otherwise}.
  \end{cases}
\end{equation}
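A direct transcription of these weights in Python (a sketch reusing the
hypothetical \texttt{d\_cliques\_edges} helper from the previous snippet):

\begin{verbatim}
import numpy as np

def metropolis_hastings_weights(n, edges):
    # Build W from an undirected edge set; the result is symmetric and
    # doubly stochastic by construction.
    deg = np.zeros(n, dtype=int)
    for i, j in map(tuple, edges):
        deg[i] += 1
        deg[j] += 1
    W = np.zeros((n, n))
    for i, j in map(tuple, edges):
        W[i, j] = W[j, i] = 1.0 / (max(deg[i], deg[j]) + 1)
    W[np.arange(n), np.arange(n)] = 1.0 - W.sum(axis=1)  # self-weights
    return W

W = metropolis_hastings_weights(100, d_cliques_edges(100, 10))
assert np.allclose(W.sum(axis=0), 1.0) and np.allclose(W.sum(axis=1), 1.0)
\end{verbatim}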
We refer to Algorithm~\ref{Algorithm:D-Clique-Construction} in the
appendix for a formal account of D-Cliques construction. We note that it
only requires the knowledge of the local class distribution at each node.
For the sake of simplicity, we assume that D-Cliques is constructed from
the global knowledge of these distributions, which can easily be obtained
by decentralized averaging in a pre-processing step.
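To illustrate this pre-processing step, the sketch below averages local
class histograms by gossip over a uniform ring overlay (a stand-in
assumption; any connected overlay with a doubly stochastic matrix works,
and sparser expander-like overlays mix much faster than a ring):

\begin{verbatim}
import numpy as np

n, c = 100, 10
# Toy local class counts: node i only holds 200 samples of class i % c.
hist = np.zeros((n, c))
hist[np.arange(n), np.arange(n) % c] = 200.0

# Uniform ring overlay: doubly stochastic and symmetric.
R = np.zeros((n, n))
for i in range(n):
    R[i, [i, (i - 1) % n, (i + 1) % n]] = 1 / 3

for _ in range(10000):  # rings mix slowly; expanders need far fewer rounds
    hist = R @ hist

# Every row now approximates the global class distribution (20 per class).
assert np.allclose(hist, 20.0, atol=1e-2)
\end{verbatim}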
The key idea of D-Cliques is that because the clique-level distribution
$D_{\textit{clique}} = \sum_{i \in \textit{clique}} D_i$ is representative
of the global distribution, the local models of nodes across cliques
remain rather close. Therefore, a sparse inter-clique topology can be
used, significantly reducing the total number of edges without slowing
down the convergence. Furthermore, the degree of each node in the network
remains low and even, making the D-Cliques topology very well-suited to
decentralized federated learning.
%We centrally generate the topology, which is then tested in a custom simulator. We expect our approach should be straightforward to adapt for a decentralized execution: the presence and relative frequency of global classes could be computed using PushSum~\cite{kempe2003gossip}, and neighbours could be selected with PeerSampling~\cite{jelasity2007gossip}.
\begin{figure}[t]
     \centering
     \begin{subfigure}[b]{0.45\textwidth}
         \centering
         \includegraphics[width=\textwidth]{figures/fully-connected-cliques}
         \caption{\label{fig:d-cliques-figure} D-Cliques (fully-connected
         cliques)}
     \end{subfigure}
     \hfill
% To regenerate figure, from results/mnist
% python ../../../learn-topology/tools/plot_convergence.py fully-connected/all/2021-03-10-09:25:19-CET no-init-no-clique-avg/fully-connected-cliques/all/2021-03-12-11:12:49-CET --add-min-max --yaxis test-accuracy --ymin 80 --ymax 92.5 --labels '100 nodes non-IID fully-connected' '100 nodes non-IID d-cliques' --save-figure ../../figures/d-cliques-mnist-vs-fully-connected.png --legend 'lower right' --font-size 16
     \begin{subfigure}[b]{0.54\textwidth}
         \centering
         \includegraphics[width=\textwidth]{figures/d-cliques-mnist-vs-fully-connected.png}
         \caption{\label{fig:d-cliques-example-convergence-speed} Convergence
         Speed on MNIST}
     \end{subfigure}
\caption{\label{fig:d-cliques-example} D-Cliques topology and convergence
speed on MNIST.}
\end{figure}
Figure~\ref{fig:d-cliques-example-convergence-speed} illustrates the
performance of D-Cliques on MNIST with $n=100$ nodes. The convergence
speed is very close to that of a fully-connected topology, and
significantly better than with a ring or a grid (see
Figure~\ref{fig:iid-vs-non-iid-problem}). With 100 nodes, it offers a
reduction of $\approx 90\%$ in the number of edges compared to a
fully-connected topology. Nonetheless, there is still significant
variance in the accuracy across nodes, which we address in the next
section by removing the bias introduced by inter-clique edges.
%The degree of \textit{skew} of local distributions $D_i$, i.e. how much the local distribution deviates from the global distribution on each node, influences the minimal size of cliques.