From dcd6750df04ceeed3ad8acc01f6f65e236a0373f Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Aur=C3=A9lien?= <aurelien.bellet@inria.fr>
Date: Fri, 2 Apr 2021 12:39:46 +0200
Subject: [PATCH] sec 2

---
 main.tex | 154 ++++++++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 119 insertions(+), 35 deletions(-)

diff --git a/main.tex b/main.tex
index 45e69d6..df7d577 100644
--- a/main.tex
+++ b/main.tex
@@ -231,38 +231,56 @@ study in  Section~\ref{section:non-clustered}. We review some related work in
 
 \label{section:problem}
 
-We consider a set of $n$ nodes $N = \{1, \dots, n \}$ where each node can communicate with its neighbours according to the mixing matrix $W$ in which $W_{ij}$ defines the \textit{weight} of the outgoing connection from node $i$ to $j$. $W_{ij} = 0$ means that there is no connection from node $i$ to $j$ and $W_{ij} > 0$ means there is a connection.
+We consider a set of $n$ nodes $N = \{1, \dots, n \}$ seeking to
+collaboratively solve a classification task with $c$ classes.
+% where each node can communicate with its neighbours according to the mixing matrix $W$ in which $W_{ij}$ defines the \textit{weight} of the outgoing connection from node $i$ to $j$. $W_{ij} = 0$ means that there is no connection from node $i$ to $j$ and $W_{ij} > 0$ means there is a connection.
 %AMK:explain the weight
-
 %Training data is sampled from a global distribution $D$ unknown to the nodes.
 %AMK:Removed the sentence above
-We assume that each node has access to an arbitrary partition of the samples that follows its own local distribution $D_i$. Nodes cooperate to% reach consensus
-converge on a global model $M$ that performs well on $D$ by minimizing the average training loss on local models:
-
+Each node has access to a local dataset that
+ follows its own local distribution $D_i$. The goal is to find a global model
+ $x$ that performs well on the union of the local distributions by minimizing
+ the average training loss:
 \begin{equation}
-min_{x_i, i = 1, \dots, n} = \frac{1}{n}\sum_{i=1}^{n} \mathds{E}_{s_i \sim D_i} F_i(x_i;s_i) 
+\min_{x} \frac{1}{n}\sum_{i=1}^{n} \mathds{E}_{s_i \sim D_i} [F_i(x;s_i)],
 \label{eq:dist-optimization-problem}
 \end{equation}
+where $s_i$ is a data sample drawn from $D_i$ and $F_i$ is the loss function
+on node $i$. Thus, $\mathds{E}_{s_i \sim D_i} [F_i(x;s_i)]$ denotes the
+expected loss of model $x$ on a random sample $s_i$ drawn from $D_i$.
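+As a concrete illustration (the cross-entropy loss and the notation
+$s_i = (a_i, y_i)$, $p_l$ below are assumptions made only for this example),
+if a sample $s_i = (a_i, y_i)$ consists of features $a_i$ and a label
+$y_i \in \{1, \dots, c\}$, and the model $x$ maps $a_i$ to class
+probabilities $p_1(x; a_i), \dots, p_c(x; a_i)$, then $F_i$ could be the
+cross-entropy loss
+\[
+F_i(x; s_i) = -\log p_{y_i}(x; a_i),
+\]
+i.e. the negative log-probability that the model assigns to the correct
+class.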
 
-such that $M= x_i = x_j, \forall i,j \in N$, where $x_i$ are the parameters of
-node $i$'s local model, $s_i$ is a sample of $D_i$, $F_i$ is the loss function
-on node $i$, and $\mathds{E}_{s_i \sim D_i} F_i(x_i;s_i)$ denotes  the
-expected value of $F_i$ on a random sample $s_i$ drawn from $D_i$.
+To collaboratively solve Problem \eqref{eq:dist-optimization-problem}, each
+node can exchange messages with its neighbors in an undirected network graph
+$G(N,E)$ where $\{i,j\}\in E$ denotes an edge (communication channel)
+between nodes $i$ and $j$.
 
-\subsection{Learning Algorithm}
+\subsection{Training Algorithm}
 %AMK: if we need space this could be a paragraph
 
-We use the Decentralized-Parallel Stochastic Gradient Descent, aka D-PSGD~\cite{lian2017d-psgd}, illustrated in Algorithm~\ref{Algorithm:D-PSGD}.
+In this work, we use the popular Decentralized Stochastic Gradient Descent
+algorithm, also known as D-SGD~\cite{lian2017d-psgd}. As
+shown in Algorithm~\ref{Algorithm:D-PSGD},
 %AMK: can we say why: most popular, most efficient ?
-A single step consists of sampling the local distribution $D_i$, computing and applying a stochastic gradient descent (SGD) step on that sample, and averaging the model with its neighbours. Both outgoing and incoming weights of $W$ must sum to 1, i.e. $W$  is doubly stochastic ($\sum_{j \in N} W_{ij} = 1$ and $\sum_{j \in N} W_{ji} = 1$), and communication is symmetric, i.e. $W_{ij} = W_{ji}$.
-
-\begin{algorithm}[h]
-   \caption{D-PSGD, Node $i$}
+a single step of D-SGD at node $i$ consists of sampling a mini-batch
+from its local distribution $D_i$, taking a stochastic gradient descent
+(SGD) step according to this sample, and performing a weighted average of
+its model with those of its neighbors. This weighted average is defined by a
+mixing matrix $W$, in which $W_{ij}$ corresponds to the weight of
+the outgoing connection from node $i$ to $j$, with $W_{ij} = 0$ for
+$\{i,j\}\notin E$. To ensure that D-SGD converges to a (local) optimum, $W$
+must be doubly stochastic ($\sum_{j \in N} W_{ij} = 1$ and
+$\sum_{j \in N} W_{ji} = 1$) and symmetric, i.e. $W_{ij} = W_{ji}$.
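+For illustration only (this specific topology and choice of weights are an
+example, not a prescription), a ring of $n=4$ nodes in which each node keeps
+weight $\frac{1}{2}$ for itself and assigns weight $\frac{1}{4}$ to each of
+its two neighbors yields the mixing matrix
+\[
+W = \left(\begin{array}{cccc}
+\frac{1}{2} & \frac{1}{4} & 0 & \frac{1}{4} \\
+\frac{1}{4} & \frac{1}{2} & \frac{1}{4} & 0 \\
+0 & \frac{1}{4} & \frac{1}{2} & \frac{1}{4} \\
+\frac{1}{4} & 0 & \frac{1}{4} & \frac{1}{2}
+\end{array}\right),
+\]
+which is symmetric and whose rows and columns each sum to $1$.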
+
+\begin{algorithm}[t]
+   \caption{D-SGD, Node $i$}
    \label{Algorithm:D-PSGD}
    \begin{algorithmic}[1]
-        \State \textbf{Require} initial model parameters $x_i^{(0)}$, learning rate $\gamma$, mixing weights $W$, number of steps $K$, loss function $F$
+        \State \textbf{Require:} initial model parameters $x_i^{(0)}$,
+        learning rate $\gamma$, mixing weights $W$, number of steps $K$,
+        loss function $F$
         \For{$k = 1,\ldots, K$}
-          \State $s_i^{(k)} \gets \textit{sample from~} D_i$
+          \State $s_i^{(k)} \gets \text{(mini-batch) sample from~} D_i$
           \State $x_i^{(k-\frac{1}{2})} \gets x_i^{(k-1)} - \gamma \nabla F(x_i^{(k-1)}; s_i^{(k)})$ 
           \State $x_i^{(k)} \gets \sum_{j \in N} W_{ji}^{(k)} x_j^{(k-\frac{1}{2})}$
         \EndFor
@@ -271,38 +289,104 @@ A single step consists of sampling the local distribution $D_i$, computing and a
 
 \subsection{Methodology}
 
-\subsubsection{Non-IID Assumptions}
+\subsubsection{Non-IID assumptions.}
 \label{section:non-iid-assumptions}
 
-As demonstrated in Figure~\ref{fig:iid-vs-non-iid-problem}, removing the assumption of IID data significantly challenges the learning algorithm. In this paper, we focus on an \textit{extreme case of local class bias}, and consider each node to have only samples
+As demonstrated in Figure~\ref{fig:iid-vs-non-iid-problem}, lifting the
+assumption of IID data significantly challenges the learning algorithm. In
+this paper, we focus on an \textit{extreme case of local class bias}: we
+consider that each node only has samples
 %examples
-of a single class. Our results should generalize to lesser, and more frequent, cases.
+from a single class.
+% Our results should generalize to lesser, and more
+% frequent, cases.
 %AMK: a bit weak can't we say our results generalize....
-
 %: e.g., if some classes are globally less represented, the position of the nodes with the rarest classes will be significant; and if two local datasets have different number of examples, the examples in the smaller dataset may be visited more often than those in a larger dataset, skewing the optimization process. 
 
-To isolate the effect of local class bias from other potentially compounding factors, we make the following assumptions: (1) All classes are equally represented in the global dataset, by randomly removing examples from the larger classes if necessary; (2) All classes are represented on the same number of nodes; (3) All nodes have the same number of examples. 
-
-These assumptions are reasonable because: (1) Global class imbalance equally affects the optimization process on a single node and is therefore not specific to a decentralized setting; (2) Our results do not exploit specific positions in the topology;  (3) Nodes with less examples could simply skip some rounds until the nodes with more examples catch up. Our results can therefore be extended to support additional compounding factors in future work.
-
-\subsubsection{Experimental Settings}
+To isolate the effect of local class bias from other potentially compounding
+factors, we make the following simplifying assumptions: (1) All classes are
+equally represented in the global dataset; (2) All classes are represented on the same number of nodes; (3) All nodes have the same number of examples.
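+For instance, with the $c=10$ classes and $n=100$ nodes used in our
+experiments, assumptions (2) and (3) imply that each class is held by
+exactly 10 nodes and that every node holds the same number of single-class
+examples.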
+
+We believe that these assumptions are reasonable in the context of our study
+because: (1) Global class imbalance equally affects the optimization process
+on a single node and is therefore not specific to the decentralized setting;
+(2) Our results do not exploit specific positions in the topology; (3)
+Imbalanced dataset sizes across nodes can be addressed, for instance, by
+appropriately weighting the individual loss functions.
+% with less examples could
+% simply skip some rounds until the nodes with more examples catch up.
+Our results can be extended to support additional compounding factors in future work.
+
+\subsubsection{Experimental setup.}
 \label{section:experimental-settings}
 %AMK: j'aurais mis ca dans la section eval car je n'aurais pas mélangé design et eval.
 
-We focus on fairly comparing the convergence speed of different topologies and algorithm variations, to show that our approach can remove much of the effect of local class bias. 
+Our main goal is to provide a fair comparison of the convergence speed of
+different topologies and algorithmic variations, to show that our approach
+can remove much of the effect of local class bias.
+
+We experiment with two datasets: MNIST~\cite{mnistWebsite} and
+CIFAR10~\cite{krizhevsky2009learning}, which both have $c=10$ classes.
+For MNIST, we use 45k and 10k samples from the original 60k
+training set for training and validation respectively. The remaining 5k
+training samples are randomly removed so that all 10 classes are balanced
+and the dataset is evenly divisible across 100 and 1000 nodes.
+We use all 10k examples of
+the test set to measure test accuracy. For CIFAR10, classes are evenly
+balanced: we use 45k/50k images of the original training set for training,
+5k/50k for validation, and all 10k examples of the test set for measuring
+prediction accuracy.
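+To make the arithmetic of this split explicit, the 45k MNIST training
+examples correspond to 4.5k examples per class, a number that is divisible
+by both 10 and 100, so that each node holds 450 examples with 100 nodes and
+45 with 1000 nodes.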
+
+We use a logistic regression classifier for MNIST, which provides up to
+92.5\% accuracy in the centralized setting.
+% compared to
+% $99\%$ for the state-of-the-art~\cite{mnistWebsite}.
+For CIFAR10, we use a Group-Normalized variant of LeNet~\cite{quagmire}, a
+deep convolutional network which achieves an accuracy of $72.3\%$ in the
+centralized setting.
+% compared to the 99\% achieved by start-of-the-art.
+These models are thus reasonably accurate (which is sufficient to
+study the effect of the topology) while being sufficiently fast to train in a
+fully decentralized setting, and are simple enough to configure and analyze.
+Regarding hyper-parameters, we jointly optimized the learning rate and
+mini-batch size on the
+validation set for 100 nodes, obtaining respectively $0.1$ and $128$ for
+MNIST and $0.002$ and $20$ for CIFAR10.
+For CIFAR10, we additionally use a momentum of $0.9$.
+
+We evaluate 100- and 1000-node networks by creating multiple models in memory and simulating the exchange of messages between nodes.
+To abstract away the impact of distributed execution strategies and system
+optimization techniques, we report the test accuracy of all nodes (min, max,
+average) as a function of the number of times each example of the dataset has
+been sampled by a node, i.e. an \textit{epoch}. This is equivalent to the classic case of a single node sampling the full distribution.
+To further make results comparable across different numbers of nodes, we
+scale the batch size inversely with the number of nodes, e.g. on MNIST, 12800
+with 1 node, 128 with 100 nodes, and 13 with 1000 nodes. This ensures the
+same number of model updates and averaging steps per epoch, which is
+important for a fair comparison.\footnote{Updating and averaging models
+after every example can eliminate the impact of local class bias. However, the
+resulting communication overhead is impractical.}
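+Concretely, writing $b_1$ for the single-node reference batch size (notation
+used only here), the batch size with $n$ nodes is $b_n \approx b_1 / n$
+rounded to an integer, so that $n \times b_n \approx b_1$ examples are
+processed per round of updates and averaging whatever the number of nodes:
+on MNIST, $12800/100 = 128$ and $12800/1000 = 12.8 \approx 13$.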
+
+Finally, we compare our results against an ideal baseline: either a
+fully-connected network topology with the same number of nodes or a single
+IID node. In both cases, the topology has no effect on the optimization,
+and for a given number of nodes and an appropriately scaled mini-batch size
+the two baselines are equivalent. %ensure a single
+% model is optimized, which therefore removes the effect of the topology. While, both approaches compute an equivalent gradient with the same expectation, we favored  using a single IID node for CIFAR10 for the sake of training speed.
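+To see why, consider the fully-connected case under the simplifying
+assumption of uniform weights $W_{ij} = \frac{1}{n}$: if all nodes start
+step $k$ from the same model $x^{(k-1)}$, the averaging step of
+Algorithm~\ref{Algorithm:D-PSGD} gives every node
+\[
+x^{(k)} = x^{(k-1)} - \gamma \, \frac{1}{n} \sum_{i=1}^{n} \nabla F\big(x^{(k-1)}; s_i^{(k)}\big),
+\]
+i.e. an SGD step on the average of $n$ single-class mini-batch gradients,
+which has the same expectation as the stochastic gradient computed by a
+single IID node with a mini-batch $n$ times larger.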
+
+
+
+
 
-To remove the impact of distributed execution strategies and system optimization techniques, we report the prediction accuracy of all nodes (min, max, average) as a function of the number of times each example of the dataset has been sampled by one and only one node, i.e. an \textit{epoch}. This is equivalent to the classic case of a single node sampling the full distribution. We report accuracy in percentage of examples, not used for training, that are correctly classified.
 
-To ensure results generalize to multiple datasets, we test with both MNIST~\cite{mnistWebsite} and CIFAR10~\cite{krizhevsky2009learning}, both with 10 classes. We evaluate a linear (regression) model on MNIST, which provides up to 92.5\% percent accuracy in the best case compared to $99\%$ for the state-of-the-art~\cite{mnistWebsite}. For CIFAR10, we use a Group-Normalized variation of LeNet~\cite{quagmire}, a deep convolutional network, which achieves an accuracy of $72.3\%$ on a single IID node, compared to the 99\% achieved by start-of-the-art. In both cases, the resulting models are reasonably accurate, which is sufficient to study the effect of the topology, while being relatively quick to train and simple to configure and analyze.
 
-We have jointly optimized the learning rate and minibatch size for 100 nodes, respectively obtaining $0.1$ and $128$ for MNIST and $0.002$ and $20$ for CIFAR10. For MNIST, we use 45k/60k examples of the original training set for training, 10k/60k exemples to select the best hyper-parameters, and all 10k examples of the test set to measure prediction accuracy. The remaining 5k/60k training examples are randomly removed to ensure all 10 classes have no more examples than the smallest class, while ensuring the dataset is evenly divisible between 100 and 1000 nodes. For CIFAR10, classes are evenly balanced: we use 45k/50k images of the original training set for training, 5k/50k to optimize hyper-parameters, and all 10k examples of the test set for measuring prediction accuracy. 
 
-To make results comparable between different number of nodes, we lower the batch size proportionally to the number of nodes added, and inversely, e.g. on MNIST, 12800 with 1 node, 128 with 100 nodes, 13 with 1000 nodes. This ensures the same number of model updates per epoch, as this can have a stronger effect than other changes we are studying.\footnote{Updating models after every example can also eliminate the impact of local class bias. However, the resulting communication overhead is impractical.}
 
-For CIFAR10, we additionally use a momentum of $0.9$, but we do not use momentum on MNIST as it has limited impact on a linear model.
 
-We evaluate 100- and 1000-node networks by creating multiple models in memory and simulating the exchange of messages between nodes. We compare our results against  either  a fully-connected network topology with the same number of nodes or a single IID node. In both approaches, the topology has no effect on the optimization.  %ensure a single model is optimized, which therefore removes the effect of the topology.
-While, both approaches compute an equivalent gradient with the same expectation, we favored  using a single IID node for CIFAR10 for the sake of training speed.
 
 \section{D-Cliques: Creating Locally Representative Cliques}
 \label{section:d-cliques}
-- 
GitLab