\section{Problem Statement}
\label{section:problem}
We consider a set of $n$ nodes $N =\{1, \dots, n \}$ seeking to
collaboratively solve a classification task with $c$ classes.
Each node has access to a local dataset that
follows its own local distribution $D_i$. The goal is to find a global model
$x$ that performs well on the union of the local distributions by minimizing
the average training loss:
\begin{equation}
  \min_{x} \frac{1}{n}\sum_{i=1}^{n} \mathds{E}_{s_i \sim D_i} F_i(x;s_i),
  \label{eq:dist-optimization-problem}
\end{equation}
where $s_i$ is a data sample of $D_i$ and $F_i$ is the loss function
on node $i$. Therefore, $\mathds{E}_{s_i \sim D_i} F_i(x;s_i)$ denotes the
expected loss of model $x$ on a random sample $s_i$ drawn from $D_i$.
To collaboratively solve Problem \eqref{eq:dist-optimization-problem}, each
node can exchange messages with its neighbors in an undirected network graph
$G(N,E)$ where $\{i,j\}\in E$ denotes an edge (communication channel)
between nodes $i$ and $j$.
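For concreteness, the following Python sketch (illustrative only, not part of
our implementation) evaluates the objective of
Problem~\eqref{eq:dist-optimization-problem} as the average of empirical local
losses; the squared loss and the toy data layout are placeholder assumptions.
\begin{verbatim}
# Illustrative sketch: the average-of-local-losses objective.
import numpy as np

def local_loss(x, X_i, y_i):
    # Empirical stand-in for E_{s_i ~ D_i} F_i(x; s_i) on node i,
    # using a squared loss as a placeholder.
    return np.mean((X_i @ x - y_i) ** 2)

def global_objective(x, node_data):
    # Average of the local losses over the n nodes.
    return np.mean([local_loss(x, X_i, y_i)
                    for (X_i, y_i) in node_data])
\end{verbatim}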
\subsection{Training Algorithm}
In this work, we use the popular Decentralized Stochastic Gradient Descent
algorithm, aka D-SGD~\cite{lian2017d-psgd}. As shown in
Algorithm~\ref{Algorithm:D-PSGD}, a single step of D-SGD at node $i$ consists
of sampling a mini-batch from its local distribution $D_i$, taking a
stochastic gradient descent (SGD) step according to this sample, and
performing a weighted average of its model with those of its neighbors.
This weighted average is defined by a mixing matrix $W$, in which $W_{ij}$
corresponds to the weight of the outgoing connection from node $i$ to $j$,
and $W_{ij}=0$ for $\{i,j\}\notin E$. To ensure that D-SGD converges to a
(local) optimum, $W$ must be doubly stochastic ($\sum_{j \in N} W_{ij}=1$ and
$\sum_{j \in N} W_{ji}=1$) and symmetric, i.e. $W_{ij}= W_{ji}$.
\begin{algorithm}[t]
\caption{D-SGD, Node $i$}
\label{Algorithm:D-PSGD}
\begin{algorithmic}[1]
\State\textbf{Require:} initial model parameters $x_i^{(0)}$,
learning rate $\gamma$, mixing weights $W$, number of steps $K$
\For{$k = 1, \ldots, K$}
  \State $s_i^{(k)} \gets$ mini-batch sampled from $D_i$
  \State $x_i^{(k-\frac{1}{2})} \gets x_i^{(k-1)} - \gamma \nabla F_i(x_i^{(k-1)}; s_i^{(k)})$ \Comment{local SGD step}
  \State $x_i^{(k)} \gets \sum_{j \in N} W_{ji}\, x_j^{(k-\frac{1}{2})}$ \Comment{average with neighbors}
\EndFor
\end{algorithmic}
\end{algorithm}
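To make the mixing step concrete, the sketch below (illustrative only, not
our actual implementation) performs one D-SGD round over a simulated network.
It uses Metropolis-Hastings weights, one standard way to obtain a symmetric,
doubly stochastic $W$ from an undirected graph; the \texttt{gradients} and
\texttt{batches} arguments are hypothetical placeholders.
\begin{verbatim}
# Illustrative sketch of one D-SGD round.
import numpy as np

def metropolis_hastings_weights(adj):
    # Symmetric, doubly stochastic W built from the adjacency
    # matrix (no self-loops) of an undirected graph.
    n = adj.shape[0]
    deg = adj.sum(axis=1)
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if adj[i, j]:
                W[i, j] = 1.0 / (1.0 + max(deg[i], deg[j]))
        W[i, i] = 1.0 - W[i].sum()
    return W

def dsgd_round(models, W, gradients, batches, lr):
    # Local SGD step on each node, then weighted averaging
    # with the neighbors (x_i <- sum_j W_ji x_j).
    stepped = [x - lr * gradients[i](x, batches[i])
               for i, x in enumerate(models)]
    n = len(models)
    return [sum(W[j, i] * stepped[j] for j in range(n))
            for i in range(n)]
\end{verbatim}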
\subsection{Methodology}
\subsubsection{Non-IID assumptions.}
\label{section:non-iid-assumptions}
As demonstrated in Figure~\ref{fig:iid-vs-non-iid-problem}, lifting the
assumption of IID data significantly challenges the learning algorithm. In
this paper, we focus on an \textit{extreme case of local class bias}: we
consider that each node only has samples from a single class.
To isolate the effect of local class bias from other potentially compounding
factors, we make the following simplifying assumptions: (1) All classes are
equally represented in the global dataset; (2) All classes are represented on the same number of nodes; (3) All nodes have the same number of examples.
We believe that these assumptions are reasonable in the context of our study
because: (1)
Global class
imbalance equally
affects the optimization process on a single node and is therefore not
specific to the decentralized setting; (2) Our results do not exploit specific
positions in the topology; (3) Imbalanced dataset sizes across nodes can be
addressed for instance by appropriately weighting the individual loss
functions.
Our results can be extended to support additional compounding factors in future work.
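To illustrate these assumptions, the following sketch (a hypothetical helper,
not our actual partitioning code) assigns each node examples from a single
class while keeping classes globally balanced, represented on the same number
of nodes, and local dataset sizes equal.
\begin{verbatim}
# Illustrative single-class partition under assumptions (1)-(3).
import numpy as np

def single_class_partition(labels, n_nodes, n_classes, seed=0):
    rng = np.random.default_rng(seed)
    nodes_per_class = n_nodes // n_classes   # (2) same #nodes per class
    per_class = [np.flatnonzero(labels == c) for c in range(n_classes)]
    # (1) and (3): truncate every class to the same size, evenly
    # divisible among its nodes.
    per_node = min(len(idx) for idx in per_class) // nodes_per_class
    partition = []
    for idx in per_class:
        idx = rng.permutation(idx)[:per_node * nodes_per_class]
        partition.extend(np.split(idx, nodes_per_class))
    return partition  # n_nodes index arrays, one class each
\end{verbatim}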
\subsubsection{Experimental setup.}
\label{section:experimental-settings}
Our main goal is to provide a fair comparison of the convergence speed of
different topologies and algorithmic variations, to show that our approach
can remove much of the effect of local class bias.
We experiment with two datasets: MNIST~\cite{mnistWebsite} and
CIFAR10~\cite{krizhevsky2009learning}, which both have $c=10$ classes.
For MNIST, we use 45k and 10k samples from the original 60k
training set for training and validation, respectively. The remaining 5k
training samples are randomly removed so that all 10 classes are balanced
while the dataset remains evenly divisible among both 100 and 1000 nodes.
We use all 10k examples of the test set to measure test accuracy. For
CIFAR10, classes are already evenly balanced: we use 45k/50k images of the
original training set for training, 5k/50k for validation, and all 10k
examples of the test set to measure test accuracy.
We use a logistic regression classifier for MNIST, which achieves up to
92.5\% accuracy in the centralized setting.
For CIFAR10, we use a Group-Normalized variant of LeNet~\cite{quagmire}, a
deep convolutional network that achieves an accuracy of $72.3\%$ in the
centralized setting.
These models are thus reasonably accurate (which is sufficient to
study the effect of the topology) while being sufficiently fast to train in a
fully decentralized setting, and are simple enough to configure and analyze.
Regarding hyper-parameters, we jointly optimized the learning rate and
mini-batch size on the
validation set for 100 nodes, obtaining respectively $0.1$ and $128$ for
MNIST and $0.002$ and $20$ for CIFAR10.
For CIFAR10, we additionally use a momentum of $0.9$.
We evaluate 100- and 1000-node networks by creating multiple models in memory and simulating the exchange of messages between nodes.
To abstract away the impact of distributed execution strategies and
system-level optimizations, we report the test accuracy of all nodes (min,
max, average) as a function of the number of times each example of the
dataset has been sampled by a node, i.e. an \textit{epoch}. This is
equivalent to the classic case of a single node sampling the full
distribution.
To further make results comparable across different numbers of nodes, we
scale the mini-batch size inversely with the number of nodes, e.g. on MNIST,
12800 samples with 1 node, 128 with 100 nodes, and 13 with 1000 nodes. This
ensures the same number of model updates and averaging steps per epoch, which
is important for a fair comparison.\footnote{Updating and averaging models
after every example can eliminate the impact of local class bias. However, the
resulting communication overhead is impractical.}
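This scaling rule amounts to the small helper below (an illustrative sketch;
the reference global mini-batch size of 12800 is the MNIST value quoted
above).
\begin{verbatim}
import math

def per_node_batch_size(global_batch, n_nodes):
    # Keep the total number of samples per update round constant,
    # so every configuration performs the same number of model
    # updates and averaging steps per epoch.
    return max(1, math.ceil(global_batch / n_nodes))

# MNIST: 12800 -> 12800 (1 node), 128 (100 nodes), 13 (1000 nodes)
\end{verbatim}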
Finally, we compare our results against an ideal baseline: either a
fully-connected network topology with the same number of nodes or a single IID
node. In both cases, the topology has no effect on the optimization, and for
matching numbers of nodes and mini-batch sizes the two baselines compute
equivalent gradients with the same expectation; we therefore use a single IID
node for CIFAR10 for the sake of training speed.
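To see why, assume all nodes start from the same parameters $x^{(k)}$ and use
uniform weights $W_{ij} = \frac{1}{n}$ in a fully-connected topology. One
D-SGD step then yields, on every node,
\begin{equation*}
x^{(k+1)} = \frac{1}{n}\sum_{j=1}^{n}\left(x^{(k)} - \gamma \nabla F_j(x^{(k)}; s_j^{(k)})\right) = x^{(k)} - \gamma\,\frac{1}{n}\sum_{j=1}^{n} \nabla F_j(x^{(k)}; s_j^{(k)}),
\end{equation*}
i.e. a single SGD step on the union of the $n$ local mini-batches, which
matches in expectation the gradient computed by a single IID node with an
$n$-times larger mini-batch.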