@@ -235,7 +235,7 @@ study in Section~\ref{section:non-clustered}. We review some related work in
\label{section:problem}
We consider a set $N = \{1, \dots, n\}$ of $n$ nodes seeking to
collaboratively solve a classification task with $c$ classes.
% where each node can communicate with its neighbours according to the mixing matrix $W$ in which $W_{ij}$ defines the \textit{weight} of the outgoing connection from node $i$ to $j$. $W_{ij} = 0$ means that there is no connection from node $i$ to $j$ and $W_{ij} > 0$ means there is a connection.
%AMK:explain the weight
...
@@ -276,12 +276,13 @@ This weighted average is defined by a
mixing matrix $W$, in which $W_{ij}$ corresponds to the weight of
the outgoing connection from node $i$ to $j$, with $W_{ij}=0$ for
$\{i,j\}\notin E$. To ensure that the local models converge on average
to a stationary point of Problem~\eqref{eq:dist-optimization-problem},
$W$ must be doubly stochastic ($\sum_{j \in N} W_{ij}=1$ and
$\sum_{j \in N} W_{ji}=1$) and symmetric, i.e.\ $W_{ij}=
W_{ji}$~\cite{lian2017d-psgd}.
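The text does not specify how $W$ is built from the graph; one standard construction satisfying both properties (an illustrative assumption, not necessarily the one used here) is the Metropolis-Hastings rule, sketched below on a small ring topology:

```python
import numpy as np

def metropolis_weights(adj):
    """Symmetric, doubly stochastic mixing matrix from an adjacency
    matrix via the Metropolis-Hastings rule (illustrative choice)."""
    n = adj.shape[0]
    deg = adj.sum(axis=1)
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j and adj[i, j]:
                W[i, j] = 1.0 / (1 + max(deg[i], deg[j]))
        W[i, i] = 1.0 - W[i].sum()  # self-weight absorbs the remainder
    return W

# 6-node ring: each node connected to its two neighbours
n = 6
adj = np.zeros((n, n), dtype=int)
for i in range(n):
    adj[i, (i + 1) % n] = adj[(i + 1) % n, i] = 1

W = metropolis_weights(adj)
assert np.allclose(W.sum(axis=0), 1)  # column sums: doubly stochastic
assert np.allclose(W.sum(axis=1), 1)  # row sums
assert np.allclose(W, W.T)            # symmetric
```

Any connected topology works the same way: only the adjacency matrix changes, and the resulting $W$ keeps the row/column-sum and symmetry properties required for convergence.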
\begin{algorithm}[t]
\caption{D-SGD, Node $i$}
...
@@ -349,7 +350,7 @@ training set for training and validation respectively. The remaining 5k
training examples were randomly removed to ensure all 10 classes are
balanced and the dataset is evenly divisible across 100 and 1000 nodes.
We use all 10k examples of the test set to measure prediction accuracy.
For CIFAR10, classes are evenly balanced: we use 45k/50k images of the
original training set for training, 5k/50k for validation, and all 10k
examples of the test set for measuring prediction accuracy.
...
@@ -365,8 +366,8 @@ centralized setting.
% compared to the 99\% achieved by state-of-the-art.
These models are thus reasonably accurate (which is sufficient to
study the effect of the topology) while being sufficiently fast to
train in a fully decentralized setting and simple enough to configure
and analyze. Regarding hyper-parameters, we jointly optimize the
learning rate and mini-batch size on the validation set for 100 nodes,
obtaining respectively $0.1$ and $128$ for MNIST and $0.002$ and $20$
for CIFAR10.
...
@@ -379,15 +380,15 @@ average) as a function of the number of times each example of the dataset has
been sampled by a node, i.e.\ an \textit{epoch}. This is equivalent to
the classic case of a single node sampling the full distribution.
To further make results comparable across different numbers of nodes,
we lower the batch size proportionally to the number of nodes added,
and inversely, e.g.\ on MNIST, 128 with 100 nodes vs.\ 13 with 1000
nodes. This ensures the same number of model updates and averaging
steps per epoch, which is important for a fair
comparison.\footnote{Updating and averaging models after every example
can eliminate the impact of local class bias. However, the resulting
communication overhead is impractical.}
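The scaling rule can be sketched as follows; the single-node global batch of 12800 is implied by proportional scaling from 128 at 100 nodes, and the rounding-up to a whole number of examples per node is an assumption consistent with the reported 13 at 1000 nodes:

```python
import math

def per_node_batch_size(global_batch, n_nodes):
    # Split the single-node batch across nodes, rounding up so each
    # node processes a whole number of examples per step
    # (assumed rounding rule, consistent with the reported values).
    return math.ceil(global_batch / n_nodes)

# MNIST setup: 128 per node with 100 nodes, 13 per node with 1000 nodes
assert per_node_batch_size(12800, 100) == 128
assert per_node_batch_size(12800, 1000) == 13
```

With the batch size scaled this way, every configuration performs the same number of update-and-average rounds per epoch, so differences in accuracy can be attributed to the topology rather than to the optimization schedule.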
Finally, we compare our results against an ideal baseline: either a
fully-connected network topology with the same number of nodes or a
single IID node. In both cases, the topology has no effect on the
optimization. For a certain choice of number of nodes and mini-batch
size, both approaches are equivalent. %ensure a single
% model is optimized, which therefore removes the effect of the topology. While, both approaches compute an equivalent gradient with the same expectation, we favored using a single IID node for CIFAR10 for the sake of training speed.
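The equivalence of the two baselines can be seen with a toy check: averaging the mini-batch gradients of $n$ fully-connected nodes, each holding $b$ examples, gives exactly the gradient a single node would compute over one batch of $n \cdot b$ examples (the random per-example "gradients" below are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy per-example "gradients": 1000 examples, 5 parameters each.
grads = rng.normal(size=(1000, 5))

n_nodes, per_node = 100, 10  # 100 nodes, mini-batch of 10 each
shards = grads[:n_nodes * per_node].reshape(n_nodes, per_node, -1)

# Fully-connected topology: average the per-node mini-batch gradients.
fc_update = shards.mean(axis=1).mean(axis=0)
# Single node computing one large batch of n_nodes * per_node examples.
single_update = grads[:n_nodes * per_node].mean(axis=0)

assert np.allclose(fc_update, single_update)
```

Because the mean of equal-sized per-node means equals the overall mean, the two baselines produce identical updates for matching node counts and mini-batch sizes, which is why either can serve as the topology-free reference.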