Commit d41ae892 authored by aurelien.bellet

typo sec 2

parent b8f98bcc
@@ -235,7 +235,7 @@ study in Section~\ref{section:non-clustered}. We review some related work in
 \label{section:problem}
-We consider a set of $n$ nodes $N = \{1, \dots, n \}$ seeking to
+We consider a set $N = \{1, \dots, n \}$ of $n$ nodes seeking to
 collaboratively solve a classification task with $c$ classes.
 % where each node can communicate with its neighbours according to the mixing matrix $W$ in which $W_{ij}$ defines the \textit{weight} of the outgoing connection from node $i$ to $j$. $W_{ij} = 0$ means that there is no connection from node $i$ to $j$ and $W_{ij} > 0$ means there is a connection.
 %AMK:explain the weight
@@ -276,12 +276,13 @@ This weighted average is defined by a
 mixing matrix $W$, in which $W_{ij}$ corresponds to the weight of
 the outgoing connection from node $i$ to $j$ and $W_{ij} = 0$ for $
 \{i,j\}\notin
-E$. To ensure that the local models converge on average to a (local) optimum
+E$. To ensure that the local models converge on average to a stationary
+point
 of Problem
 \eqref{eq:dist-optimization-problem}, $W$
 must be doubly
 stochastic ($\sum_{j \in N} W_{ij} = 1$ and $\sum_{j \in N} W_{ji} = 1$) and
-symmetric, i.e. $W_{ij} = W_{ji}$, see \cite{lian2017d-psgd}.
+symmetric, i.e. $W_{ij} = W_{ji}$~\cite{lian2017d-psgd}.
 \begin{algorithm}[t]
 \caption{D-SGD, Node $i$}
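To make the doubly stochastic constraint and one D-SGD round concrete (the algorithm listing itself is truncated in this diff view), here is a minimal numpy sketch. The ring topology, the Metropolis-Hastings weight rule, and the toy gradient are illustrative assumptions, not necessarily the construction used in the paper.

```python
# Illustrative sketch: a symmetric, doubly stochastic W on a ring,
# then one D-SGD round (local SGD step + neighborhood averaging).
import numpy as np

n, d = 8, 5                              # nodes, model dimension (toy sizes)
neighbors = {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}  # ring topology

# Metropolis-Hastings weights: one standard way to satisfy W = W^T
# with rows and columns summing to 1 (assumption, not the paper's choice).
W = np.zeros((n, n))
for i in range(n):
    for j in neighbors[i]:
        W[i, j] = 1.0 / (1 + max(len(neighbors[i]), len(neighbors[j])))
    W[i, i] = 1.0 - W[i].sum()           # self-weight absorbs the remainder

assert np.allclose(W, W.T)               # symmetric
assert np.allclose(W.sum(axis=1), 1.0)   # row-stochastic
assert np.allclose(W.sum(axis=0), 1.0)   # column-stochastic

# One D-SGD round: each node takes a local mini-batch gradient step,
# then averages its model with its neighbors' models according to W.
rng = np.random.default_rng(0)
x = rng.normal(size=(n, d))              # one local model per node
grad = x                                 # placeholder: gradient of 0.5*||x||^2
lr = 0.1
x = W @ (x - lr * grad)                  # row i: sum_j W_ij * intermediate model j
```

On a ring every node has degree 2, so each neighbor weight is 1/3 and the self-weight is also 1/3; any other symmetric, doubly stochastic choice satisfies the same constraints.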
@@ -349,7 +350,7 @@ training set for training and validation respectively. The remaining 5k
 training examples were randomly removed to keep all 10 classes balanced
 while ensuring the dataset is evenly divisible across 100 and 1000 nodes.
 We use all 10k examples of
-the test set to measure test accuracy. For CIFAR10, classes are evenly
+the test set to measure prediction accuracy. For CIFAR10, classes are evenly
 balanced: we use 45k/50k images of the original training set for training,
 5k/50k for validation, and all 10k examples of the test set for measuring
 prediction accuracy.
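As a quick sanity check on the quoted sizes (illustrative values only): keeping 4.5k examples per class gives a 45k training set that divides evenly across both 100 and 1000 nodes.

```python
# Sanity check on the quoted split sizes (illustrative only).
train_size, num_classes = 45_000, 10
per_class = train_size // num_classes        # 4,500 examples per class
for nodes in (100, 1000):
    assert train_size % nodes == 0           # evenly divisible
    print(f"{nodes} nodes -> {train_size // nodes} examples per node")
```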
@@ -365,8 +366,8 @@ centralized setting.
 % compared to the 99\% achieved by state-of-the-art.
 These models are thus reasonably accurate (which is sufficient to
 study the effect of the topology) while being sufficiently fast to train in a
-fully decentralized setting, and are simple enough to configure and analyze.
-Regarding hyper-parameters, we jointly optimized the learning rate and
+fully decentralized setting and simple enough to configure and analyze.
+Regarding hyper-parameters, we jointly optimize the learning rate and
 mini-batch size on the
 validation set for 100 nodes, obtaining respectively $0.1$ and $128$ for
 MNIST and $0.002$ and $20$ for CIFAR10.
@@ -379,15 +380,15 @@ average) as a function of the number of times each example of the dataset has
 been sampled by a node, i.e. an \textit{epoch}. This is equivalent to the classic case of a single node sampling the full distribution.
 To further make results comparable across different numbers of nodes, we lower
 the batch size proportionally to the number of nodes added, and conversely,
-e.g. on MNIST, 12800 with 1 node, 128 with 100 nodes, 13 with 1000 nodes. This
-ensures the same number of model updates and averaging per epoch, which is is
+e.g. on MNIST, 128 with 100 nodes vs. 13 with 1000 nodes. This
+ensures the same number of model updates and averaging per epoch, which is
 important to have a fair comparison.\footnote{Updating and averaging models
 after every example can eliminate the impact of local class bias. However, the
 resulting communication overhead is impractical.}
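A small check of this inverse scaling with the MNIST numbers above (the sizes are the quoted ones; the computation is only illustrative): the number of parallel update-and-average rounds per epoch stays roughly constant.

```python
# Illustrative check that inverse batch-size scaling keeps the number of
# parallel update/averaging rounds per epoch roughly constant (MNIST sizes).
train_size = 45_000
for nodes, batch in ((100, 128), (1000, 13)):
    samples_per_node = train_size // nodes
    rounds = samples_per_node / batch    # all nodes step in parallel each round
    print(f"{nodes} nodes, batch {batch}: {rounds:.2f} rounds/epoch")
# -> roughly 3.5 rounds/epoch in both cases
```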
 Finally, we compare our results against an ideal baseline: either a
 fully-connected network topology with the same number of nodes or a single IID
-node. In both approaches, the topology has no effect on
+node. In both cases, the topology has no effect on
 the optimization. For a certain choice of number of nodes and
 mini-batch size, both approaches are equivalent. %ensure a single
 % model is optimized, which therefore removes the effect of the topology. While both approaches compute an equivalent gradient with the same expectation, we favored using a single IID node for CIFAR10 for the sake of training speed.
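The equivalence can be seen directly: with a fully-connected topology the mixing matrix is $W = \frac{1}{n}\mathbf{1}\mathbf{1}^\top$, so one averaging step leaves every node with the mean of the intermediate models, which is exactly what a single IID node computes with an $n$-times larger mini-batch. A toy numpy check under that assumption (shared initialization, illustrative sizes):

```python
# Toy check: one fully-connected D-SGD round == single node with batch n*b,
# assuming all nodes start from the same model x0 (illustrative sizes).
import numpy as np
n, d, lr = 4, 3, 0.1
rng = np.random.default_rng(1)
x0 = rng.normal(size=d)                      # shared initial model
grads = rng.normal(size=(n, d))              # one mini-batch gradient per node
W = np.full((n, n), 1.0 / n)                 # fully-connected, uniform weights
x_dsgd = W @ (np.tile(x0, (n, 1)) - lr * grads)   # every row is the same average
x_single = x0 - lr * grads.mean(axis=0)           # single node, aggregated batch
assert np.allclose(x_dsgd, x_single)
```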