Commit 43cd9379 authored by aurelien.bellet

AMK fixes

parent 2be9a4e9
@@ -348,7 +348,7 @@ CIFAR10~\cite{krizhevsky2009learning}, which both have $c=10$ classes.
 For MNIST, we use 45k and 10k examples from the original 60k
 training set for training and validation respectively. The remaining 5k
 training examples were randomly removed to ensure all 10 classes are balanced
-while ensuring the dataset is evenly divisible across 100 and 1000 nodes.
+while ensuring that the dataset is evenly divisible across 100 and 1000 nodes.
 We use all 10k examples of
 the test set to measure prediction accuracy. For CIFAR10, classes are evenly
 balanced: we use 45k/50k images of the original training set for training,
@@ -381,7 +381,8 @@ been sampled by a node, i.e. an \textit{epoch}. This is equivalent to the classi
 To further make results comparable across different number of nodes, we lower
 the batch size proportionally to the number of nodes added, and inversely,
 e.g. on MNIST, 128 with 100 nodes vs. 13 with 1000 nodes. This
-ensures the same number of model updates and averaging per epoch, which is
+ensures that the number of model updates and averaging per epoch remains the
+same, which is
 important to have a fair comparison.\footnote{Updating and averaging models
 after every example can eliminate the impact of local class bias. However, the
 resulting communication overhead is impractical.}
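The batch-size scaling discussed above (lower the per-node batch size inversely with the number of nodes, so the number of model updates per epoch stays the same) can be sketched as follows. This is a minimal illustration, not the authors' code; the helper name is ours:

```python
# Sketch (assumption: simple inverse scaling with rounding, as suggested by
# the 128-with-100-nodes vs. 13-with-1000-nodes example in the text).

def scaled_batch_size(base_batch: int, base_nodes: int, nodes: int) -> int:
    """Per-node batch size for `nodes` nodes, keeping the global batch
    (and hence the number of updates and averaging steps per epoch)
    roughly constant."""
    return max(1, round(base_batch * base_nodes / nodes))

# MNIST example from the text:
print(scaled_batch_size(128, 100, 1000))  # -> 13
```

With a fixed training-set size, a constant global batch yields the same number of update/averaging rounds per epoch regardless of the node count, which is the fairness property the hunk describes.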
@@ -543,9 +544,10 @@ introduced by inter-clique edges. We address this issue in the next section.
 \section{Optimizing with Clique Averaging and Momentum}
 \label{section:clique-averaging-momentum}
-In this sectio, we present Clique Averaging, a simple modification of D-SGD
-which removes the bias caused by the inter-cliques edges of
-D-Cliques, and show how this can be used to successfully implement momentum
+In this section, we present Clique Averaging, a feature that we add to
+D-SGD in order to remove the bias caused by the inter-cliques edges of
+D-Cliques. We then show how this can be used to successfully implement
+momentum
 for non-IID data.
 %AMK: check
@@ -802,7 +804,7 @@ average shortest path to $2$ between any pair of nodes. This choice requires $
 in the number of nodes. This can become significant at larger scales when $n$ is
 large compared to $c$.
-In this last series of experiment, we evaluate the effect of choosing sparser
+In this last series of experiments, we evaluate the effect of choosing sparser
 inter-clique topologies on the convergence speed for a larger network of 1000
 nodes. We compare the scalability and convergence speed of several
 D-Cliques variants, which all use $O(nc)$ edges
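The $O(nc)$ intra-clique edge count, and why fully connecting cliques to each other grows faster, can be checked with a short calculation. This is our own illustration, not the paper's code; the "one edge per pair of cliques" inter-clique scheme is an assumption used only to show the quadratic growth:

```python
# Illustration: edge counts for n nodes split into cliques of size c.

def intra_clique_edges(n: int, c: int) -> int:
    """Edges inside cliques: (n/c) cliques of c(c-1)/2 edges each,
    i.e. n(c-1)/2 total -- the O(nc) term from the text."""
    assert n % c == 0, "n must be divisible by the clique size"
    return (n // c) * c * (c - 1) // 2

def pairwise_inter_clique_edges(n: int, c: int) -> int:
    """Assumption: one inter-clique edge per pair of cliques, giving
    (n/c choose 2) edges, which grows quadratically in n."""
    k = n // c
    return k * (k - 1) // 2

# 1000 nodes, cliques of size c = 10:
print(intra_clique_edges(1000, 10))          # -> 4500
print(pairwise_inter_clique_edges(1000, 10))  # -> 4950
```

At 1000 nodes the pairwise inter-clique edges already outnumber the intra-clique ones, which matches the text's point that this cost becomes significant when $n$ is large compared to $c$, and motivates the sparser inter-clique topologies evaluated in this experiment.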