Commit 43cd9379 authored by aurelien.bellet

AMK fixes

parent 2be9a4e9
@@ -348,7 +348,7 @@ CIFAR10~\cite{krizhevsky2009learning}, which both have $c=10$ classes.
For MNIST, we use 45k and 10k examples from the original 60k
training set for training and validation respectively. The remaining 5k
training examples were randomly removed to ensure all 10 classes are balanced
-while ensuring the dataset is evenly divisible across 100 and 1000 nodes.
+while ensuring that the dataset is evenly divisible across 100 and 1000 nodes.
We use all 10k examples of
the test set to measure prediction accuracy. For CIFAR10, classes are evenly
balanced: we use 45k/50k images of the original training set for training,
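A minimal sketch of the balanced split described in this hunk, assuming 4,500 examples per class (45k / 10 classes) and using stand-in labels; the helper name and random labels are illustrative assumptions, not code from the paper:

    import numpy as np

    def balanced_subset(labels, per_class=4500, seed=0):
        """Pick a class-balanced subset of example indices.

        Keeping per_class examples of each of the 10 classes yields a
        45k-example training set that splits evenly across 100 or 1000 nodes.
        """
        rng = np.random.default_rng(seed)
        keep = []
        for c in np.unique(labels):
            idx = np.flatnonzero(labels == c)
            keep.append(rng.choice(idx, size=per_class, replace=False))
        return np.concatenate(keep)

    # Stand-in labels for the 60k MNIST training examples.
    labels = np.random.randint(0, 10, size=60000)
    train_idx = balanced_subset(labels)
    assert len(train_idx) % 100 == 0 and len(train_idx) % 1000 == 0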
@@ -381,7 +381,8 @@ been sampled by a node, i.e. an \textit{epoch}. This is equivalent to the classi
To further make results comparable across different number of nodes, we lower
the batch size proportionally to the number of nodes added, and inversely,
e.g. on MNIST, 128 with 100 nodes vs. 13 with 1000 nodes. This
-ensures the same number of model updates and averaging per epoch, which is
+ensures that the number of model updates and averaging per epoch remains the
+same, which is
important to have a fair comparison.\footnote{Updating and averaging models
after every example can eliminate the impact of local class bias. However, the
resulting communication overhead is impractical.}
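A minimal sketch of this batch-size scaling, taking the reference point of 128 examples with 100 nodes from the text; the ceiling rounding rule is an assumption:

    import math

    def local_batch_size(n_nodes, ref_batch=128, ref_nodes=100):
        # Keep the total number of examples processed per update constant,
        # so the number of model updates and averaging steps per epoch
        # does not change with the number of nodes.
        return max(1, math.ceil(ref_batch * ref_nodes / n_nodes))

    print(local_batch_size(100))   # 128
    print(local_batch_size(1000))  # 13, matching the MNIST example above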
@@ -543,9 +544,10 @@ introduced by inter-clique edges. We address this issue in the next section.
\section{Optimizing with Clique Averaging and Momentum}
\label{section:clique-averaging-momentum}
-In this sectio, we present Clique Averaging, a simple modification of D-SGD
-which removes the bias caused by the inter-cliques edges of
-D-Cliques, and show how this can be used to successfully implement momentum
+In this section, we present Clique Averaging, a feature that we add to
+D-SGD in order to remove the bias caused by the inter-clique edges of
+D-Cliques. We then show how this can be used to successfully implement
+momentum
for non-IID data.
%AMK: check
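A rough sketch of one local step with Clique Averaging as described above, not the paper's reference implementation: gradients are averaged over clique neighbours only (an unbiased estimate under the D-Cliques construction) and momentum is applied to that estimate, while model averaging still uses all neighbours. Uniform mixing weights and all names are assumptions:

    import numpy as np

    def dsgd_step_with_clique_averaging(x_i, clique_grads, neighbor_models,
                                        momentum_buf, lr=0.1, beta=0.9):
        """One illustrative local step of D-SGD with Clique Averaging.

        clique_grads: gradients from this node and its clique neighbours only.
        neighbor_models: models from all neighbours, including inter-clique edges.
        """
        g = np.mean(clique_grads, axis=0)        # Clique Averaging: unbiased gradient
        momentum_buf = beta * momentum_buf + g   # momentum on the unbiased estimate
        x_i = x_i - lr * momentum_buf            # local SGD update
        x_i = np.mean(neighbor_models + [x_i], axis=0)  # model averaging (uniform weights assumed)
        return x_i, momentum_buf

    # Tiny usage example with toy vectors.
    x = np.zeros(3)
    grads = [np.ones(3), 2 * np.ones(3)]   # self + one clique neighbour
    peers = [np.ones(3), -np.ones(3)]      # models from all neighbours
    x, buf = dsgd_step_with_clique_averaging(x, grads, peers, np.zeros(3))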
@@ -802,7 +804,7 @@ average shortest path to $2$ between any pair of nodes. This choice requires $
in the number of nodes. This can become significant at larger scales when $n$ is
large compared to $c$.
-In this last series of experiment, we evaluate the effect of choosing sparser
+In this last series of experiments, we evaluate the effect of choosing sparser
inter-clique topologies on the convergence speed for a larger network of 1000
nodes. We compare the scalability and convergence speed of several
D-Cliques variants, which all use $O(nc)$ edges
......
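A small sketch of why sparser inter-clique topologies matter at this scale, assuming one edge per connected pair of cliques; the fully-connected and ring patterns and their counts are illustrative comparisons, not figures from the paper:

    def interclique_edges(n, c=10):
        """Count inter-clique edges for two illustrative patterns,
        assuming n nodes grouped into k = n // c cliques of size c."""
        k = n // c
        fully_connected = k * (k - 1) // 2  # every pair of cliques linked: quadratic in n
        ring = k                            # each clique linked to the next: linear in n
        return fully_connected, ring

    for n in (100, 1000):
        print(n, interclique_edges(n))
    # 100 -> (45, 10); 1000 -> (4950, 100): the dense pattern quickly dominates.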