diff --git a/main.tex b/main.tex
index 0887b8242aeeaece2d8accca3084ffe79da7cee9..d39d217e1e27f81fa1b979961347439a76fa67ac 100644
--- a/main.tex
+++ b/main.tex
@@ -235,7 +235,7 @@ study in Section~\ref{section:non-clustered}. We review some related work in
 
 \label{section:problem}
 
-We consider a set of $n$ nodes $N = \{1, \dots, n \}$ seeking to
+We consider a set $N = \{1, \dots, n \}$ of $n$ nodes seeking to
 collaboratively solve a classification task with $c$ classes.
 % where each node can communicate with its neighbours according to the mixing matrix $W$ in which $W_{ij}$ defines the \textit{weight} of the outgoing connection from node $i$ to $j$. $W_{ij} = 0$ means that there is no connection from node $i$ to $j$ and $W_{ij} > 0$ means there is a connection. %AMK:explain the weight
 
@@ -276,12 +276,13 @@ This weighted average is defined by a mixing matrix $W$, in which
 $W_{ij}$ corresponds to the weight of the outgoing connection from node $i$
 to $j$ and $W_{ij} = 0$ for $ \{i,j\}\notin
-E$. To ensure that the local models converge on average to a (local) optimum
+E$. To ensure that the local models converge on average to a stationary
+point of Problem \eqref{eq:dist-optimization-problem},
 $W$ must be doubly
 stochastic ($\sum_{j \in N} W_{ij} = 1$ and $\sum_{j \in N} W_{ji} = 1$) and
-symmetric, i.e. $W_{ij} = W_{ji}$, see \cite{lian2017d-psgd}.
+symmetric, i.e. $W_{ij} = W_{ji}$~\cite{lian2017d-psgd}.
 
 \begin{algorithm}[t]
   \caption{D-SGD, Node $i$}
@@ -349,7 +350,7 @@ training set for training and validation respectively. The remaining
 5k training examples were randomly removed to ensure all 10 classes are
 balanced while ensuring the dataset is evenly divisible across 100 and 1000
 nodes. We use all 10k examples of
-the test set to measure test accuracy. For CIFAR10, classes are evenly
+the test set to measure prediction accuracy. For CIFAR10, classes are evenly
 balanced: we use 45k/50k images of the original training set for training,
 5k/50k for validation, and all 10k examples of the test set for measuring
 prediction accuracy.
@@ -365,8 +366,8 @@ centralized setting. % compared to the 99\% achieved by start-of-the-art.
 These models are thus reasonably accurate (which is sufficient to study the
 effect of the topology) while being sufficiently fast to train in a
-fully decentralized setting, and are simple enough to configure and analyze.
-Regarding hyper-parameters, we jointly optimized the learning rate and
+fully decentralized setting and simple enough to configure and analyze.
+Regarding hyper-parameters, we jointly optimize the learning rate and
 mini-batch size on the validation set for 100 nodes, obtaining respectively
 $0.1$ and $128$ for MNIST and $0.002$ and $20$ for CIFAR10.
@@ -379,15 +380,15 @@ average) as a function of the number of times each example of the dataset has
 been sampled by a node, i.e. an \textit{epoch}. This is equivalent to the
 classic case of a single node sampling the full distribution. To further make
 results comparable across different number of nodes, we lower the batch size
 proportionally to the number of nodes added, and inversely,
-e.g. on MNIST, 12800 with 1 node, 128 with 100 nodes, 13 with 1000 nodes. This
-ensures the same number of model updates and averaging per epoch, which is is
+e.g. on MNIST, 128 with 100 nodes vs. 13 with 1000 nodes. This
+ensures the same number of model updates and averaging per epoch, which is
 important to have a fair comparison.\footnote{Updating and averaging models
 after every example can eliminate the impact of local class bias. However,
 the resulting communication overhead is impractical.} Finally, we compare our
 results against an ideal baseline: either a fully-connected network topology
 with the same number of nodes or a single IID
-node. In both approaches, the topology has no effect on
+node. In both cases, the topology has no effect on
 the optimization. For a certain choice of number of nodes and mini-batch size,
 both approaches are equivalent. %ensure a single
 % model is optimized, which therefore removes the effect of the topology.
 While, both approaches compute an equivalent gradient with the same
 expectation, we favored using a single IID node for CIFAR10 for the sake of
 training speed.
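For intuition on the doubly stochastic and symmetric requirement on $W$, here is an illustrative sketch that is not taken from main.tex: on a hypothetical 3-node path topology with edges $\{1,2\}$ and $\{2,3\}$, Metropolis-Hastings weights give a mixing matrix satisfying both conditions, since every row and every column sums to 1 and $W = W^\top$.

% Illustrative only: assumed 3-node path topology, Metropolis-Hastings weights
% W_{ij} = 1/(1 + max(d_i, d_j)) for {i,j} in E, and W_{ii} = 1 - sum_{j != i} W_{ij}.
\[
W =
\begin{pmatrix}
  \tfrac{2}{3} & \tfrac{1}{3} & 0            \\
  \tfrac{1}{3} & \tfrac{1}{3} & \tfrac{1}{3} \\
  0            & \tfrac{1}{3} & \tfrac{2}{3}
\end{pmatrix},
\qquad
\sum_{j \in N} W_{ij} = \sum_{j \in N} W_{ji} = 1,
\qquad
W_{ij} = W_{ji}.
\]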
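The batch-size scaling described above can be read as keeping the effective global batch per averaging step constant. As a back-of-the-envelope check (assuming the quoted MNIST figure of 128 samples per node at 100 nodes), the per-node mini-batch $b(n)$ for $n$ nodes would be roughly:

% Rough sketch only: effective global batch assumed fixed at 100 x 128 = 12800
% samples per step, divided evenly across the n nodes.
\[
b(n) \approx \frac{100 \times 128}{n},
\qquad
b(100) = 128,
\qquad
b(1000) = 12.8 \approx 13 .
\]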