diff --git a/figures/d-clique-mnist-clique-avg.png b/figures/d-clique-mnist-clique-avg.png new file mode 100644 index 0000000000000000000000000000000000000000..e15a177a7bebd6f9716f8a47a816541912e16af5 Binary files /dev/null and b/figures/d-clique-mnist-clique-avg.png differ diff --git a/figures/d-cliques-cifar10-momentum-non-iid-clique-avg-effect.png b/figures/d-cliques-cifar10-momentum-non-iid-clique-avg-effect.png new file mode 100644 index 0000000000000000000000000000000000000000..a3f86a0788b4667838b46861e716aa2d02eabf69 Binary files /dev/null and b/figures/d-cliques-cifar10-momentum-non-iid-clique-avg-effect.png differ diff --git a/figures/d-cliques-cifar10-momentum-non-iid-effect.png b/figures/d-cliques-cifar10-momentum-non-iid-effect.png new file mode 100644 index 0000000000000000000000000000000000000000..ac11195ababb2f507516674108e7c9b18a66b439 Binary files /dev/null and b/figures/d-cliques-cifar10-momentum-non-iid-effect.png differ diff --git a/figures/d-cliques-mnist-vs-fully-connected.png b/figures/d-cliques-mnist-vs-fully-connected.png new file mode 100644 index 0000000000000000000000000000000000000000..8721f3fd8332d2acb0f65751f18cdf4417e52206 Binary files /dev/null and b/figures/d-cliques-mnist-vs-fully-connected.png differ diff --git a/figures/d-cliques-vs-fully-connected.png b/figures/d-cliques-vs-fully-connected.png new file mode 100644 index 0000000000000000000000000000000000000000..91ea3e9be3a1c6b8d8be0432389e272e50a6b801 Binary files /dev/null and b/figures/d-cliques-vs-fully-connected.png differ diff --git a/main.tex b/main.tex index 7df0cab2177b5416aa895f963e8e8c96ed3ae6e6..c4306e153f4963076298727a4526bf4fc202e05d 100644 --- a/main.tex +++ b/main.tex @@ -337,7 +337,7 @@ We solve this problem by decoupling the gradient averaging from the weight avera \end{figure} -\section{Implementing Momentum} +\subsection{Implementing Unbiased Momentum with Clique Averaging} Momentum (CITE), which increases the magnitude of the components of the gradient that 
are shared between several consecutive steps, is critical for making convolutional networks converge quickly. However, it relies on mini-batches being IID; otherwise, it greatly increases the variance between nodes and is in fact detrimental to convergence speed. @@ -352,6 +352,27 @@ x_i^{(k-\frac{1}{2})} \leftarrow x_i^{(k-1)} - \gamma v_i^{(k)} In addition, it is important that all nodes are initialized with the same model parameters at the beginning. Otherwise, the random initialization of models introduces another source of variance that persists over many steps. In combination with D-Cliques (Algorithm~\ref{Algorithm:Clique-Unbiased-D-PSGD}), this provides the convergence results of Figure~\ref{fig:d-cliques-cifar10-convolutional}. To assess how far this is from an ``optimal'' solution, in which the delay introduced by multiple hops between nodes is completely removed, we also show the convergence speed of a single node that computes its average gradient from the samples obtained by all nodes in a single round. The results show that, apart from the variance introduced by the multiple hops between nodes, which slows down the distributed averaging of models, the average convergence speed is close to that of the optimal solution, in which the distributed average is computed exactly at every step. 
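The update rule above (a clique-averaged gradient feeding the momentum buffer, followed by the usual parameter step) can be sketched as follows. This is a minimal illustration, not the paper's implementation; the function name `clique_unbiased_momentum_step` and its signature are assumptions made for the example:

```python
import numpy as np

def clique_unbiased_momentum_step(x, v, clique_gradients, lr=0.002, m=0.9):
    """One local step of momentum SGD with Clique Averaging.

    The momentum buffer is updated with the average of the gradients of
    all nodes in the clique, which removes the bias introduced by each
    node's non-IID local distribution before momentum amplifies it.
    """
    g = np.mean(clique_gradients, axis=0)  # unbiased gradient: clique average
    v = m * v + g                          # v_i^{(k)} = m * v_i^{(k-1)} + g
    x = x - lr * v                         # x_i^{(k-1/2)} = x_i^{(k-1)} - gamma * v_i^{(k)}
    return x, v
```

Because the averaged gradient is shared by all clique members, applying momentum to it (rather than to each node's biased local gradient) keeps the momentum buffers consistent across the clique.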
+\begin{figure}[htbp] + \centering + % To regenerate figure, from results/cifar10 + % python ../../../learn-topology/tools/plot_convergence.py 1-node-iid/all/2021-03-10-13:52:58-CET no-init-no-clique-avg/fully-connected-cliques/all/2021-03-13-18:34:35-CET --legend 'center right' --add-min-max --labels '1-node IID w/ momentum' '100 nodes non-IID d-cliques w/ momentum' --font-size 14 --yaxis test-accuracy --save-figure ../../figures/d-cliques-cifar10-momentum-non-iid-effect.png + \begin{subfigure}[b]{0.45\textwidth} + \centering + \includegraphics[width=\textwidth]{figures/d-cliques-cifar10-momentum-non-iid-effect} + \caption{\label{fig:d-cliques-cifar10-momentum-non-iid-effect} Without Clique Averaging} + \end{subfigure} + \hfill + % To regenerate figure, from results/cifar10 + % python ../../../learn-topology/tools/plot_convergence.py 1-node-iid/all/2021-03-10-13:52:58-CET no-init/fully-connected-cliques/all/2021-03-13-18:32:55-CET --legend 'lower right' --add-min-max --labels '1-node IID w/ momentum' '100 nodes non-IID d-clique w/ momentum' --font-size 14 --yaxis test-accuracy --save-figure ../../figures/d-cliques-cifar10-momentum-non-iid-clique-avg-effect.png + \begin{subfigure}[b]{0.45\textwidth} + \centering + \includegraphics[width=\textwidth]{figures/d-cliques-cifar10-momentum-non-iid-clique-avg-effect} + \caption{\label{fig:d-cliques-cifar10-momentum-non-iid-clique-avg-effect} With Clique Averaging} + \end{subfigure} + +\caption{\label{fig:cifar10-momentum} Effect of Momentum with Non-IID Data} +\end{figure} + \section{Scaling with Different Inter-Clique Topologies} diff --git a/results/cifar10/fully-connected-cliques-no-momentum/experiments.sh b/results/cifar10/fully-connected-cliques-no-momentum/experiments.sh new file mode 100755 index 0000000000000000000000000000000000000000..acfd83c0ffb9eb32925dfebf653536f2be22d09b --- /dev/null +++ b/results/cifar10/fully-connected-cliques-no-momentum/experiments.sh @@ -0,0 +1,14 @@ +#!/usr/bin/env bash 
+TOOLS=../../../../learn-topology/tools; CWD="$(pwd)"; cd $TOOLS +BSZS=' + 20 + ' +LRS=' + 0.002 + ' +for BSZ in $BSZS; + do for LR in $LRS; + do python sgp-mnist.py --nb-nodes 100 --nb-epochs 100 --local-classes 1 --seed 1 --nodes-per-class 10 10 10 10 10 10 10 10 10 10 --global-train-ratios 1 1 1 1 1 1 1 1 1 1 --dist-optimization d-psgd --topology fully-connected-cliques --metric dissimilarity --learning-momentum 0.0 --sync-per-mini-batch 1 --results-directory $CWD/all --learning-rate $LR --batch-size $BSZ "$@" --single-process --nb-logging-processes 10 --dataset cifar10 --model gn-lenet --clique-gradient --initial-averaging --accuracy-logging-interval 10 --validation-set-ratio 0.5 + done; +done; +