diff --git a/figures/d-cliques-cifar10-1000-vs-1-node-test-accuracy.png b/figures/d-cliques-cifar10-1000-vs-1-node-test-accuracy.png
index f022706391c79fa857de417027a35fbfbb52b8a0..9bd7f7be0d0325d76e3f5f80b0af6a0bd14764ca 100644
Binary files a/figures/d-cliques-cifar10-1000-vs-1-node-test-accuracy.png and b/figures/d-cliques-cifar10-1000-vs-1-node-test-accuracy.png differ
diff --git a/figures/d-cliques-cifar10-1000-vs-1-node-training-loss.png b/figures/d-cliques-cifar10-1000-vs-1-node-training-loss.png
index 111ecf77c48c0b3c03bb20df468adf00edb13b46..f82ce9047d46dbd27ff3def129bd7a9ab272ef13 100644
Binary files a/figures/d-cliques-cifar10-1000-vs-1-node-training-loss.png and b/figures/d-cliques-cifar10-1000-vs-1-node-training-loss.png differ
diff --git a/main.tex b/main.tex
index 198fbf779a0e765f122e2b5addbeeca184da7e60..bebaca5ee6ed9a049dfe80494230bef01f5c2c51 100644
--- a/main.tex
+++ b/main.tex
@@ -149,8 +149,9 @@ To summarize, our contributions are as follows:
 \begin{enumerate}
 \item we show the significant impact of topology on convergence speed in the presence of non-IID data in decentralized learning;
 \item we propose the D-Cliques topology to remove the impact of non-IID data on convergence speed, similar to a fully-connected topology. At a scale of 1000 nodes, this represents a 98\% reduction in the number of edges ($18.9$ vs $999$ edges per node on average) and a 96\% reduction in the total number of required messages;
-\item we show how to leverage D-Cliques to implement momentum, a critical optimization technique to quickly train convolutional networks, that otherwise significantly \textit{decreases} convergence speed in the presence of non-IID data;
-\item we show that, among the many possible choices of inter-clique topologies, a smallworld topology provides a convergence speed close to fully-connecting all cliques pairwise, but requires only $O(n + log(n))$ instead of $O(n^2)$ edges where $n$ is the number of nodes. At a scale of 1000 nodes, this represents a 22\% reduction in the number of edges compared to fully-connecting cliques ($14.6$ vs $18.9$ edges per node on average) and suggests possible bigger gains at larger scales.
+\item we show how to leverage cliques to (1) remove the gradient bias that originates from inter-clique edges;
+and (2) implement momentum, a critical optimization technique for quickly training convolutional networks, which otherwise significantly \textit{decreases} convergence speed in the presence of non-IID data;
+\item we show that, among the many possible choices of inter-clique topologies, a small-world topology provides a convergence speed close to that of fully connecting all cliques pairwise, but requires only $O(n + \log(n))$ instead of $O(n^2)$ edges, where $n$ is the number of nodes. At a scale of 1000 nodes, this represents a further 22\% reduction in the number of edges compared to fully connecting cliques ($14.6$ vs $18.9$ edges per node on average) and suggests potentially bigger gains at larger scales.
 \end{enumerate}
 The rest of this paper is organized as follows.
 \aurelien{TO COMPLETE}
@@ -526,7 +527,7 @@ In addition, it is important that all nodes are initialized with the same model
 \begin{figure}[htbp]
 \centering
 % To regenerate the figure, from directory results/cifar10
-% python ../../../learn-topology/tools/plot_convergence.py 1-node-iid/all/2021-03-10-13:52:58-CET ../scaling/1000/cifar10/fully-connected-cliques/all/2021-03-14-17:41:20-CET ../scaling/1000/cifar10/fractal-cliques/all/2021-03-14-17:42:46-CET ../scaling/1000/cifar10/clique-ring/all/2021-03-14-09:55:24-CET --add-min-max --yaxis training-loss --labels '1-node IID' 'd-cliques (fully-connected cliques)' 'd-cliques (fractal)' 'd-cliques (ring)' --legend 'upper right' --ymax 3 --save-figure ../../figures/d-cliques-cifar10-1000-vs-1-node-training-loss.png
+% python ../../../learn-topology/tools/plot_convergence.py 1-node-iid/all/2021-03-10-13:52:58-CET ../scaling/1000/cifar10/fully-connected-cliques/all/2021-03-14-17:41:20-CET ../scaling/1000/cifar10/smallworld-logn-cliques/all/2021-03-23-22:13:57-CET ../scaling/1000/cifar10/fractal-cliques/all/2021-03-14-17:42:46-CET ../scaling/1000/cifar10/clique-ring/all/2021-03-14-09:55:24-CET --add-min-max --yaxis training-loss --labels '1-node IID' 'd-cliques (fully-connected cliques)' 'd-cliques (smallworld)' 'd-cliques (fractal)' 'd-cliques (ring)' --legend 'upper right' --ymax 3 --save-figure ../../figures/d-cliques-cifar10-1000-vs-1-node-training-loss.png
 \begin{subfigure}[b]{0.48\textwidth}
 \centering
 \includegraphics[width=\textwidth]{figures/d-cliques-cifar10-1000-vs-1-node-training-loss}
@@ -534,7 +535,7 @@ In addition, it is important that all nodes are initialized with the same model
 \end{subfigure}
 \hfill
 % To regenerate the figure, from directory results/cifar10
-% python ../../../learn-topology/tools/plot_convergence.py 1-node-iid/all/2021-03-10-13:52:58-CET ../scaling/1000/cifar10/fully-connected-cliques/all/2021-03-14-17:41:20-CET ../scaling/1000/cifar10/fractal-cliques/all/2021-03-14-17:42:46-CET ../scaling/1000/cifar10/clique-ring/all/2021-03-14-09:55:24-CET --add-min-max --yaxis test-accuracy --labels '1-node IID' 'd-cliques (fully-connected cliques)' 'd-cliques (fractal)' 'd-cliques (ring)' --legend 'lower right' --save-figure ../../figures/d-cliques-cifar10-1000-vs-1-node-test-accuracy.png
+% python ../../../learn-topology/tools/plot_convergence.py 1-node-iid/all/2021-03-10-13:52:58-CET ../scaling/1000/cifar10/fully-connected-cliques/all/2021-03-14-17:41:20-CET ../scaling/1000/cifar10/smallworld-logn-cliques/all/2021-03-23-22:13:57-CET ../scaling/1000/cifar10/fractal-cliques/all/2021-03-14-17:42:46-CET ../scaling/1000/cifar10/clique-ring/all/2021-03-14-09:55:24-CET --add-min-max --yaxis test-accuracy --labels '1-node IID' 'd-cliques (fully-connected cliques)' 'd-cliques (smallworld)' 'd-cliques (fractal)' 'd-cliques (ring)' --legend 'lower right' --save-figure ../../figures/d-cliques-cifar10-1000-vs-1-node-test-accuracy.png
 \begin{subfigure}[b]{0.48\textwidth}
 \centering
 \includegraphics[width=\textwidth]{figures/d-cliques-cifar10-1000-vs-1-node-test-accuracy}
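
For context on the two rewritten contribution items above, two rough sketches follow. First, the small-world inter-clique construction: this is a hypothetical illustration only, not the learn-topology implementation; the function name, the ring arrangement of cliques, and the power-of-two offsets are all assumptions. It shows how fully-connected cliques plus exponentially spaced shortcuts stay far below the O(n^2) edges of pairwise clique connection.

# Hypothetical sketch of a small-world D-Cliques topology (NOT the
# learn-topology code): cliques are fully connected internally, arranged
# on a ring, and each clique also links to the cliques at power-of-two
# offsets (1, 2, 4, ...), giving O(n) intra-clique edges in total plus
# O(m log m) inter-clique edges for m cliques.
import itertools

def d_cliques_smallworld(n: int, clique_size: int = 10) -> set[tuple[int, int]]:
    """Return an undirected edge set over nodes 0..n-1 (hypothetical helper)."""
    assert n % clique_size == 0, "sketch assumes equal-size cliques"
    cliques = [list(range(i, i + clique_size)) for i in range(0, n, clique_size)]
    edges: set[tuple[int, int]] = set()
    # Fully connect each clique internally.
    for clique in cliques:
        for u, v in itertools.combinations(clique, 2):
            edges.add((u, v))
    # Link each clique to the cliques at offsets 1, 2, 4, ... on the ring,
    # rotating through clique members to spread the inter-clique load.
    m = len(cliques)
    for i in range(m):
        offset = 1
        while offset < m:
            j = (i + offset) % m
            u = cliques[i][offset % clique_size]
            v = cliques[j][offset % clique_size]
            edges.add((min(u, v), max(u, v)))
            offset *= 2
    return edges

# Example: 1000 nodes in 100 cliques of 10.
edges = d_cliques_smallworld(1000, 10)
print(2 * len(edges) / 1000)  # average degree under this sketch: 10.4

Under these assumptions the example prints an average degree of 10.4; the 14.6 edges per node reported at 1000 nodes comes from the paper's actual construction, which this sketch does not attempt to reproduce exactly.

Second, the clique-based gradient unbiasing and momentum: the sketch below is one plausible reading of the rewritten contribution item (the function name, arguments, and exact update order are assumptions). Gradients are averaged over intra-clique neighbors only, since each clique's joint data distribution is representative by construction, while model averaging keeps using every edge.

import numpy as np

def d_sgd_momentum_step(x, v, grads, models, clique, neighbors,
                        lr=0.1, beta=0.9):
    # grads/models map a neighbor id to its gradient / model parameters;
    # clique holds intra-clique neighbor ids (self included), neighbors
    # holds all neighbor ids (self and inter-clique edges included).
    # (1) Unbiased gradient: average over the clique only, so edges that
    #     leave the clique contribute no gradient bias.
    g = np.mean([grads[j] for j in clique], axis=0)
    # (2) Momentum on the unbiased clique-level average, then a local step.
    v = beta * v + g
    x = x - lr * v
    # (3) Model averaging still uses all edges, inter-clique included.
    x = np.mean([models[j] for j in neighbors] + [x], axis=0)
    return x, v

Applying momentum to the clique-level gradient average rather than to the raw local gradient is what keeps the momentum term unbiased under non-IID local data.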