\clearpage
\subsection{Scaling Behavior with Increasing Number of Nodes}
Section~\ref{section:interclique-topologies} compares the convergence speed of various inter-clique topologies at a scale of 1000 nodes. In this section, we show the effect of scaling the number of nodes by comparing the convergence speed with 1, 10, 100, and 1000 nodes, while adjusting the batch size to maintain a constant number of updates per epoch. We present results for the Ring, Fractal, Small-world, and Fully-Connected inter-clique topologies.
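
The batch size adjustment can be made explicit under assumptions not spelled out in this section: writing $D$ for the total number of training examples, assume the data is split evenly across the $n$ nodes and that all nodes perform one update per mini-batch in parallel. With a local batch size $b_n$, each node then performs $\frac{D/n}{b_n}$ updates per epoch, and keeping this number equal to the single-node value $\frac{D}{b_1}$ gives
\[
\frac{D/n}{b_n} = \frac{D}{b_1}
\quad\Longleftrightarrow\quad
b_n = \frac{b_1}{n},
\]
i.e., the local batch size is divided by the number of nodes at every scale.
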
Figure~\ref{fig:d-cliques-mnist-scaling-fully-connected} shows the results for
MNIST. For all topologies, we observe perfect scaling up to 100 nodes: the
accuracy curves overlap, with low variance between nodes. At 1000 nodes, the
variance between nodes increases significantly and convergence is slower, only
marginally so for Fully-Connected but significantly so for Fractal and Ring.
Small-world has higher variance between nodes but maintains a convergence
speed close to that of Fully-Connected.

Figure~\ref{fig:d-cliques-cifar10-scaling-fully-connected} shows the results
for CIFAR10. When increasing from 1 to 10 nodes (resulting in a single
fully-connected clique), there is actually a small increase in both final
accuracy and convergence speed. We believe this increase arises because, with
10 fully-connected non-IID nodes, the gradient is computed with exactly the
same number of examples from every class, whereas the gradient of a single
non-IID node may have a slightly larger bias because random sampling does not
guarantee that every class is perfectly represented in each batch. At a scale
of 100 nodes, there is no difference between Fully-Connected and Fractal, as
the connections are the same; however, the Ring already shows significantly
slower convergence. At 1000 nodes, convergence slows down significantly for
Fractal and Ring, while remaining close, albeit with a larger variance, for
Fully-Connected. Similar to MNIST, Small-world has higher variance and
slightly lower convergence speed than Fully-Connected but remains very close.
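
The class-representation argument can be illustrated with a small, hypothetical simulation that is not part of the experiments above; all values (10 classes, an aggregate batch of 200 examples, a clique of 10 single-class nodes) are illustrative assumptions. The aggregated clique batch is class-balanced by construction, whereas the single node's batch is balanced only in expectation:
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
num_classes, total_batch = 10, 200   # illustrative values

# 10 fully-connected non-IID nodes, one class each, local batch 20:
# the aggregated batch contains exactly 20 examples of every class.
clique_counts = np.full(num_classes, total_batch // num_classes)

# A single node drawing the same 200 examples at random from pooled
# data: per-class counts follow a multinomial and fluctuate per batch.
single_counts = rng.multinomial(total_batch,
                                np.ones(num_classes) / num_classes)

print("clique (exact):  ", clique_counts)
print("single (random): ", single_counts)
\end{verbatim}
Averaged over many batches the two settings coincide, which is consistent with the small size of the effect observed here.
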
We therefore conclude that Fully-Connected and Small-world have good scaling
properties in terms of convergence speed, and that the
linear-logarithmic number of edges of Small-world makes it the best compromise
between convergence speed and connectivity, and thus the best choice for
efficient large-scale decentralized learning in practice.
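
To give a rough sense of the connectivity gap behind this compromise, consider the largest scale with cliques of 10 nodes, i.e., $c = 100$ cliques (the clique size is an assumption carried over from the 10-class setting, not restated in this section). A fully-connected inter-clique graph requires $c(c-1)/2 = 4950$ inter-clique edges, whereas a Small-world-style construction that gives each clique a number of inter-clique edges logarithmic in $c$ requires on the order of $c \log_2 c \approx 664$, an order of magnitude fewer while, as shown above, converging almost as fast.
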
\begin{figure}[htbp]
\centering
% To regenerate the figure, from directory results/scaling
...
...