@@ -254,7 +254,8 @@ We solve this problem by decoupling the gradient averaging from the weight avera
\end{algorithm}
\section{Evaluation}
\subsection{MNIST and Linear Model}
...
...
@@ -264,16 +265,23 @@ We solve this problem by decoupling the gradient averaging from the weight avera
\caption{\label{fig:d-cliques-mnist-linear} D-Cliques with Linear Model on MNIST.}
\end{figure}
TODO: Update figure to use decoupled gradient averaging (will probably reduce variance and accelerate convergence speed)\\
TODO: Add D-Cliques arranged in a Ring instead of Fully-Connected
\subsection{CIFAR10 and Convolutional Model}
Momentum (CITE), which increases the magnitude of the components of the gradient that are shared between several consecutive steps, is critical for making convolutional networks converge quickly. However, it relies on mini-batches being IID; otherwise, it greatly increases the variance between nodes and is actually detrimental to convergence speed.
Momentum can easily be used with D-Cliques, simply by calculating it from the clique-unbiased average gradient $g_i^{(k)}$ of Algorithm~\ref{Algorithm:Clique-Unbiased-D-PSGD}:
\begin{equation}
v_i^{(k)}\leftarrow m v_i^{(k-1)} + g_i^{(k)}
\end{equation}
It then suffices to modify the original gradient step to use momentum:
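One natural way to write this modified step, assuming the usual D-PSGD local update with learning rate $\gamma$ and intermediate model $x_i^{(k-\frac{1}{2})}$, is to replace the gradient with the momentum term:
\begin{equation}
x_i^{(k-\frac{1}{2})} \leftarrow x_i^{(k-1)} - \gamma v_i^{(k)}
\end{equation}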
In addition, it is important that all nodes are initialized with the same model parameters. Otherwise, the random initialization of models introduces another source of variance that persists over many steps. In combination with D-Cliques (Algorithm~\ref{Algorithm:Clique-Unbiased-D-PSGD}), this provides the convergence results of Figure~\ref{fig:d-cliques-cifar10-convolutional}. To assess how far this is from an ``optimal'' solution, in which the delay introduced by multiple hops between nodes is completely removed, we also show the convergence speed of a single node that computes its average gradient from all the samples obtained by all nodes in a single round, i.e. the distributed average is computed exactly at every step. The results show that, apart from the variance introduced by the multiple hops between nodes, which slows down the distributed averaging of models, the average convergence speed is close to this optimal baseline.
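For illustration only, the following sketch simulates this combination of clique-unbiased gradients, momentum, and identical initialization on a toy topology; the helper \texttt{local\_gradient}, the clique layout, and all constants are placeholder assumptions rather than the setup used for Figure~\ref{fig:d-cliques-cifar10-convolutional}.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: 8 nodes in two fully-connected cliques of 4; every node
# starts from the SAME initial model, removing one source of variance.
n, d, gamma, m = 8, 5, 0.1, 0.9
cliques = [list(range(0, 4)), list(range(4, 8))]
x0 = rng.normal(size=d)
x = [x0.copy() for _ in range(n)]      # identical initialization
v = [np.zeros(d) for _ in range(n)]    # momentum buffers

def local_gradient(i, params):
    # Placeholder for the local mini-batch gradient of node i.
    return params - i

def clique_of(i):
    return next(c for c in cliques if i in c)

for k in range(10):
    half_step = []
    for i in range(n):
        # 1) Clique-unbiased gradient: average the gradients received
        #    from the clique members, so all classes are represented.
        g = np.mean([local_gradient(j, x[j]) for j in clique_of(i)], axis=0)
        # 2) Momentum computed from the clique-unbiased gradient.
        v[i] = m * v[i] + g
        # 3) Local update with momentum.
        half_step.append(x[i] - gamma * v[i])
    # 4) Decoupled weight averaging over neighbours (restricted to the
    #    clique here; inter-clique edges are omitted for brevity).
    x = [np.mean([half_step[j] for j in clique_of(i)], axis=0)
         for i in range(n)]
\end{verbatim}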
\begin{figure}[htbp]
\centering
...
...
@@ -290,21 +298,32 @@ In addition, it is important that all nodes are initialized with the same model
\caption{\label{fig:d-cliques-cifar10-convolutional} D-Cliques with Convolutional Network on CIFAR10.}
\end{figure}
\subsection{Comparison to Similar Topologies}
We compare against the following topologies, which have a similar maximum number of hops between nodes but less clustering than D-Cliques (or none at all), and do not use the clique-unbiased gradient (see the sketch after the list):
\begin{itemize}
\item Uniform Diverse Neighbourhood with No Clustering
\item Random network
\item Random Small-World Graph
\end{itemize}
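As a possible starting point for this comparison, the following sketch (relying on the \texttt{networkx} library; the clique size and edge parameters are illustrative assumptions) builds a D-Cliques topology alongside a random graph and a random small-world graph with a comparable edge budget, so that the diameter (maximum hops) and clustering coefficient can be compared directly.
\begin{verbatim}
import networkx as nx

n, clique_size = 100, 10

# D-Cliques reference: fully-connected cliques plus one inter-clique
# edge per pair of cliques (illustrative construction).
cliques = [list(range(i, i + clique_size)) for i in range(0, n, clique_size)]
d_cliques = nx.Graph()
for c in cliques:
    d_cliques.add_edges_from((u, w) for u in c for w in c if u < w)
for a in range(len(cliques)):
    for b in range(a + 1, len(cliques)):
        d_cliques.add_edge(cliques[a][b], cliques[b][a])

edges = d_cliques.number_of_edges()
avg_degree = 2 * edges // n

# Alternatives with a similar edge budget but less clustering.
seed = 1
random_net = nx.gnm_random_graph(n, edges, seed=seed)
while not nx.is_connected(random_net):   # retry until connected
    seed += 1
    random_net = nx.gnm_random_graph(n, edges, seed=seed)
small_world = nx.connected_watts_strogatz_graph(n, avg_degree, p=0.3, seed=1)

for name, g in [("d-cliques", d_cliques), ("random", random_net),
                ("small-world", small_world)]:
    print(name, "diameter:", nx.diameter(g),
          "avg clustering:", nx.average_clustering(g))
\end{verbatim}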
\subsection{Relaxing Clique Connectivity}
\subsection{Effect of Scaling}
Show the effect of scaling to 10, 100, and 1000 nodes (with correspondingly smaller local sample sizes) for the Clique Ring, Hierarchical, and Fully-Connected inter-clique topologies (see the sketch below).
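As a rough illustration of why the inter-clique arrangement matters at scale, the sketch below counts edges for the three arrangements at these sizes; the ``hierarchical'' variant is modelled here as a balanced tree of cliques, which is only one possible interpretation of that arrangement.
\begin{verbatim}
# Rough edge counts for the inter-clique topologies at different scales.
clique_size = 10

def interclique_edges(num_cliques, scheme):
    if num_cliques <= 1:
        return 0
    if scheme == "ring":
        return num_cliques
    if scheme == "hierarchical (tree)":
        return num_cliques - 1
    if scheme == "fully-connected":
        return num_cliques * (num_cliques - 1) // 2

for nodes in (10, 100, 1000):
    c = nodes // clique_size
    intra = c * clique_size * (clique_size - 1) // 2
    for scheme in ("ring", "hierarchical (tree)", "fully-connected"):
        print(nodes, scheme, intra + interclique_edges(c, scheme), "edges")
\end{verbatim}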
\section{Future Work}
\begin{itemize}
\item Non-uniform Class Representation
\item End-to-End Wall-Clock Training Time, including Clique Formation
\item Comparison to Shuffling Data in a Data Center
\item Behaviour in the Presence of Churn
\item Relaxing Clique Connectivity: Randomly choose a subset of clique neighbours to compute average gradient.