@@ -353,7 +353,7 @@ We solve this problem by adding Clique Averaging to D-PSGD (Algorithm~\ref{Algor
\caption{\label{fig:d-clique-mnist-clique-avg} Effect of Clique Averaging on MNIST. Y-axis starts at 89.}
\end{figure}
As illustrated in Figure~\ref{fig:d-clique-mnist-clique-avg}, this significantly reduces the variance between nodes and accelerates convergence. The convergence speed is now essentially identical to that obtained when fully connecting all nodes. The tradeoff is a higher messaging cost, double that without Clique Averaging, and a higher latency for a single training step, which now requires two rounds of messages. Nonetheless, compared to fully connecting all nodes, the total number of messages is reduced by $\approx80\%$. MNIST and a Linear model are relatively simple, so the next section shows how to support a harder dataset and a higher-capacity model.
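
To make the message counts concrete, the following back-of-the-envelope calculation illustrates where figures of this order can come from. The setup is an assumption made only for illustration and is not stated in this section: $n = 100$ nodes split into $10$ cliques of $10$ nodes each, one edge between every pair of cliques, and two messages (one per direction) on each edge per round. Under these assumptions, the number of messages per training step is
\[
M_{\text{full}} = 100 \times 99 = 9900
\]
for a fully connected topology,
\[
M_{\text{cliques}} = 2 \times \left( 10 \times \tfrac{10 \times 9}{2} + \tfrac{10 \times 9}{2} \right) = 2 \times (450 + 45) = 990
\]
without Clique Averaging, and
\[
M_{\text{avg}} = M_{\text{cliques}} + 2 \times 450 = 1890
\]
with Clique Averaging, since the additional round only uses intra-clique edges. This gives $M_{\text{avg}} / M_{\text{cliques}} \approx 1.9$, i.e. roughly double, and $1 - M_{\text{avg}} / M_{\text{full}} \approx 0.81$, which is in line with the reduction of $\approx80\%$ quoted above. Other clique sizes or inter-clique topologies would change these numbers.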
\section{Implementing Momentum with Clique Averaging}