diff --git a/main.tex b/main.tex
index 5fe6a94f4d84d453ab2afc208deb6c06a1041535..1f0cc1a1ae2147100d851745a471955fb1cd151d 100644
--- a/main.tex
+++ b/main.tex
@@ -353,7 +353,7 @@ We solve this problem by adding Clique Averaging to D-PSGD (Algorithm~\ref{Algor
 \caption{\label{fig:d-clique-mnist-clique-avg} Effect of Clique Averaging on MNIST. Y-axis starts at 89.}
 \end{figure}
 
-As illustrated in Figure~\ref{fig:d-clique-mnist-clique-avg}, this significantly reduces variance between nodes and accelerates convergence speed. The convergence speed is now essentially identical to that obtained when fully connecting all nodes. The tradeoff is a higher messaging cost, double to that without clique averaging, and increased latency of a single training step by requiring two rounds of messages. Nonetheless, compared to fully connecting all nodes, the total number of messages is reduced by $\approx 80\%$. MNIST and a Linear model are relatively simple, so the next section shows to work with a harder dataset and a higher capacity model.
+As illustrated in Figure~\ref{fig:d-clique-mnist-clique-avg}, this significantly reduces variance between nodes and accelerates convergence speed. The convergence speed is now essentially identical to that obtained when fully connecting all nodes. The tradeoff is a higher messaging cost, double to that without clique averaging, and increased latency of a single training step by requiring two rounds of messages. Nonetheless, compared to fully connecting all nodes, the total number of messages is reduced by $\approx 80\%$. MNIST and a Linear model are relatively simple, so the next section shows how to support a harder dataset and a higher capacity model.
 
 \section{Implementing Momentum with Clique Averaging}
 \label{section:momentum}