diff --git a/main.tex b/main.tex
index 987f5b7b54e7b485796d9040bd18bfe1afc8483e..56dac9e4c5024e600e7db0e7e02d542011fc23d8 100644
--- a/main.tex
+++ b/main.tex
@@ -130,7 +130,7 @@ speed compared to using denser topologies.
 % privacy protection \cite{amp_dec}.
 In contrast to the IID case however, our experiments demonstrate that \emph{the
 impact of topology is extremely significant for non-IID data}. This phenomenon is illustrated
-in Figure~\ref{fig:iid-vs-non-iid-problem}: we observe that a ring or
+in Figure~\ref{fig:iid-vs-non-iid-problem}: We observe that a ring or
 a grid topology clearly jeopardizes the convergence speed as local
 distributions do not have relative frequency of classes similar to the global
 distribution, i.e. they exhibit \textit{local class bias}. We stress the fact
@@ -380,8 +380,7 @@ been sampled by a node, i.e. an \textit{epoch}. This is equivalent to the classi
 To further make results comparable across different number of nodes, we lower
 the batch size proportionally to the number of nodes added, and inversely,
 e.g. on MNIST, 128 with 100 nodes vs. 13 with 1000 nodes. This
-ensures that the number of model updates and averaging per epoch remains the
-same, which is
+ensures the same number of model updates and averaging per epoch, which is
 important to have a fair comparison.\footnote{Updating and averaging models
 after every example can eliminate the impact of local class bias. However, the
 resulting communication overhead is impractical.}
@@ -483,7 +482,7 @@ topology, namely:
 We refer to Algorithm~\ref{Algorithm:D-Clique-Construction} in the appendix for
 a formal account of D-Cliques construction. We note that it only requires the
 knowledge of the local class distribution at each node. For the sake of
-simplicity, we assume that D-Cliques are constructed from the global
+simplicity, we assume that D-Cliques is constructed from the global
 knowledge of these distributions, which can easily be obtained by
 decentralized averaging in a pre-processing step.
 
@@ -543,12 +542,10 @@ introduced by inter-clique edges. We address this issue in the next section.
 \section{Optimizing with Clique Averaging and Momentum}
 \label{section:clique-averaging-momentum}
 
-In this section, we present Clique Averaging, a feature that we add to
-D-SGD to remove the bias caused by the inter-cliques edges of
-D-Cliques. We then show how this can be used to successfully implement
-momentum
+In this section, we present Clique Averaging. This feature, when added to D-SGD,
+removes the bias caused by the inter-cliques edges of
+D-Cliques. We also show how it can be used to successfully implement momentum
 for non-IID data.
-%AMK: check
 
 \subsection{Clique Averaging: Debiasing Gradients from Inter-Clique Edges}
 \label{section:clique-averaging}
@@ -745,7 +742,7 @@ convergence.
 Crucially, all random topologies fail to converge to a good solution. This
 confirms that our clique structure is important to reduce variance across nodes
 and improve the convergence. The difference with the previous
-experiment appears to be due to both the use of a higher capacity model and to
+experiment seems to be due to both the use of a higher capacity model and to
 the intrinsic characteristics of the datasets.
 % We refer
 % to the appendix for results on MNIST with LeNet.
@@ -855,8 +852,7 @@ number of edges compared to fully connecting individual nodes (18.9 edges on
 average instead of 999) and a 96\% reduction in the number of messages (37.8
 messages per round per node on average instead of 999). We refer to
 Appendix~\ref{app:scaling} for additional results comparing the convergence
-speed across
-different number of nodes. Overall, our results
+speed across different number of nodes. Overall, these results
 show that D-Cliques can nicely scale with the number of nodes.
 
 \begin{figure}[t]
@@ -937,7 +933,7 @@ networks.
 We do not modify the simple and efficient D-SGD
 algorithm \cite{lian2017d-psgd} beyond removing some neighbor
 contributions
-that would otherwise bias the direction of the gradient.
+that otherwise bias the gradient direction.
 % An originality of our approach is to focus on the effect of topology
 % level without significantly changing the original simple and efficient D-SGD