Newer
Older
\section{Small-world Inter-clique Topology}
We present a more detailed and precise explanation the algorithm to establish a small-world
inter-clique topology (Algorithm~\ref{Algorithm:Smallworld}). Algorithm~\ref{Algorithm:Smallworld} instantiates the function
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
\textit{interconnect} with a
small-world inter-clique topology as described in Section~\ref{section:interclique-topologies}. It adds a
linear number of inter-clique edges by first arranging cliques on a ring. It then adds a logarithmic number of ``finger'' edges to other cliques on the ring chosen such that there is a constant number of edges added per set, on sets that are exponentially bigger the further away on the ring. ``Finger'' edges are added symmetrically on both sides of the ring to the cliques in each set that are closest to a given set.
\begin{algorithm}[h]
\caption{$\textit{smallworld}(DC)$: adds $O(\# N \log(\# N))$ edges}
\label{Algorithm:Smallworld}
\begin{algorithmic}[1]
\STATE \textbf{Require:} set of cliques $DC$ (set of set of nodes)
\STATE ~~size of neighborhood $ns$ (default 2)
\STATE ~~function $\textit{least\_edges}(S, E)$ that returns one of the nodes in $S$ with the least number of edges in $E$
\STATE $E \leftarrow \emptyset$ \COMMENT{Set of Edges}
\STATE $L \leftarrow [ C~\text{for}~C \in DC ]$ \COMMENT{Arrange cliques in a list}
\FOR{$i \in \{1,\dots,\#DC\}$} % \COMMENT{For every clique}
% %\STATE ~\COMMENT{For sets of cliques exponentially further away from $i$}
\FOR{$\textit{offset} \in \{ 2^x~\text{for}~x~\in \{ 0, \dots, \lceil \log_2(\#DC) \rceil \} \}$}
% \STATE \COMMENT{Pick the $ns$ closest}
\FOR{$k \in \{0,\dots,ns-1\}$}
% \STATE \COMMENT{Add inter-clique connections in both directions}
\STATE $n \leftarrow \textit{least\_edges}(L_i, E)$
\STATE $m \leftarrow \textit{least\_edges}(L_{(i+\textit{offset}+k) \% \#DC}, E)$ %\COMMENT{clockwise in ring}
\STATE $E \leftarrow E \cup \{ \{n,m\} \}$
\STATE $n \leftarrow \textit{least\_edges}(L_i, E)$
\STATE $m \leftarrow \textit{least\_edges}(L_{(i-\textit{offset}-k)\% \#DC} , E)$ %\COMMENT{counter-clockwise in ring}
\STATE $E \leftarrow E \cup \{ \{n,m\} \}$
\ENDFOR
\ENDFOR
\ENDFOR
\RETURN E
\end{algorithmic}
\end{algorithm}
Algorithm~\ref{Algorithm:Smallworld} expects a set of cliques $DC$, previously computed by
Algorithm~\ref{Algorithm:D-Clique-Construction}; a size of neighborhood $ns$,
which is the number of finger edges to add per set of cliques, and a function
\textit{least\_edges}, which given a set of nodes $S$ and an existing set of
edges $E = \{\{i,j\}, \dots \}$, returns one of the nodes in $E$ with the least number of edges. It returns a new set of edges $\{\{i,j\}, \dots \}$ with all edges added by the small-world topology.
The implementation first arranges the cliques of $DC$ in a list, which
represents the ring. Traversing the list with increasing indices is equivalent
to traversing the ring in the clockwise direction, and inversely. Then, for every clique $i$ on the ring from which we are computing the distance to others, a number of edges are added. All other cliques are implicitly arranged in mutually exclusive sets, with size and at offset exponentially bigger (doubling at every step). Then for every of these sets, $ns$ edges are added, both in the clockwise and counter-clockwise directions, always on the nodes with the least number of edges in each clique. The ring edges are implicitly added to the cliques at offset $1$ in both directions.
% \section{Additional Experiments}
% \subsection{Effect of Clique Averaging and Uniform Initialization}
% Section~\ref{section:clique-averaging} explained how Clique Averaging reduces bias and showed that Clique Averaging was significantly beneficial on MNIST with fully-connected D-Cliques. In this section, we provide additional results for the ring topology, as well as for CIFAR10. In addition, during our early exploration, we noticed that ensuring \textit{uniform initialization}, i.e. ensuring that all nodes start with the same model, increased convergence speed when connecting two cliques with 1-2 interclique edges. We therefore also verify whether this effect is still significant with 10 cliques (100 nodes), on a ring and with full connections between cliques, as well as on MNIST and CIFAR10. We also verified what interaction this had with Clique Averaging.
% Figure~\ref{fig:d-cliques-mnist-init-clique-avg-effect} shows all the results for MNIST. Comparing Figure~\ref{fig:d-cliques-mnist-init-clique-avg-effect-ring-test-accuracy} to~\ref{fig:d-cliques-mnist-no-init-clique-avg-effect-ring-test-accuracy}, and Figure~\ref{fig:d-cliques-mnist-init-clique-avg-effect-fcc-test-accuracy} to~\ref{fig:d-cliques-mnist-no-init-clique-avg-effect-fcc-test-accuracy} together, we see that Uniform Initialization has imperceptible effects. However, for all four sub-figures, using Clique Averaging has a slightly better average convergence speed, and significantly lower variance between nodes, than not using it. Moreover, the improvement is larger with Fully-Connected D-Cliques.
% \begin{figure}[htbp]
% \centering
% % To regenerate the figure, from directory results/mnist
% % python ../../../learn-topology/tools/plot_convergence.py clique-ring/all/2021-03-10-18:14:35-CET no-clique-avg/clique-ring/all/2021-03-12-10:40:37-CET --add-min-max --yaxis test-accuracy --labels 'with clique avg.' 'without clique avg.' --legend 'lower right' --ymin 85 --ymax 92.5 --save-figure ../../figures/d-cliques-mnist-init-clique-avg-effect-ring-test-accuracy.png
% \begin{subfigure}[b]{0.48\textwidth}
% \centering
% \includegraphics[width=\textwidth]{figures/d-cliques-mnist-init-clique-avg-effect-ring-test-accuracy}
% \caption{\label{fig:d-cliques-mnist-init-clique-avg-effect-ring-test-accuracy} D-Cliques (Ring), with Uniform Initialization}
% \end{subfigure}
% \quad
% % To regenerate the figure, from directory results/mnist
% % python ../../../learn-topology/tools/plot_convergence.py no-init/clique-ring/all/2021-03-12-10:40:11-CET no-init-no-clique-avg/clique-ring/all/2021-03-12-10:41:03-CET --add-min-max --yaxis test-accuracy --labels 'with clique avg.' 'without clique avg.' --legend 'lower right' --ymin 85 --ymax 92.5 --save-figure ../../figures/d-cliques-mnist-no-init-clique-avg-effect-ring-test-accuracy.png
% \begin{subfigure}[b]{0.48\textwidth}
% \centering
% \includegraphics[width=\textwidth]{figures/d-cliques-mnist-no-init-clique-avg-effect-ring-test-accuracy}
% \caption{\label{fig:d-cliques-mnist-no-init-clique-avg-effect-ring-test-accuracy} D-Cliques (Ring), without Uniform Initialization}
% \end{subfigure}
% % To regenerate the figure, from directory results/mnist
% %python ../../../learn-topology/tools/plot_convergence.py fully-connected-cliques/all/2021-03-10-10:19:44-CET no-clique-avg/fully-connected-cliques/all/2021-03-12-11:12:26-CET --add-min-max --yaxis test-accuracy --labels 'with clique avg.' 'without clique avg.' --legend 'lower right' --ymin 85 --ymax 92.5 --save-figure ../../figures/d-cliques-mnist-init-clique-avg-effect-fcc-test-accuracy.png
% \begin{subfigure}[b]{0.48\textwidth}
% \centering
% \includegraphics[width=\textwidth]{figures/d-cliques-mnist-init-clique-avg-effect-fcc-test-accuracy}
% \caption{\label{fig:d-cliques-mnist-init-clique-avg-effect-fcc-test-accuracy} D-Cliques (Fully-Connected), with Uniform Initialization}
% \end{subfigure}
% \quad
% % To regenerate the figure, from directory results/mnist
% %python ../../../learn-topology/tools/plot_convergence.py no-init/fully-connected-cliques/all/2021-03-12-11:12:01-CET no-init-no-clique-avg/fully-connected-cliques/all/2021-03-12-11:12:49-CET --add-min-max --yaxis test-accuracy --labels 'with clique avg.' 'without clique avg.' --legend 'lower right' --ymin 85 --ymax 92.5 --save-figure ../../figures/d-cliques-mnist-no-init-clique-avg-effect-fcc-test-accuracy.png
% \begin{subfigure}[b]{0.48\textwidth}
% \centering
% \includegraphics[width=\textwidth]{figures/d-cliques-mnist-no-init-clique-avg-effect-fcc-test-accuracy}
% \caption{\label{fig:d-cliques-mnist-no-init-clique-avg-effect-fcc-test-accuracy} D-Cliques (Fully-Connected), without Uniform Initialization}
% \end{subfigure}
% \caption{\label{fig:d-cliques-mnist-init-clique-avg-effect} MNIST: Effects of Clique Averaging and Uniform Initialization on Convergence Speed. (100 nodes, non-IID, D-Cliques, bsz=128)}
% \end{figure}
% Figure~\ref{fig:d-cliques-cifar10-init-clique-avg-effect} shows all the results for CIFAR10. One the one hand, with D-Cliques arranged in a ring, uniform initialization has a small but positive effect on convergence speed, whether Clique Averaging is used or not. With fully-connected D-Cliques, the effect is significantly smaller and almost negligible, both with and without Clique Averaging. On the other hand, Clique Averaging is always beneficial, by a significantly larger margin for both interclique topologies and with and without uniform initialization. Moreover, the effect is stronger than for MNIST.
% \begin{figure}[htbp]
% \centering
% % To regenerate the figure, from directory results/cifar10
% % python ../../../learn-topology/tools/plot_convergence.py clique-ring/all/2021-03-10-11:58:43-CET no-init/clique-ring/all/2021-03-13-18:28:30-CET no-clique-avg/clique-ring/all/2021-03-13-18:27:09-CET no-init-no-clique-avg/clique-ring/all/2021-03-13-18:29:58-CET --add-min-max --yaxis test-accuracy --labels 'with clique avg., with uniform init.' 'with clique avg., without uniform init.' 'without clique avg., with uniform init.' 'without clique avg., without uniform init.' --legend 'upper left' --ymax 115 --save-figure ../../figures/d-cliques-cifar10-init-clique-avg-effect-ring-test-accuracy.png --font-size 15
% \begin{subfigure}[b]{0.48\textwidth}
% \centering
% \includegraphics[width=\textwidth]{figures/d-cliques-cifar10-init-clique-avg-effect-ring-test-accuracy}
% \caption{\label{fig:d-cliques-cifar10-init-clique-avg-effect-ring-test-accuracy} D-Cliques (Ring)}
% \end{subfigure}
% % To regenerate the figure, from directory results/cifar10
% %python ../../../learn-topology/tools/plot_convergence.py fully-connected-cliques/all/2021-03-10-13:58:57-CET no-init/fully-connected-cliques/all/2021-03-13-18:32:55-CET no-clique-avg/fully-connected-cliques/all/2021-03-13-18:31:36-CET no-init-no-clique-avg/fully-connected-cliques/all/2021-03-13-18:34:35-CET --add-min-max --yaxis test-accuracy --labels 'with clique avg., with uniform init.' 'with clique avg., without uniform init.' 'without clique avg., with uniform init.' 'without clique avg., without uniform init.' --legend 'upper left' --ymax 115 --save-figure ../../figures/d-cliques-cifar10-init-clique-avg-effect-fcc-test-accuracy.png --font-size 15
% \begin{subfigure}[b]{0.48\textwidth}
% \centering
% \includegraphics[width=\textwidth]{figures/d-cliques-cifar10-init-clique-avg-effect-fcc-test-accuracy}
% \caption{\label{fig:d-cliques-cifar10-init-clique-avg-effect-fcc-test-accuracy} D-Cliques (Fully-Connected)}
% \end{subfigure}
% \caption{\label{fig:d-cliques-cifar10-init-clique-avg-effect} CIFAR10: Effects of Clique Averaging and Uniform Initialization on Convergence Speed. (100 nodes, non-IID, D-Cliques, bsz=20, momentum=0.9)}
% \end{figure}
% We conclude that Uniform Initialization is not so important for convergence speed but that Clique Averaging is always significantly so.
% \subsection{Comparison to Non-Clustered Topologies}
%
% \begin{figure}
%\centering
% \begin{subfigure}[htb]{0.48\textwidth}
%% To regenerate the figure, from directory results/mnist/gn-lenet
%% python ../../../../learn-topology/tools/plot_convergence.py no-init/all/2021-03-22-21:39:54-CET no-init-no-clique-avg/all/2021-03-22-21:40:16-CET random-10/all/2021-03-22-21:41:06-CET random-10-diverse/all/2021-03-22-21:41:46-CET random-10-diverse-unbiased-grad/all/2021-03-22-21:42:04-CET --legend 'lower right' --add-min-max --labels 'd-clique (fcc) clique avg.' 'd-clique (fcc) no clique avg.' '10 random edges' '10 random edges (all classes repr.)' '10 random edges (all classes repr.) with unbiased grad.' --ymin 80 --yaxis test-accuracy --save-figure ../../../figures/d-cliques-mnist-lenet-comparison-to-non-clustered-topologies.png
% \includegraphics[width=\textwidth]{figures/d-cliques-mnist-lenet-comparison-to-non-clustered-topologies}
% \caption{\label{fig:d-cliques-mnist-lenet-comparison-to-non-clustered-topologies} LeNet Model}
% \end{subfigure}
% \hfill
% \begin{subfigure}[htb]{0.48\textwidth}
%% To regenerate the figure, from directory results/mnist/gn-lenet
%% python ../../../../learn-topology/tools/plot_convergence.py no-init/all/2021-03-22-21:39:54-CET no-init-no-clique-avg/all/2021-03-22-21:40:16-CET random-10/all/2021-03-22-21:41:06-CET random-10-diverse/all/2021-03-22-21:41:46-CET random-10-diverse-unbiased-grad/all/2021-03-22-21:42:04-CET --legend 'upper right' --add-min-max --labels 'd-clique (fcc) clique avg.' 'd-clique (fcc) no clique avg.' '10 random edges' '10 random edges (all classes repr.)' '10 random edges (all classes repr.) with unbiased grad.' --ymax 0.7 --yaxis scattering --save-figure ../../../figures/d-cliques-mnist-lenet-comparison-to-non-clustered-topologies-scattering.png
% \includegraphics[width=\textwidth]{figures/d-cliques-mnist-lenet-comparison-to-non-clustered-topologies-scattering}
% \caption{\label{fig:d-cliques-mnist-lenet-comparison-to-non-clustered-topologies-scattering} LeNet Model (Scattering)}
% \end{subfigure}
%
% \caption{\label{fig:d-cliques-mnist-comparison-to-non-clustered-topologies} MNIST: Comparison to non-Clustered Topologies}
%\end{figure}
%
% \begin{figure}
% \centering
% % To regenerate the figure, from directory results/cifar10
%% python ../../../learn-topology/tools/plot_convergence.py fully-connected-cliques/all/2021-03-10-13:58:57-CET no-init-no-clique-avg/fully-connected-cliques/all/2021-03-13-18:34:35-CET random-10/all/2021-03-17-20:30:03-CET random-10-diverse/all/2021-03-17-20:30:41-CET random-10-diverse-unbiased-gradient/all/2021-03-17-20:31:14-CET random-10-diverse-unbiased-gradient-uniform-init/all/2021-03-17-20:31:41-CET --labels 'd-clique (fcc) clique avg., uniform init.' 'd-clique (fcc) no clique avg. no uniform init.' '10 random edges' '10 random edges (all classes repr.)' '10 random (all classes repr.) with unbiased grad.' '10 random (all classes repr.) with unbiased grad., uniform init.' --add-min-max --legend 'upper right' --yaxis scattering --save-figure ../../figures/d-cliques-cifar10-linear-comparison-to-non-clustered-topologies-scattering.png --ymax 0.7
% \begin{subfigure}[b]{0.48\textwidth}
% \centering
% \includegraphics[width=\textwidth]{figures/d-cliques-cifar10-linear-comparison-to-non-clustered-topologies-scattering}
% \caption{\label{fig:d-cliques-cifar10-linear-comparison-to-non-clustered-topologies-scattering} LeNet Model: Scattering}
% \end{subfigure}
%
%\caption{\label{fig:d-cliques-cifar10-linear-comparison-to-non-clustered-topologies} CIFAR10: Comparison to non-Clustered Topologies}
%\end{figure}
%
%
%\begin{itemize}
% \item Clustering does not seem to make a difference in MNIST, even when using a higher-capacity model (LeNet) instead of a linear model. (Fig.\ref{fig:d-cliques-mnist-comparison-to-non-clustered-topologies})
% \item Except for the random 10 topology, convergence speed seems to be correlated with scattering in CIFAR-10 with LeNet model (Fig.\ref{fig:d-cliques-cifar10-linear-comparison-to-non-clustered-topologies}). There is also more difference between topologies both in convergence speed and scattering than for MNIST (Fig.~\ref{fig:d-cliques-mnist-comparison-to-non-clustered-topologies}). Scattering computed similar to Consensus Control for Decentralized Deep Learning~\cite{consensus_distance}.
%\end{itemize}
%
%
% \clearpage
\section{Additional Experiments on Scaling Behavior with Increasing Number of
Nodes}
\label{app:scaling}
Section~\ref{section:interclique-topologies} compares the convergence speed of various inter-clique topologies at a scale of 1000 nodes. In this section, we show the effect of scaling the number of nodes, by comparing the convergence speed with 1, 10, 100, and 1000 nodes, and adjusting the batch size to maintain a constant number of updates per epoch. We present results for Ring, Fractal, Small-world, and Fully-Connected inter-clique topologies.
Figure~\ref{fig:d-cliques-mnist-scaling-fully-connected} shows the results for
MNIST. For all topologies, we notice a perfect scaling up to 100 nodes, i.e.
the accuracy curves overlap, with low variance between nodes. Starting at 1000
nodes, there is a significant increase in variance between nodes and the
convergence is slower, only marginally for Fully-Connected but
significantly so for Fractal and Ring. Small-world has higher variance between nodes but maintains a convergence speed close to that of Fully-Connected.
Figure~\ref{fig:d-cliques-cifar10-scaling-fully-connected} shows the results
for CIFAR10. When increasing from 1 to 10 nodes (resulting in a single
fully-connected clique), there is actually a small increase both in final
accuracy and convergence speed. We believe this increase is due to the
gradient being computed with exactly the same number of examples from all
classes with 10 fully-connected non-IID nodes, while the gradient for a single
non-IID node may have a slightly larger bias because the random sampling does
not guarantee the representation of all classes perfectly in each batch. At a
scale of 100 nodes, there is no difference between Fully-Connected and
Fractal, as the connections are the same; however, a Ring already shows a
significantly slower convergence. At 1000 nodes, the convergence significantly
slows down for Fractal and Ring, while remaining close, albeit with a larger
variance, for Fully-Connected. Similar to MNIST, Small-world has
higher variance and slightly lower convergence speed than Fully-Connected but
remains very close.
We therefore conclude that Fully-Connected and Small-world have good scaling
properties in terms of convergence speed, and that the
linear-logarithmic number of edges of Small-world makes it the best compromise
between convergence speed and connectivity, and thus the best choice for
efficient large-scale decentralized learning in practice.
\begin{figure}[htbp]
\centering
% To regenerate the figure, from directory results/scaling
% python ../../../learn-topology/tools/plot_convergence.py ../mnist/1-node-iid/all/2021-03-10-09:20:03-CET 10/mnist/fully-connected-cliques/all/2021-03-12-09:13:27-CET ../mnist/fully-connected-cliques/all/2021-03-10-10:19:44-CET 1000/mnist/fully-connected-cliques/all/2021-03-14-17:56:26-CET --labels '1 node IID bsz=12800' '10 nodes bsz=1280' '100 nodes bsz=128' '1000 nodes bsz=13' --legend 'lower right' --yaxis test-accuracy --save-figure ../../figures/d-cliques-mnist-scaling-fully-connected-cst-updates.png --ymin 80 --add-min-max
\begin{subfigure}[b]{0.35\textwidth}
\centering
\includegraphics[width=\textwidth]{../figures/d-cliques-mnist-scaling-fully-connected-cst-updates}
\caption{Fully-Connected}
\end{subfigure}
\quad
% To regenerate the figure, from directory results/scaling
% python ../../../learn-topology/tools/plot_convergence.py ../mnist/1-node-iid/all/2021-03-10-09:20:03-CET 10/mnist/fully-connected-cliques/all/2021-03-12-09:13:27-CET ../mnist/smallworld-logn-cliques/all/2021-03-23-21:44:56-CET 1000/mnist/smallworld-logn-cliques/all/2021-03-23-21:45:39-CET --labels '1 node IID bsz=12800' '10 nodes bsz=1280' '100 nodes bsz=128' '1000 nodes bsz=13' --legend 'lower right' --yaxis test-accuracy --save-figure ../../figures/d-cliques-mnist-scaling-smallworld-cst-updates.png --ymin 80 --add-min-max
\begin{subfigure}[b]{0.35\textwidth}
\centering
\includegraphics[width=\textwidth]{../figures/d-cliques-mnist-scaling-smallworld-cst-updates}
\caption{Small-world}
\end{subfigure}
\quad
% To regenerate the figure, from directory results/scaling
% python ../../../learn-topology/tools/plot_convergence.py ../mnist/1-node-iid/all/2021-03-10-09:20:03-CET 10/mnist/clique-ring/all/2021-03-13-18:22:01-CET ../mnist/fully-connected-cliques/all/2021-03-10-10:19:44-CET 1000/mnist/fractal-cliques/all/2021-03-14-17:41:59-CET --labels '1 node IID bsz=12800' '10 nodes bsz=1280' '100 nodes bsz=128' '1000 nodes bsz=13' --legend 'lower right' --yaxis test-accuracy --save-figure ../../figures/d-cliques-mnist-scaling-fractal-cliques-cst-updates.png --ymin 80 --add-min-max
\begin{subfigure}[b]{0.35\textwidth}
\centering
\includegraphics[width=\textwidth]{../figures/d-cliques-mnist-scaling-fractal-cliques-cst-updates}
\caption{Fractal}
\end{subfigure}
\quad
% To regenerate the figure, from directory results/scaling
% python ../../../learn-topology/tools/plot_convergence.py ../mnist/1-node-iid/all/2021-03-10-09:20:03-CET 10/mnist/clique-ring/all/2021-03-13-18:22:01-CET ../mnist/clique-ring/all/2021-03-10-18:14:35-CET 1000/mnist/clique-ring/all/2021-03-13-18:22:36-CET --labels '1 node IID bsz=12800' '10 nodes bsz=1280' '100 nodes bsz=128' '1000 nodes bsz=13' --legend 'lower right' --yaxis test-accuracy --save-figure ../../figures/d-cliques-mnist-scaling-clique-ring-cst-updates.png --ymin 80 --add-min-max
\begin{subfigure}[b]{0.35\textwidth}
\centering
\includegraphics[width=\textwidth]{../figures/d-cliques-mnist-scaling-clique-ring-cst-updates}
\caption{Ring}
\end{subfigure}
\caption{\label{fig:d-cliques-mnist-scaling-fully-connected} MNIST:
D-Cliques scaling behavior (constant updates per epoch) for different
inter-clique topologies.}
\end{figure}
\begin{figure}[htbp]
\centering
% To regenerate the figure, from directory results/scaling
% python ../../../learn-topology/tools/plot_convergence.py ../cifar10/1-node-iid/all/2021-03-10-13:52:58-CET 10/cifar10/fully-connected-cliques/all/2021-03-13-19:06:02-CET ../cifar10/fully-connected-cliques/all/2021-03-10-13:58:57-CET 1000/cifar10/fully-connected-cliques/all/2021-03-14-17:41:20-CET --labels '1 node IID bsz=2000' '10 nodes non-IID bsz=200' '100 nodes non-IID bsz=20' '1000 nodes non-IID bsz=2' --legend 'lower right' --yaxis test-accuracy --save-figure ../../figures/d-cliques-cifar10-scaling-fully-connected-cst-updates.png --add-min-max
\begin{subfigure}[b]{0.35\textwidth}
\centering
\includegraphics[width=\textwidth]{../figures/d-cliques-cifar10-scaling-fully-connected-cst-updates}
\caption{Fully-Connected}
\end{subfigure}
\quad
% python ../../../learn-topology/tools/plot_convergence.py ../cifar10/1-node-iid/all/2021-03-10-13:52:58-CET 10/cifar10/fully-connected-cliques/all/2021-03-13-19:06:02-CET ../cifar10/smallworld-logn-cliques/all/2021-03-23-22:13:23-CET 1000/cifar10/smallworld-logn-cliques/all/2021-03-23-22:13:57-CET --labels '1 node IID bsz=2000' '10 nodes non-IID bsz=200' '100 nodes non-IID bsz=20' '1000 nodes non-IID bsz=2' --legend 'lower right' --yaxis test-accuracy --save-figure ../../figures/d-cliques-cifar10-scaling-smallworld-cst-updates.png --add-min-max
\begin{subfigure}[b]{0.35\textwidth}
\centering
\includegraphics[width=\textwidth]{../figures/d-cliques-cifar10-scaling-smallworld-cst-updates}
\caption{Small-world}
\end{subfigure}
% To regenerate the figure, from directory results/scaling
% python ../../../learn-topology/tools/plot_convergence.py ../cifar10/1-node-iid/all/2021-03-10-13:52:58-CET 10/cifar10/fully-connected-cliques/all/2021-03-13-19:06:02-CET ../cifar10/fully-connected-cliques/all/2021-03-10-13:58:57-CET 1000/cifar10/fractal-cliques/all/2021-03-14-17:42:46-CET --labels '1 node IID bsz=2000' '10 nodes non-IID bsz=200' '100 nodes non-IID bsz=20' '1000 nodes non-IID bsz=2' --legend 'lower right' --yaxis test-accuracy --save-figure ../../figures/d-cliques-cifar10-scaling-fractal-cliques-cst-updates.png --add-min-max
\begin{subfigure}[b]{0.35\textwidth}
\centering
\includegraphics[width=\textwidth]{../figures/d-cliques-cifar10-scaling-fractal-cliques-cst-updates}
\caption{Fractal}
\end{subfigure}
\quad
% To regenerate the figure, from directory results/scaling
% python ../../../learn-topology/tools/plot_convergence.py ../cifar10/1-node-iid/all/2021-03-10-13:52:58-CET 10/cifar10/fully-connected-cliques/all/2021-03-13-19:06:02-CET ../cifar10/clique-ring/all/2021-03-10-11:58:43-CET 1000/cifar10/clique-ring/all/2021-03-14-09:55:24-CET --labels '1 node IID bsz=2000' '10 nodes non-IID bsz=200' '100 nodes non-IID bsz=20' '1000 nodes non-IID bsz=2' --legend 'lower right' --yaxis test-accuracy --save-figure ../../figures/d-cliques-cifar10-scaling-clique-ring-cst-updates.png --add-min-max
\begin{subfigure}[b]{0.35\textwidth}
\centering
\includegraphics[width=\textwidth]{../figures/d-cliques-cifar10-scaling-clique-ring-cst-updates}
\caption{Ring}
\end{subfigure}
\caption{\label{fig:d-cliques-cifar10-scaling-fully-connected} CIFAR10: D-Cliques scaling behavior (constant updates per epoch) for different
inter-clique topologies.}
\end{figure}
%% % To regenerate the figure, from directory results/scaling
%%% python ../../../learn-topology/tools/plot_convergence.py 10/mnist/fully-connected-cliques/all/2021-03-10-14:40:35-CET ../mnist/fully-connected-cliques/all/2021-03-10-10:19:44-CET 1000/mnist/fully-connected-cliques/all/2021-03-10-16:44:35-CET --labels '10 nodes bsz=128' '100 nodes bsz=128' '1000 nodes bsz=128 (45)' --legend 'lower right' --yaxis test-accuracy --save-figure ../../figures/d-cliques-mnist-scaling-fully-connected-cst-bsz.png --ymin 80 --add-min-max
% \begin{figure}[htbp]
% \centering
% \includegraphics[width=0.48\textwidth]{figures/d-cliques-mnist-scaling-fully-connected-cst-bsz}
% \caption{FCC: Constant Batch-Size}
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
% \end{figure}
\section{Additional Experiments with Extreme Node Skew}
\label{app:extreme-local-skew}
In this Section, we present additional results for similar experiments as in Section~\ref{section:evaluation} but in the presence of
\textit{extreme local class bias}: we consider that each node only has examples from a single class. This extreme partitioning case provides an upper bound on the effect of label distribution skew suggesting that D-Cliques should perform similarly or better in less extreme cases, as long as a small-enough average skew can be obtained on all cliques. In turn, this helps to provide insights on why D-Cliques work well, as well as to quantify the loss in convergence speed
that may result from using construction algorithms that generate cliques with higher skew.
\subsection{Non-IID assumptions.}
\label{section:non-iid-assumptions}
To isolate the effect of local class bias from other potentially compounding
factors, we make the following simplifying assumptions: (1) All classes are
equally represented in the global dataset; (2) All classes are represented on
the same number of nodes; (3) All nodes have the same number of examples.
While less realistic than the assumptions used Section~\ref{section:evaluation},
these assumptions are still reasonable because: (1) Global class imbalance equally
affects the optimization process on a single node and is therefore not
specific to the decentralized setting; (2) Our results do not exploit specific
positions in the topology; (3) Imbalanced dataset sizes across nodes can be
addressed for instance by appropriately weighting the individual loss
functions.
These assumptions do make the construction of cliques actually easier by
making it easy to build cliques that have zero skew, as shown in
Section~\ref{section:ideal-cliques}.
\subsection{Constructing Ideal Cliques}
\label{section:ideal-cliques}
Algorithm~\ref{Algorithm:D-Clique-Construction} shows the overall approach
for constructing a D-Cliques topology under the assumptions of Section~\ref{section:non-iid-assumptions}.\footnote{An IID
version of D-Cliques, in which each node has an equal number of examples of
all classes, can be implemented by picking $\#L$ nodes per clique at random.}
It expects the following inputs: $L$, the set of all classes present in the global distribution $D = \bigcup_{i \in N} D_i$; $N$, the set of all nodes; a function $classes(S)$, which given a subset $S$ of nodes in $N$ returns the set of classes in their joint local distributions ($D_S = \bigcup_{i \in S} D_i$); a function $intraconnect(DC)$, which given $DC$, a set of cliques (set of set of nodes), creates a set of edges ($\{\{i,j\}, \dots \}$) connecting all nodes within each clique to one another; a function $interconnect(DC)$, which given a set of cliques, creates a set of edges ($\{\{i,j\}, \dots \}$) connecting nodes belonging to different cliques; and a function $weigths(E)$, which given a set of edges, returns the weighted matrix $W_{ij}$. Algorithm~\ref{Algorithm:D-Clique-Construction} returns both $W_{ij}$, for use in D-SGD (Algorithm~\ref{Algorithm:D-PSGD} and~\ref{Algorithm:Clique-Unbiased-D-PSGD}), and $DC$, for use with Clique Averaging (Algorithm~\ref{Algorithm:Clique-Unbiased-D-PSGD}).
\begin{algorithm}[h]
\caption{D-Cliques Construction}
\label{Algorithm:D-Clique-Construction}
\begin{algorithmic}[1]
\STATE \textbf{Require:} set of classes globally present $L$,
\STATE~~ set of all nodes $N = \{ 1, 2, \dots, n \}$,
\STATE~~ fn $\textit{classes}(S)$ that returns the classes present in a subset of nodes $S$,
\STATE~~ fn $\textit{intraconnect}(DC)$ that returns edges intraconnecting cliques of $DC$,
\STATE~~ fn $\textit{interconnect}(DC)$ that returns edges interconnecting cliques of $DC$ (Sec.~\ref{section:interclique-topologies})
\STATE~~ fn $\textit{weights}(E)$ that assigns weights to edges in $E$
\STATE $R \leftarrow \{ n~\text{for}~n \in N \}$ \COMMENT{Remaining nodes}
\STATE $DC \leftarrow \emptyset$ \COMMENT{D-Cliques}
\STATE $\textit{C} \leftarrow \emptyset$ \COMMENT{Current Clique}
\WHILE{$R \neq \emptyset$}
\STATE $n \leftarrow \text{pick}~1~\text{from}~\{ m \in R | \textit{classes}(\{m\}) \subsetneq \textit{classes}(\textit{C}) \}$
\STATE $R \leftarrow R \setminus \{ n \}$
\STATE $C \leftarrow C \cup \{ n \}$
\IF{$\textit{classes}(C) = L$}
\STATE $DC \leftarrow DC \cup \{ C \}$
\STATE $C \leftarrow \emptyset$
\ENDIF
\ENDWHILE
\RETURN $(weights(\textit{intraconnect}(DC) \cup \textit{interconnect}(DC)), DC)$
\end{algorithmic}
\end{algorithm}
The implementation builds a single clique by adding nodes with different
classes until all classes of the global distribution are represented. Each
clique is built sequentially until all nodes are parts of cliques.
Because all classes are represented on an equal number of nodes, all cliques
will have nodes of all classes. Furthermore, since nodes have examples
of a single class, we are guaranteed a valid assignment is possible in a greedy manner.
After cliques are created, edges are added and weights are assigned to edges,
using the corresponding input functions.
\subsection{Evaluation}
\label{section:ideal-cliques-evaluation}
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
\subsubsection{Convergence Speed of D-Cliques Compared to Fully-Connected}
% From directory 'results-v2':
% MNIST
% python $TOOLS/analyze/filter.py all --dataset:name mnist --topology:name fully-connected d-cliques/ideal --nodes:name max-local-skew --meta:seed 1 --nodes:nb-nodes 100 | python $TOOLS/analyze/diff.py
% python $TOOLS/analyze/diff.py --rundirs all/2021-09-28-12:39:00-CEST-labostrex117 all/2021-09-28-23:18:40-CEST-labostrex118 --pass-through | python $TOOLS/plot/convergence.py --add-min-max --ymin 80 --ymax 92.5 --yaxis test-accuracy --labels 'fully-connected' 'd-cliques (fc) w/ cliq-avg' --save-figure ../mlsys2022style/figures/convergence-speed-mnist-dc-fc-vs-fc-1-class-per-node.png --linestyles 'solid' 'dashed'
% CIFAR10
% python $TOOLS/analyze/filter.py all --dataset:name cifar10 --topology:name fully-connected d-cliques/ideal --nodes:name max-local-skew --meta:seed 1 --nodes:nb-nodes 100 | python $TOOLS/analyze/diff.py
% python $TOOLS/analyze/diff.py --rundirs all/2021-10-03-16:09:21-CEST-labostrex112 all/2021-10-03-19:45:14-CEST-labostrex118 --pass-through | python $TOOLS/plot/convergence.py --add-min-max --ymin 0 --ymax 100 --yaxis test-accuracy --labels 'fully-connected w/ momentum' 'd-cliques (fc) w/ cliq-avg and momentum' --save-figure ../mlsys2022style/figures/convergence-speed-cifar10-dc-fc-vs-fc-1-class-per-node.png --linestyles 'solid' 'dashed' --legend 'lower right'
\begin{figure}[htbp]
\centering
\begin{subfigure}[b]{0.35\textwidth}
\centering
\includegraphics[width=\textwidth]{figures/convergence-speed-mnist-dc-fc-vs-fc-1-class-per-node}
\caption{\label{fig:convergence-speed-mnist-dc-fc-vs-fc-1-class-per-node} MNIST}
\end{subfigure}
\hfill
\begin{subfigure}[b]{0.35\textwidth}
\centering
\includegraphics[width=\textwidth]{figures/convergence-speed-cifar10-dc-fc-vs-fc-1-class-per-node}
\caption{\label{fig:convergence-speed-cifar10-dc-fc-vs-fc-1-class-per-node} CIFAR10}
\end{subfigure}
\caption{\label{fig:convergence-speed-dc-vs-fc-1-class-per-node} Convergence Speed of D-Cliques Compared to Fully-Connected on 100 Nodes (1 class/node).}
\end{figure}
\subsubsection{Effect of Removing Intra-clique Edges}
% From directory 'results-v2':
% MNIST
% python $TOOLS/analyze/filter.py all --dataset:name mnist --topology:name d-cliques/ideal --nodes:name max-local-skew --meta:seed 1 --nodes:nb-nodes 100 | python $TOOLS/analyze/diff.py
% w/o Clique Gradient
% python $TOOLS/analyze/diff.py --rundirs all/2021-09-29-03:52:47-CEST-labostrex118 all/2021-10-02-21:26:18-CEST-labostrex113 all/2021-10-03-06:33:52-CEST-labostrex113 --pass-through | python $TOOLS/plot/convergence.py --add-min-max --ymin 89 --ymax 92.5 --yaxis test-accuracy --labels 'full intra-connectivity' '-1 edge/clique' '-5 edges/clique' --save-figure ../mlsys2022style/figures/d-cliques-ideal-wo-clique-avg-impact-of-edge-removal.png --linestyles 'solid' 'dashed' 'dotted' --font-size 18
% w/ Clique Gradient
% python $TOOLS/analyze/diff.py --rundirs all/2021-09-28-23:18:40-CEST-labostrex118 all/2021-10-02-16:50:53-CEST-labostrex113 all/2021-10-03-02:00:23-CEST-labostrex113 --pass-through | python $TOOLS/plot/convergence.py --add-min-max --ymin 89 --ymax 92.5 --yaxis test-accuracy --labels 'full intra-connectivity' '-1 edge/clique' '-5 edges/clique' --save-figure ../mlsys2022style/figures/d-cliques-ideal-w-clique-avg-impact-of-edge-removal.png --linestyles 'solid' 'dashed' 'dotted' --font-size 18
\begin{figure}[t]
\centering
\begin{subfigure}[htbp]{0.23\textwidth}
\centering
\includegraphics[width=\textwidth]{figures/d-cliques-ideal-wo-clique-avg-impact-of-edge-removal}
\caption{\label{fig:d-cliques-ideal-wo-clique-avg-impact-of-edge-removal} Without Clique Averaging }
\end{subfigure}
\hfill
\begin{subfigure}[htbp]{0.23\textwidth}
\centering
\includegraphics[width=\textwidth]{figures/d-cliques-ideal-w-clique-avg-impact-of-edge-removal}
\caption{\label{fig:d-cliques-ideal-w-clique-avg-impact-of-edge-removal} With Clique Averaging}
\end{subfigure}
\caption{\label{fig:d-cliques-ideal-intra-connectivity} Impact of Intra-clique Connectivity Failures on MNIST (1 class/node). Y axis starts at 89.}
\end{figure}