Newer
Older
1001
1002
1003
1004
1005
1006
1007
1008
1009
1010
1011
1012
1013
1014
1015
1016
1017
1018
1019
1020
1021
1022
1023
1024
1025
1026
1027
1028
1029
1030
1031
1032
1033
1034
1035
1036
1037
1038
1039
1040
1041
1042
1043
1044
1045
1046
1047
1048
1049
1050
1051
1052
1053
1054
1055
network \cite{lian2017d-psgd,Lian2018}. Recent work
\cite{neglia2020,consensus_distance} sheds light on this phenomenon with refined convergence analyses based on differences between gradients or parameters across nodes, which are typically
smaller in the IID case. However, these results do not give any clear insight
regarding the role of the topology in the non-IID case. We note that some work
has gone into designing efficient topologies to optimize the use of
network resources (see e.g., \cite{marfoq}), but the topology is chosen
independently of how data is distributed across nodes. In summary, the role
of topology in the non-IID data scenario is not well understood and we are not
aware of prior work focusing on this question. Our work is the first
to show that an
appropriate choice of data-dependent topology can effectively compensate for
non-IID data.
\section{Conclusion}
\label{section:conclusion}
We proposed D-Cliques, a sparse topology that recovers the convergence
speed of a fully-connected network in the presence of local class bias.
D-Cliques is based on assembling subsets of nodes into cliques such
that the clique-level class distribution is representative of the global
distribution, thereby locally recovering IIDness. Cliques are joined in a
sparse inter-clique topology so that
they quickly converge to the same model. We proposed Clique
Averaging to remove the non-IID bias in gradient computation by
averaging gradients only with other nodes within the clique. Clique Averaging
can in turn be used to implement unbiased momentum to recover the convergence
speed usually only possible with IID mini-batches. Through our experiments, we
showed that the clique structure of D-Cliques is critical in obtaining these
results and that a small-world inter-clique topology with only $O(n \log (n))$
edges achieves the best compromise between
convergence speed and scalability with the number of nodes.
D-Cliques thus appears to be very promising to reduce bandwidth
usage on FL servers and to implement fully decentralized alternatives in a
wider range of applications where global coordination is impossible or costly.
For instance, the presence and relative frequency of classes in each node
could be computed using PushSum~\cite{kempe2003gossip}, and the topology could
be constructed in a decentralized and adaptive way with
PeerSampling~\cite{jelasity2007gossip}. This will be investigated in future work.
We also believe that our ideas can be useful to deal
with more general types of data non-IIDness beyond the important case of
local class bias that we studied in this paper. An important example is
covariate shift or feature distribution skew \cite{kairouz2019advances}, for
which local density estimates could be used as basis to construct cliques that
approximately recover the global distribution.
\bibliography{../main.bib}
\bibliographystyle{mlsys2022}
\appendix
\section{Detailed Algorithms}
We present a more detailed and precise explanation the algorithm to establish a small-world
inter-clique topology (Algorithm~\ref{Algorithm:Smallworld}).
% \subsection{D-Cliques Construction}
%
% Algorithm~\ref{Algorithm:D-Clique-Construction} shows the overall approach
% for constructing a D-Cliques topology in the non-IID case.\footnote{An IID
% version of D-Cliques, in which each node has an equal number of examples of
% all classes, can be implemented by picking $\#L$ nodes per clique at random.}
% It expects the following inputs: $L$, the set of all classes present in the global distribution $D = \bigcup_{i \in N} D_i$; $N$, the set of all nodes; a function $classes(S)$, which given a subset $S$ of nodes in $N$ returns the set of classes in their joint local distributions ($D_S = \bigcup_{i \in S} D_i$); a function $intraconnect(DC)$, which given $DC$, a set of cliques (set of set of nodes), creates a set of edges ($\{\{i,j\}, \dots \}$) connecting all nodes within each clique to one another; a function $interconnect(DC)$, which given a set of cliques, creates a set of edges ($\{\{i,j\}, \dots \}$) connecting nodes belonging to different cliques; and a function $weigths(E)$, which given a set of edges, returns the weighted matrix $W_{ij}$. Algorithm~\ref{Algorithm:D-Clique-Construction} returns both $W_{ij}$, for use in D-SGD (Algorithm~\ref{Algorithm:D-PSGD} and~\ref{Algorithm:Clique-Unbiased-D-PSGD}), and $DC$, for use with Clique Averaging (Algorithm~\ref{Algorithm:Clique-Unbiased-D-PSGD}).
1066
1067
1068
1069
1070
1071
1072
1073
1074
1075
1076
1077
1078
1079
1080
1081
1082
1083
1084
1085
1086
1087
1088
1089
1090
1091
1092
1093
1094
1095
1096
1097
1098
1099
1100
1101
1102
1103
1104
1105
1106
1107
% \begin{algorithm}[h]
% \caption{D-Cliques Construction}
% \label{Algorithm:D-Clique-Construction}
% \begin{algorithmic}[1]
% \State \textbf{Require:} set of classes globally present $L$,
% \State~~ set of all nodes $N = \{ 1, 2, \dots, n \}$,
% \State~~ fn $\textit{classes}(S)$ that returns the classes present in a subset of nodes $S$,
% \State~~ fn $\textit{intraconnect}(DC)$ that returns edges intraconnecting cliques of $DC$,
% \State~~ fn $\textit{interconnect}(DC)$ that returns edges interconnecting cliques of $DC$ (Sec.~\ref{section:interclique-topologies})
% \State~~ fn $\textit{weights}(E)$ that assigns weights to edges in $E$
%
% \State $R \leftarrow \{ n~\text{for}~n \in N \}$ \Comment{Remaining nodes}
% \State $DC \leftarrow \emptyset$ \Comment{D-Cliques}
% \State $\textit{C} \leftarrow \emptyset$ \Comment{Current Clique}
% \While{$R \neq \emptyset$}
% \State $n \leftarrow \text{pick}~1~\text{from}~\{ m \in R | \textit{classes}(\{m\}) \subsetneq \textit{classes}(\textit{C}) \}$
% \State $R \leftarrow R \setminus \{ n \}$
% \State $C \leftarrow C \cup \{ n \}$
% \If{$\textit{classes}(C) = L$}
% \State $DC \leftarrow DC \cup \{ C \}$
% \State $C \leftarrow \emptyset$
% \EndIf
% \EndWhile
% \State \Return $(weights(\textit{intraconnect}(DC) \cup \textit{interconnect}(DC)), DC)$
% \end{algorithmic}
%\end{algorithm}
The implementation builds a single clique by adding nodes with different
classes until all classes of the global distribution are represented. Each
clique is built sequentially until all nodes are parts of cliques.
Because all classes are represented on an equal number of nodes, all cliques
will have nodes of all classes. Furthermore, since nodes have examples
of a single class, we are guaranteed a valid assignment is possible in a greedy manner. After cliques are created, edges are added and weights are assigned to edges, using the corresponding input functions.
\subsection{Small-world Inter-clique Topology}
Algorithm~\ref{Algorithm:Smallworld} instantiates the function
\textit{interconnect} with a
small-world inter-clique topology as described in Section~\ref{section:interclique-topologies}. It adds a
linear number of inter-clique edges by first arranging cliques on a ring. It then adds a logarithmic number of ``finger'' edges to other cliques on the ring chosen such that there is a constant number of edges added per set, on sets that are exponentially bigger the further away on the ring. ``Finger'' edges are added symmetrically on both sides of the ring to the cliques in each set that are closest to a given set.
1108
1109
1110
1111
1112
1113
1114
1115
1116
1117
1118
1119
1120
1121
1122
1123
1124
1125
1126
1127
1128
1129
1130
1131
1132
1133
1134
\begin{algorithm}[h]
\caption{$\textit{smallworld}(DC)$: adds $O(\# N \log(\# N))$ edges}
\label{Algorithm:Smallworld}
\begin{algorithmic}[1]
\STATE \textbf{Require:} set of cliques $DC$ (set of set of nodes)
\STATE ~~size of neighborhood $ns$ (default 2)
\STATE ~~function $\textit{least\_edges}(S, E)$ that returns one of the nodes in $S$ with the least number of edges in $E$
\STATE $E \leftarrow \emptyset$ \COMMENT{Set of Edges}
\STATE $L \leftarrow [ C~\text{for}~C \in DC ]$ \COMMENT{Arrange cliques in a list}
\FOR{$i \in \{1,\dots,\#DC\}$} % \COMMENT{For every clique}
% %\STATE ~\COMMENT{For sets of cliques exponentially further away from $i$}
\FOR{$\textit{offset} \in \{ 2^x~\text{for}~x~\in \{ 0, \dots, \lceil \log_2(\#DC) \rceil \} \}$}
% \STATE \COMMENT{Pick the $ns$ closest}
\FOR{$k \in \{0,\dots,ns-1\}$}
% \STATE \COMMENT{Add inter-clique connections in both directions}
\STATE $n \leftarrow \textit{least\_edges}(L_i, E)$
\STATE $m \leftarrow \textit{least\_edges}(L_{(i+\textit{offset}+k) \% \#DC}, E)$ %\COMMENT{clockwise in ring}
\STATE $E \leftarrow E \cup \{ \{n,m\} \}$
\STATE $n \leftarrow \textit{least\_edges}(L_i, E)$
\STATE $m \leftarrow \textit{least\_edges}(L_{(i-\textit{offset}-k)\% \#DC} , E)$ %\COMMENT{counter-clockwise in ring}
\STATE $E \leftarrow E \cup \{ \{n,m\} \}$
\ENDFOR
\ENDFOR
\ENDFOR
\RETURN E
\end{algorithmic}
\end{algorithm}
1135
1136
1137
1138
1139
1140
1141
1142
1143
1144
1145
1146
1147
1148
1149
1150
1151
1152
1153
1154
1155
1156
1157
1158
1159
1160
1161
1162
1163
1164
1165
1166
1167
1168
1169
1170
1171
1172
1173
1174
1175
1176
1177
1178
1179
1180
1181
1182
1183
1184
1185
1186
1187
1188
1189
1190
1191
1192
1193
1194
1195
1196
1197
1198
1199
1200
1201
1202
1203
1204
1205
1206
1207
1208
1209
1210
1211
1212
1213
1214
1215
1216
1217
1218
1219
1220
1221
1222
1223
1224
1225
1226
1227
1228
1229
1230
1231
1232
1233
1234
1235
1236
1237
1238
1239
1240
1241
1242
1243
1244
1245
1246
1247
1248
1249
1250
1251
1252
1253
1254
1255
1256
1257
1258
1259
1260
1261
1262
1263
1264
1265
1266
1267
1268
1269
1270
1271
1272
1273
1274
1275
1276
1277
1278
1279
1280
1281
1282
1283
1284
1285
1286
1287
1288
1289
1290
1291
1292
1293
1294
1295
1296
1297
1298
1299
1300
1301
1302
1303
1304
1305
1306
1307
1308
1309
1310
1311
1312
1313
1314
1315
1316
1317
1318
1319
1320
1321
1322
1323
1324
1325
1326
1327
1328
1329
1330
1331
1332
1333
1334
1335
1336
1337
1338
1339
1340
1341
1342
1343
1344
1345
1346
1347
1348
1349
1350
1351
1352
1353
1354
1355
1356
1357
1358
1359
1360
1361
1362
1363
1364
1365
1366
1367
1368
1369
1370
1371
1372
1373
1374
1375
1376
1377
1378
1379
1380
1381
1382
1383
1384
1385
1386
1387
Algorithm~\ref{Algorithm:Smallworld} expects a set of cliques $DC$, previously computed by
Algorithm~\ref{Algorithm:D-Clique-Construction}; a size of neighborhood $ns$,
which is the number of finger edges to add per set of cliques, and a function
\textit{least\_edges}, which given a set of nodes $S$ and an existing set of
edges $E = \{\{i,j\}, \dots \}$, returns one of the nodes in $E$ with the least number of edges. It returns a new set of edges $\{\{i,j\}, \dots \}$ with all edges added by the small-world topology.
The implementation first arranges the cliques of $DC$ in a list, which
represents the ring. Traversing the list with increasing indices is equivalent
to traversing the ring in the clockwise direction, and inversely. Then, for every clique $i$ on the ring from which we are computing the distance to others, a number of edges are added. All other cliques are implicitly arranged in mutually exclusive sets, with size and at offset exponentially bigger (doubling at every step). Then for every of these sets, $ns$ edges are added, both in the clockwise and counter-clockwise directions, always on the nodes with the least number of edges in each clique. The ring edges are implicitly added to the cliques at offset $1$ in both directions.
% \section{Additional Experiments}
% \subsection{Effect of Clique Averaging and Uniform Initialization}
% Section~\ref{section:clique-averaging} explained how Clique Averaging reduces bias and showed that Clique Averaging was significantly beneficial on MNIST with fully-connected D-Cliques. In this section, we provide additional results for the ring topology, as well as for CIFAR10. In addition, during our early exploration, we noticed that ensuring \textit{uniform initialization}, i.e. ensuring that all nodes start with the same model, increased convergence speed when connecting two cliques with 1-2 interclique edges. We therefore also verify whether this effect is still significant with 10 cliques (100 nodes), on a ring and with full connections between cliques, as well as on MNIST and CIFAR10. We also verified what interaction this had with Clique Averaging.
% Figure~\ref{fig:d-cliques-mnist-init-clique-avg-effect} shows all the results for MNIST. Comparing Figure~\ref{fig:d-cliques-mnist-init-clique-avg-effect-ring-test-accuracy} to~\ref{fig:d-cliques-mnist-no-init-clique-avg-effect-ring-test-accuracy}, and Figure~\ref{fig:d-cliques-mnist-init-clique-avg-effect-fcc-test-accuracy} to~\ref{fig:d-cliques-mnist-no-init-clique-avg-effect-fcc-test-accuracy} together, we see that Uniform Initialization has imperceptible effects. However, for all four sub-figures, using Clique Averaging has a slightly better average convergence speed, and significantly lower variance between nodes, than not using it. Moreover, the improvement is larger with Fully-Connected D-Cliques.
% \begin{figure}[htbp]
% \centering
% % To regenerate the figure, from directory results/mnist
% % python ../../../learn-topology/tools/plot_convergence.py clique-ring/all/2021-03-10-18:14:35-CET no-clique-avg/clique-ring/all/2021-03-12-10:40:37-CET --add-min-max --yaxis test-accuracy --labels 'with clique avg.' 'without clique avg.' --legend 'lower right' --ymin 85 --ymax 92.5 --save-figure ../../figures/d-cliques-mnist-init-clique-avg-effect-ring-test-accuracy.png
% \begin{subfigure}[b]{0.48\textwidth}
% \centering
% \includegraphics[width=\textwidth]{figures/d-cliques-mnist-init-clique-avg-effect-ring-test-accuracy}
% \caption{\label{fig:d-cliques-mnist-init-clique-avg-effect-ring-test-accuracy} D-Cliques (Ring), with Uniform Initialization}
% \end{subfigure}
% \quad
% % To regenerate the figure, from directory results/mnist
% % python ../../../learn-topology/tools/plot_convergence.py no-init/clique-ring/all/2021-03-12-10:40:11-CET no-init-no-clique-avg/clique-ring/all/2021-03-12-10:41:03-CET --add-min-max --yaxis test-accuracy --labels 'with clique avg.' 'without clique avg.' --legend 'lower right' --ymin 85 --ymax 92.5 --save-figure ../../figures/d-cliques-mnist-no-init-clique-avg-effect-ring-test-accuracy.png
% \begin{subfigure}[b]{0.48\textwidth}
% \centering
% \includegraphics[width=\textwidth]{figures/d-cliques-mnist-no-init-clique-avg-effect-ring-test-accuracy}
% \caption{\label{fig:d-cliques-mnist-no-init-clique-avg-effect-ring-test-accuracy} D-Cliques (Ring), without Uniform Initialization}
% \end{subfigure}
% % To regenerate the figure, from directory results/mnist
% %python ../../../learn-topology/tools/plot_convergence.py fully-connected-cliques/all/2021-03-10-10:19:44-CET no-clique-avg/fully-connected-cliques/all/2021-03-12-11:12:26-CET --add-min-max --yaxis test-accuracy --labels 'with clique avg.' 'without clique avg.' --legend 'lower right' --ymin 85 --ymax 92.5 --save-figure ../../figures/d-cliques-mnist-init-clique-avg-effect-fcc-test-accuracy.png
% \begin{subfigure}[b]{0.48\textwidth}
% \centering
% \includegraphics[width=\textwidth]{figures/d-cliques-mnist-init-clique-avg-effect-fcc-test-accuracy}
% \caption{\label{fig:d-cliques-mnist-init-clique-avg-effect-fcc-test-accuracy} D-Cliques (Fully-Connected), with Uniform Initialization}
% \end{subfigure}
% \quad
% % To regenerate the figure, from directory results/mnist
% %python ../../../learn-topology/tools/plot_convergence.py no-init/fully-connected-cliques/all/2021-03-12-11:12:01-CET no-init-no-clique-avg/fully-connected-cliques/all/2021-03-12-11:12:49-CET --add-min-max --yaxis test-accuracy --labels 'with clique avg.' 'without clique avg.' --legend 'lower right' --ymin 85 --ymax 92.5 --save-figure ../../figures/d-cliques-mnist-no-init-clique-avg-effect-fcc-test-accuracy.png
% \begin{subfigure}[b]{0.48\textwidth}
% \centering
% \includegraphics[width=\textwidth]{figures/d-cliques-mnist-no-init-clique-avg-effect-fcc-test-accuracy}
% \caption{\label{fig:d-cliques-mnist-no-init-clique-avg-effect-fcc-test-accuracy} D-Cliques (Fully-Connected), without Uniform Initialization}
% \end{subfigure}
% \caption{\label{fig:d-cliques-mnist-init-clique-avg-effect} MNIST: Effects of Clique Averaging and Uniform Initialization on Convergence Speed. (100 nodes, non-IID, D-Cliques, bsz=128)}
% \end{figure}
% Figure~\ref{fig:d-cliques-cifar10-init-clique-avg-effect} shows all the results for CIFAR10. One the one hand, with D-Cliques arranged in a ring, uniform initialization has a small but positive effect on convergence speed, whether Clique Averaging is used or not. With fully-connected D-Cliques, the effect is significantly smaller and almost negligible, both with and without Clique Averaging. On the other hand, Clique Averaging is always beneficial, by a significantly larger margin for both interclique topologies and with and without uniform initialization. Moreover, the effect is stronger than for MNIST.
% \begin{figure}[htbp]
% \centering
% % To regenerate the figure, from directory results/cifar10
% % python ../../../learn-topology/tools/plot_convergence.py clique-ring/all/2021-03-10-11:58:43-CET no-init/clique-ring/all/2021-03-13-18:28:30-CET no-clique-avg/clique-ring/all/2021-03-13-18:27:09-CET no-init-no-clique-avg/clique-ring/all/2021-03-13-18:29:58-CET --add-min-max --yaxis test-accuracy --labels 'with clique avg., with uniform init.' 'with clique avg., without uniform init.' 'without clique avg., with uniform init.' 'without clique avg., without uniform init.' --legend 'upper left' --ymax 115 --save-figure ../../figures/d-cliques-cifar10-init-clique-avg-effect-ring-test-accuracy.png --font-size 15
% \begin{subfigure}[b]{0.48\textwidth}
% \centering
% \includegraphics[width=\textwidth]{figures/d-cliques-cifar10-init-clique-avg-effect-ring-test-accuracy}
% \caption{\label{fig:d-cliques-cifar10-init-clique-avg-effect-ring-test-accuracy} D-Cliques (Ring)}
% \end{subfigure}
% % To regenerate the figure, from directory results/cifar10
% %python ../../../learn-topology/tools/plot_convergence.py fully-connected-cliques/all/2021-03-10-13:58:57-CET no-init/fully-connected-cliques/all/2021-03-13-18:32:55-CET no-clique-avg/fully-connected-cliques/all/2021-03-13-18:31:36-CET no-init-no-clique-avg/fully-connected-cliques/all/2021-03-13-18:34:35-CET --add-min-max --yaxis test-accuracy --labels 'with clique avg., with uniform init.' 'with clique avg., without uniform init.' 'without clique avg., with uniform init.' 'without clique avg., without uniform init.' --legend 'upper left' --ymax 115 --save-figure ../../figures/d-cliques-cifar10-init-clique-avg-effect-fcc-test-accuracy.png --font-size 15
% \begin{subfigure}[b]{0.48\textwidth}
% \centering
% \includegraphics[width=\textwidth]{figures/d-cliques-cifar10-init-clique-avg-effect-fcc-test-accuracy}
% \caption{\label{fig:d-cliques-cifar10-init-clique-avg-effect-fcc-test-accuracy} D-Cliques (Fully-Connected)}
% \end{subfigure}
% \caption{\label{fig:d-cliques-cifar10-init-clique-avg-effect} CIFAR10: Effects of Clique Averaging and Uniform Initialization on Convergence Speed. (100 nodes, non-IID, D-Cliques, bsz=20, momentum=0.9)}
% \end{figure}
% We conclude that Uniform Initialization is not so important for convergence speed but that Clique Averaging is always significantly so.
% \subsection{Comparison to Non-Clustered Topologies}
%
% \begin{figure}
%\centering
% \begin{subfigure}[htb]{0.48\textwidth}
%% To regenerate the figure, from directory results/mnist/gn-lenet
%% python ../../../../learn-topology/tools/plot_convergence.py no-init/all/2021-03-22-21:39:54-CET no-init-no-clique-avg/all/2021-03-22-21:40:16-CET random-10/all/2021-03-22-21:41:06-CET random-10-diverse/all/2021-03-22-21:41:46-CET random-10-diverse-unbiased-grad/all/2021-03-22-21:42:04-CET --legend 'lower right' --add-min-max --labels 'd-clique (fcc) clique avg.' 'd-clique (fcc) no clique avg.' '10 random edges' '10 random edges (all classes repr.)' '10 random edges (all classes repr.) with unbiased grad.' --ymin 80 --yaxis test-accuracy --save-figure ../../../figures/d-cliques-mnist-lenet-comparison-to-non-clustered-topologies.png
% \includegraphics[width=\textwidth]{figures/d-cliques-mnist-lenet-comparison-to-non-clustered-topologies}
% \caption{\label{fig:d-cliques-mnist-lenet-comparison-to-non-clustered-topologies} LeNet Model}
% \end{subfigure}
% \hfill
% \begin{subfigure}[htb]{0.48\textwidth}
%% To regenerate the figure, from directory results/mnist/gn-lenet
%% python ../../../../learn-topology/tools/plot_convergence.py no-init/all/2021-03-22-21:39:54-CET no-init-no-clique-avg/all/2021-03-22-21:40:16-CET random-10/all/2021-03-22-21:41:06-CET random-10-diverse/all/2021-03-22-21:41:46-CET random-10-diverse-unbiased-grad/all/2021-03-22-21:42:04-CET --legend 'upper right' --add-min-max --labels 'd-clique (fcc) clique avg.' 'd-clique (fcc) no clique avg.' '10 random edges' '10 random edges (all classes repr.)' '10 random edges (all classes repr.) with unbiased grad.' --ymax 0.7 --yaxis scattering --save-figure ../../../figures/d-cliques-mnist-lenet-comparison-to-non-clustered-topologies-scattering.png
% \includegraphics[width=\textwidth]{figures/d-cliques-mnist-lenet-comparison-to-non-clustered-topologies-scattering}
% \caption{\label{fig:d-cliques-mnist-lenet-comparison-to-non-clustered-topologies-scattering} LeNet Model (Scattering)}
% \end{subfigure}
%
% \caption{\label{fig:d-cliques-mnist-comparison-to-non-clustered-topologies} MNIST: Comparison to non-Clustered Topologies}
%\end{figure}
%
% \begin{figure}
% \centering
% % To regenerate the figure, from directory results/cifar10
%% python ../../../learn-topology/tools/plot_convergence.py fully-connected-cliques/all/2021-03-10-13:58:57-CET no-init-no-clique-avg/fully-connected-cliques/all/2021-03-13-18:34:35-CET random-10/all/2021-03-17-20:30:03-CET random-10-diverse/all/2021-03-17-20:30:41-CET random-10-diverse-unbiased-gradient/all/2021-03-17-20:31:14-CET random-10-diverse-unbiased-gradient-uniform-init/all/2021-03-17-20:31:41-CET --labels 'd-clique (fcc) clique avg., uniform init.' 'd-clique (fcc) no clique avg. no uniform init.' '10 random edges' '10 random edges (all classes repr.)' '10 random (all classes repr.) with unbiased grad.' '10 random (all classes repr.) with unbiased grad., uniform init.' --add-min-max --legend 'upper right' --yaxis scattering --save-figure ../../figures/d-cliques-cifar10-linear-comparison-to-non-clustered-topologies-scattering.png --ymax 0.7
% \begin{subfigure}[b]{0.48\textwidth}
% \centering
% \includegraphics[width=\textwidth]{figures/d-cliques-cifar10-linear-comparison-to-non-clustered-topologies-scattering}
% \caption{\label{fig:d-cliques-cifar10-linear-comparison-to-non-clustered-topologies-scattering} LeNet Model: Scattering}
% \end{subfigure}
%
%\caption{\label{fig:d-cliques-cifar10-linear-comparison-to-non-clustered-topologies} CIFAR10: Comparison to non-Clustered Topologies}
%\end{figure}
%
%
%\begin{itemize}
% \item Clustering does not seem to make a difference in MNIST, even when using a higher-capacity model (LeNet) instead of a linear model. (Fig.\ref{fig:d-cliques-mnist-comparison-to-non-clustered-topologies})
% \item Except for the random 10 topology, convergence speed seems to be correlated with scattering in CIFAR-10 with LeNet model (Fig.\ref{fig:d-cliques-cifar10-linear-comparison-to-non-clustered-topologies}). There is also more difference between topologies both in convergence speed and scattering than for MNIST (Fig.~\ref{fig:d-cliques-mnist-comparison-to-non-clustered-topologies}). Scattering computed similar to Consensus Control for Decentralized Deep Learning~\cite{consensus_distance}.
%\end{itemize}
%
%
% \clearpage
\section{Additional Experiments on Scaling Behavior with Increasing Number of
Nodes}
\label{app:scaling}
Section~\ref{section:interclique-topologies} compares the convergence speed of various inter-clique topologies at a scale of 1000 nodes. In this section, we show the effect of scaling the number of nodes, by comparing the convergence speed with 1, 10, 100, and 1000 nodes, and adjusting the batch size to maintain a constant number of updates per epoch. We present results for Ring, Fractal, Small-world, and Fully-Connected inter-clique topologies.
Figure~\ref{fig:d-cliques-mnist-scaling-fully-connected} shows the results for
MNIST. For all topologies, we notice a perfect scaling up to 100 nodes, i.e.
the accuracy curves overlap, with low variance between nodes. Starting at 1000
nodes, there is a significant increase in variance between nodes and the
convergence is slower, only marginally for Fully-Connected but
significantly so for Fractal and Ring. Small-world has higher variance between nodes but maintains a convergence speed close to that of Fully-Connected.
Figure~\ref{fig:d-cliques-cifar10-scaling-fully-connected} shows the results
for CIFAR10. When increasing from 1 to 10 nodes (resulting in a single
fully-connected clique), there is actually a small increase both in final
accuracy and convergence speed. We believe this increase is due to the
gradient being computed with exactly the same number of examples from all
classes with 10 fully-connected non-IID nodes, while the gradient for a single
non-IID node may have a slightly larger bias because the random sampling does
not guarantee the representation of all classes perfectly in each batch. At a
scale of 100 nodes, there is no difference between Fully-Connected and
Fractal, as the connections are the same; however, a Ring already shows a
significantly slower convergence. At 1000 nodes, the convergence significantly
slows down for Fractal and Ring, while remaining close, albeit with a larger
variance, for Fully-Connected. Similar to MNIST, Small-world has
higher variance and slightly lower convergence speed than Fully-Connected but
remains very close.
We therefore conclude that Fully-Connected and Small-world have good scaling
properties in terms of convergence speed, and that the
linear-logarithmic number of edges of Small-world makes it the best compromise
between convergence speed and connectivity, and thus the best choice for
efficient large-scale decentralized learning in practice.
\begin{figure}[htbp]
\centering
% To regenerate the figure, from directory results/scaling
% python ../../../learn-topology/tools/plot_convergence.py ../mnist/1-node-iid/all/2021-03-10-09:20:03-CET 10/mnist/fully-connected-cliques/all/2021-03-12-09:13:27-CET ../mnist/fully-connected-cliques/all/2021-03-10-10:19:44-CET 1000/mnist/fully-connected-cliques/all/2021-03-14-17:56:26-CET --labels '1 node IID bsz=12800' '10 nodes bsz=1280' '100 nodes bsz=128' '1000 nodes bsz=13' --legend 'lower right' --yaxis test-accuracy --save-figure ../../figures/d-cliques-mnist-scaling-fully-connected-cst-updates.png --ymin 80 --add-min-max
\begin{subfigure}[b]{0.35\textwidth}
\centering
\includegraphics[width=\textwidth]{../figures/d-cliques-mnist-scaling-fully-connected-cst-updates}
\caption{Fully-Connected}
\end{subfigure}
\quad
% To regenerate the figure, from directory results/scaling
% python ../../../learn-topology/tools/plot_convergence.py ../mnist/1-node-iid/all/2021-03-10-09:20:03-CET 10/mnist/fully-connected-cliques/all/2021-03-12-09:13:27-CET ../mnist/smallworld-logn-cliques/all/2021-03-23-21:44:56-CET 1000/mnist/smallworld-logn-cliques/all/2021-03-23-21:45:39-CET --labels '1 node IID bsz=12800' '10 nodes bsz=1280' '100 nodes bsz=128' '1000 nodes bsz=13' --legend 'lower right' --yaxis test-accuracy --save-figure ../../figures/d-cliques-mnist-scaling-smallworld-cst-updates.png --ymin 80 --add-min-max
\begin{subfigure}[b]{0.35\textwidth}
\centering
\includegraphics[width=\textwidth]{../figures/d-cliques-mnist-scaling-smallworld-cst-updates}
\caption{Small-world}
\end{subfigure}
\quad
% To regenerate the figure, from directory results/scaling
% python ../../../learn-topology/tools/plot_convergence.py ../mnist/1-node-iid/all/2021-03-10-09:20:03-CET 10/mnist/clique-ring/all/2021-03-13-18:22:01-CET ../mnist/fully-connected-cliques/all/2021-03-10-10:19:44-CET 1000/mnist/fractal-cliques/all/2021-03-14-17:41:59-CET --labels '1 node IID bsz=12800' '10 nodes bsz=1280' '100 nodes bsz=128' '1000 nodes bsz=13' --legend 'lower right' --yaxis test-accuracy --save-figure ../../figures/d-cliques-mnist-scaling-fractal-cliques-cst-updates.png --ymin 80 --add-min-max
\begin{subfigure}[b]{0.35\textwidth}
\centering
\includegraphics[width=\textwidth]{../figures/d-cliques-mnist-scaling-fractal-cliques-cst-updates}
\caption{Fractal}
\end{subfigure}
\quad
% To regenerate the figure, from directory results/scaling
% python ../../../learn-topology/tools/plot_convergence.py ../mnist/1-node-iid/all/2021-03-10-09:20:03-CET 10/mnist/clique-ring/all/2021-03-13-18:22:01-CET ../mnist/clique-ring/all/2021-03-10-18:14:35-CET 1000/mnist/clique-ring/all/2021-03-13-18:22:36-CET --labels '1 node IID bsz=12800' '10 nodes bsz=1280' '100 nodes bsz=128' '1000 nodes bsz=13' --legend 'lower right' --yaxis test-accuracy --save-figure ../../figures/d-cliques-mnist-scaling-clique-ring-cst-updates.png --ymin 80 --add-min-max
\begin{subfigure}[b]{0.35\textwidth}
\centering
\includegraphics[width=\textwidth]{../figures/d-cliques-mnist-scaling-clique-ring-cst-updates}
\caption{Ring}
\end{subfigure}
\caption{\label{fig:d-cliques-mnist-scaling-fully-connected} MNIST:
D-Cliques scaling behavior (constant updates per epoch) for different
inter-clique topologies.}
\end{figure}
\begin{figure}[htbp]
\centering
% To regenerate the figure, from directory results/scaling
% python ../../../learn-topology/tools/plot_convergence.py ../cifar10/1-node-iid/all/2021-03-10-13:52:58-CET 10/cifar10/fully-connected-cliques/all/2021-03-13-19:06:02-CET ../cifar10/fully-connected-cliques/all/2021-03-10-13:58:57-CET 1000/cifar10/fully-connected-cliques/all/2021-03-14-17:41:20-CET --labels '1 node IID bsz=2000' '10 nodes non-IID bsz=200' '100 nodes non-IID bsz=20' '1000 nodes non-IID bsz=2' --legend 'lower right' --yaxis test-accuracy --save-figure ../../figures/d-cliques-cifar10-scaling-fully-connected-cst-updates.png --add-min-max
\begin{subfigure}[b]{0.35\textwidth}
\centering
\includegraphics[width=\textwidth]{../figures/d-cliques-cifar10-scaling-fully-connected-cst-updates}
\caption{Fully-Connected}
\end{subfigure}
\quad
% python ../../../learn-topology/tools/plot_convergence.py ../cifar10/1-node-iid/all/2021-03-10-13:52:58-CET 10/cifar10/fully-connected-cliques/all/2021-03-13-19:06:02-CET ../cifar10/smallworld-logn-cliques/all/2021-03-23-22:13:23-CET 1000/cifar10/smallworld-logn-cliques/all/2021-03-23-22:13:57-CET --labels '1 node IID bsz=2000' '10 nodes non-IID bsz=200' '100 nodes non-IID bsz=20' '1000 nodes non-IID bsz=2' --legend 'lower right' --yaxis test-accuracy --save-figure ../../figures/d-cliques-cifar10-scaling-smallworld-cst-updates.png --add-min-max
\begin{subfigure}[b]{0.35\textwidth}
\centering
\includegraphics[width=\textwidth]{../figures/d-cliques-cifar10-scaling-smallworld-cst-updates}
\caption{Small-world}
\end{subfigure}
% To regenerate the figure, from directory results/scaling
% python ../../../learn-topology/tools/plot_convergence.py ../cifar10/1-node-iid/all/2021-03-10-13:52:58-CET 10/cifar10/fully-connected-cliques/all/2021-03-13-19:06:02-CET ../cifar10/fully-connected-cliques/all/2021-03-10-13:58:57-CET 1000/cifar10/fractal-cliques/all/2021-03-14-17:42:46-CET --labels '1 node IID bsz=2000' '10 nodes non-IID bsz=200' '100 nodes non-IID bsz=20' '1000 nodes non-IID bsz=2' --legend 'lower right' --yaxis test-accuracy --save-figure ../../figures/d-cliques-cifar10-scaling-fractal-cliques-cst-updates.png --add-min-max
\begin{subfigure}[b]{0.35\textwidth}
\centering
\includegraphics[width=\textwidth]{../figures/d-cliques-cifar10-scaling-fractal-cliques-cst-updates}
\caption{Fractal}
\end{subfigure}
\quad
% To regenerate the figure, from directory results/scaling
% python ../../../learn-topology/tools/plot_convergence.py ../cifar10/1-node-iid/all/2021-03-10-13:52:58-CET 10/cifar10/fully-connected-cliques/all/2021-03-13-19:06:02-CET ../cifar10/clique-ring/all/2021-03-10-11:58:43-CET 1000/cifar10/clique-ring/all/2021-03-14-09:55:24-CET --labels '1 node IID bsz=2000' '10 nodes non-IID bsz=200' '100 nodes non-IID bsz=20' '1000 nodes non-IID bsz=2' --legend 'lower right' --yaxis test-accuracy --save-figure ../../figures/d-cliques-cifar10-scaling-clique-ring-cst-updates.png --add-min-max
\begin{subfigure}[b]{0.35\textwidth}
\centering
\includegraphics[width=\textwidth]{../figures/d-cliques-cifar10-scaling-clique-ring-cst-updates}
\caption{Ring}
\end{subfigure}
\caption{\label{fig:d-cliques-cifar10-scaling-fully-connected} CIFAR10: D-Cliques scaling behavior (constant updates per epoch) for different
inter-clique topologies.}
\end{figure}
%% % To regenerate the figure, from directory results/scaling
%%% python ../../../learn-topology/tools/plot_convergence.py 10/mnist/fully-connected-cliques/all/2021-03-10-14:40:35-CET ../mnist/fully-connected-cliques/all/2021-03-10-10:19:44-CET 1000/mnist/fully-connected-cliques/all/2021-03-10-16:44:35-CET --labels '10 nodes bsz=128' '100 nodes bsz=128' '1000 nodes bsz=128 (45)' --legend 'lower right' --yaxis test-accuracy --save-figure ../../figures/d-cliques-mnist-scaling-fully-connected-cst-bsz.png --ymin 80 --add-min-max
% \begin{figure}[htbp]
% \centering
% \includegraphics[width=0.48\textwidth]{figures/d-cliques-mnist-scaling-fully-connected-cst-bsz}
% \caption{FCC: Constant Batch-Size}
% \end{figure}
\end{document}