... model with their neighbors. In this paper, we address the following question:
\textit{Can we design sparse topologies with convergence
speed similar to the one obtained in a fully connected network under
a large number of participants with local class bias?}
... with the centralized setting.
\label{section:non-clustered}
In this section, we first compare D-Cliques to alternative topologies to
confirm our main design choices. We then evaluate several extensions of
D-Cliques that further reduce the number of inter-clique connections, so as
to scale even better with the number of nodes.
\subsection{Comparing D-Cliques to Other Sparse Topologies}
We demonstrate the advantages of D-Cliques over alternative sparse topologies
with a similar number of edges. First, we consider topologies where the
neighbors of each node are selected at random without any clique structure.
Specifically, for $n=100$ nodes, we construct a random topology such that
each node has exactly 10 edges, slightly more than the average of 9.9 edges
in our previous D-Cliques example (Fig.~\ref{fig:d-cliques-figure}). To
better understand the importance of the clique structure independently of
the class representativity among neighbors, we also compare to a similar
random topology where edges are chosen such that each node has neighbors of
all possible classes. Finally, we implement an analog of Clique Averaging
for these random topologies, in which all nodes de-bias their gradient with
that of their neighbors. In the latter case, since nodes do not form
cliques, no two nodes actually compute the same resulting average gradient.
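
To make these baselines concrete, the following Python sketch shows one
possible way to generate them; the function names and the use of the
\texttt{networkx} library are illustrative choices, not the code used in our
experiments.

\begin{verbatim}
import random
import networkx as nx

def random_10_regular(n=100, seed=1):
    # Random topology where every node has exactly 10 edges.
    return nx.random_regular_graph(10, n, seed=seed)

def class_representative(node_class, seed=1):
    # Random topology where each node draws one neighbor per
    # class, so that all classes are represented among its
    # neighbors, without forming cliques. Degrees end up close
    # to, but not exactly, 10.
    rng = random.Random(seed)
    classes = sorted(set(node_class))
    by_class = {c: [v for v, k in enumerate(node_class) if k == c]
                for c in classes}
    edges = set()
    for v in range(len(node_class)):
        for c in classes:
            u = rng.choice([u for u in by_class[c] if u != v])
            edges.add((min(u, v), max(u, v)))
    return edges
\end{verbatim}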
The results for MNIST and CIFAR10 are shown in
Figure~\ref{fig:d-cliques-comparison-to-non-clustered-topologies}. For MNIST,
a purely random topology has higher variance and slower convergence than
D-Cliques, with or without Clique Averaging, while a random topology with
class representativity performs similarly to D-Cliques without Clique
Averaging. Perhaps surprisingly, a random topology with unbiased gradients
performs slightly worse than without them. In all cases, D-Cliques with
Clique Averaging performs better than any random topology, showing that the
clique structure has a small but significant effect in this setup.
\begin{figure}[htbp]
\centering
\begin{subfigure}[b]{0.48\textwidth}
% To regenerate the figure, from directory results/mnist
...
\caption{\label{fig:d-cliques-comparison-to-non-clustered-topologies} Comparison to Non-Clustered Topologies}
\end{figure}
On the harder CIFAR10 dataset, the differences are much more dramatic:
D-Cliques with Clique Averaging and momentum is critical for good
convergence. Crucially, all random topologies fail to converge to a good
solution. This confirms that our clique structure is important to reduce
variance across nodes and improve convergence. The difference with the
previous experiment seems to be due both to the use of a higher-capacity
model with local optima and to the intrinsic characteristics of the
datasets. We refer to the appendix for results on MNIST with LeNet.

\subsection{Importance of Intra-Clique Full Connectivity}
\label{section:intra-clique-connectivity}

While the previous experiments suggest that our clique structure is
instrumental in obtaining good performance, one may wonder whether
intra-clique full connectivity is actually necessary.
Figure~\ref{fig:d-cliques-intra-connectivity} shows the convergence speed of
D-Cliques when cliques have been sparsified by randomly removing 1 or 5 of
their 45 edges. Strikingly, for both MNIST and CIFAR10, sparsifying the
cliques even slightly has a significant effect on the convergence speed. In
the case of CIFAR10, it even entirely negates the benefits of D-Cliques.
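
For reference, the sparsification procedure can be sketched as follows; the
function name is an illustrative choice and the snippet only approximates
our experimental setup.

\begin{verbatim}
import random

def sparsify_clique(intra_edges, k, seed=1):
    # Remove k randomly chosen edges among the 45 intra-clique
    # edges of a clique of 10 nodes (here, k = 1 or 5).
    edges = list(intra_edges)
    random.Random(seed).shuffle(edges)
    return edges[k:]
\end{verbatim}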
Overall, these experiments show that achieving fast convergence on non-IID
data with sparse topologies requires a very careful design, as we have
proposed with D-Cliques.
\begin{figure}[t]
\centering
\begin{subfigure}[b]{0.48\textwidth}
...
\end{figure}
\subsection{Scaling up with D-Cliques Extensions}
\label{section:interclique-topologies}
So far, we have used a fully-connected inter-clique topology for D-Cliques,
which bounds the average shortest path between any pair of nodes to $2$.
This uses $\frac{n}{c}(\frac{n}{c}-1)$ inter-clique edges, which scales
quadratically in the number of nodes and can become significant at larger
scales, when $n$ is large compared to $c$.
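
For instance, with $n=1000$ nodes and cliques of size $c=10$, as in the
experiments below, there are $\frac{n}{c}=100$ cliques and thus
$100 \times 99 = 9900$ (directed) inter-clique edges, i.e., $9.9$ per node
on average, on top of the $9$ intra-clique edges of each node.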
In this last series of experiments, we evaluate the effect of the choice of
inter-clique topology on the convergence speed for a larger network of 1000
nodes. We compare the scalability and convergence speed of several D-Cliques
variants, which all use $O(nc)$ edges to create the cliques as a starting
point.
The inter-clique topology with the (almost\footnote{A path uses one edge
less, but converges significantly slower, so we do not consider it.}) fewest
edges is a \textit{ring}, which uses $\frac{n}{c}$ inter-clique edges and
therefore scales linearly in $O(n)$.
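
As a simple illustration, with \texttt{m} denoting the number of cliques
$\frac{n}{c}$:

\begin{verbatim}
def ring(m):
    # Connect clique i to clique (i + 1) mod m:
    # exactly m inter-clique edges in total.
    return [(i, (i + 1) % m) for i in range(m)]
\end{verbatim}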
Another topology scales linearly with a logarithmic bound on the average
shortest number of hops between two nodes: we call it \textit{fractal}. In
this hierarchical scheme, cliques are assembled in larger groups of $c$
cliques that are connected internally with one edge per pair of cliques, but
with only one edge between pairs of larger groups. The scheme is recursive
such that $c$ groups will themselves form a larger group at the next level
up. This results in at most $nc$ edges in total if edges are evenly
distributed, and therefore also scales linearly in the number of nodes.
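
The following Python sketch illustrates the recursion; connecting groups
through a single representative clique is a simplification of the
evenly-distributed edge assignment described above.

\begin{verbatim}
def fractal(num_cliques, c):
    # Level 0: each group is a single clique.
    edges, level = [], [[i] for i in range(num_cliques)]
    while len(level) > 1:
        merged = []
        for g in range(0, len(level), c):
            group = level[g:g + c]
            # One edge per pair of subgroups within a group
            # (here, between representative cliques).
            for a in range(len(group)):
                for b in range(a + 1, len(group)):
                    edges.append((group[a][0], group[b][0]))
            merged.append([x for sub in group for x in sub])
        level = merged
    return edges
\end{verbatim}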
Finally, we propose to connect cliques according to a
small-world-like~\cite{watts2000small} topology applied on top of a
ring~\cite{stoica2003chord}. In this scheme, cliques are first arranged in a
ring. Then each clique adds symmetric edges, both clockwise and
counter-clockwise on the ring, with the $n_s$ closest cliques in sets of
cliques that are exponentially bigger the further they are on the ring (see
Algorithm~\ref{Algorithm:Smallworld} in the appendix for details on the
construction). This ensures good clustering with cliques that are close on
the ring, while keeping the average shortest path small. This scheme uses
$2 n_s \log(\frac{n}{c})$ inter-clique edges per clique, and the total
number of edges therefore grows in the order of $O(n \log n)$ with the
number of nodes.
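
A possible construction is sketched below; the exact procedure is given in
Algorithm~\ref{Algorithm:Smallworld} in the appendix, and the window
handling here is a simplification.

\begin{verbatim}
import random

def smallworld(m, ns, seed=1):
    # Base ring over the m cliques, plus, in both directions,
    # ns edges into each exponentially growing window of ring
    # distances [d, 2d) for d = 2, 4, 8, ...
    rng = random.Random(seed)
    edges = {(i, (i + 1) % m) for i in range(m)}
    for i in range(m):
        d = 2
        while d < m // 2:
            window = list(range(d, min(2 * d, m // 2)))
            for sign in (1, -1):
                for off in rng.sample(window, min(ns, len(window))):
                    j = (i + sign * off) % m
                    edges.add((min(i, j), max(i, j)))
            d *= 2
    return edges
\end{verbatim}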
Figure~\ref{fig:d-cliques-cifar10-convolutional} shows the convergence speed
of all schemes on MNIST and CIFAR10, compared to the ideal baseline of a
single IID node performing the same number of updates per epoch (i.e., the
fastest convergence speed achievable if the topology had no impact). The
ring topology converges but is much slower, while our fractal scheme helps
significantly. The sweet spot appears to be the small-world topology: its
convergence speed is almost the same as with a fully-connected inter-clique
topology, but it uses 22\% fewer edges (14.5 edges per node on average
instead of 18.9), and we can expect even bigger gains at larger scales.
Nonetheless, even the fully-connected inter-clique topology offers
significant benefits with 1000 nodes, as it represents a 98\% reduction in
the number of edges compared to fully connecting individual nodes (18.9
edges per node on average instead of 999) and a 96\% reduction in the number
of messages (37.8 messages per round per node on average instead of 999).
Overall, these results show that D-Cliques can scale nicely with the number
of nodes.
\begin{figure}[htbp]
\centering
% To regenerate the figure, from directory results/mnist
...
However, for IID data, practice contradicts these classic
results: fully decentralized algorithms converge essentially as fast
on sparse topologies like rings or grids as they do on a fully connected
network~\cite{lian2017d-psgd,Lian2018}. Recent work
\cite{neglia2020,consensus_distance} sheds light on this phenomenon with
refined convergence analyses based on differences between gradients or
parameters across nodes, which are typically smaller in the IID case.
However, these results do not give any clear insight regarding the role of
the topology in the non-IID case. We note that some work