%AMK: add what is in there
In this section, we first compare D-Cliques to alternative topologies to
confirm the relevance of our main design choices. Then,
we evaluate some extensions of D-Cliques to further reduce the number of
inter-clique connections so as to gracefully scale with the number of
nodes.
\subsection{Comparing D-Cliques to Other Sparse Topologies} %Non-Clustered
% Topologies}
%We now show, in this section and the next, that the particular structure of D-Cliques is necessary. \label{section:non-clustered}
We demonstrate the advantages of D-Cliques over alternative sparse topologies
that have a similar number of edges. First, we consider topologies in which
the neighbors of each node are selected at random (hence without any clique
structure).
Specifically, for $n=100$ nodes, we
construct a random topology such that each node has exactly 10 edges, which is
similar to the average 9.9 edges of our D-Cliques topology
(Figure~\ref{fig:d-cliques-figure}). To better understand the role of
the clique structure beyond merely ensuring class representativity among
neighbors,
we also compare to a random topology similar to the one described above except
that edges are
chosen such that each node has neighbors of all possible classes. Finally, we
also implement an analog of Clique Averaging for these random topologies,
where all nodes de-bias their gradient based on the class distribution of
their neighbors. In the latter case, since nodes do not form a clique, each
node obtains a different average gradient.
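For illustration, the following Python sketch shows one way to construct the
random baseline and to implement the de-biasing step. It is a minimal sketch
assuming a \texttt{networkx} graph and integer class labels; the function
names are ours and it is not the exact experimental code.
\begin{verbatim}
import networkx as nx
import numpy as np

def random_topology(n=100, degree=10, seed=0):
    # Random baseline: every node gets exactly `degree` neighbors,
    # similar to the 9.9 average edges of the D-Cliques topology.
    return nx.random_regular_graph(degree, n, seed=seed)

def covers_all_classes(graph, labels, num_classes=10):
    # Property enforced by the class-representative baseline:
    # every node has at least one neighbor of each class.
    return all(set(labels[v] for v in graph.neighbors(u))
               == set(range(num_classes)) for u in graph.nodes)

def debiased_gradient(node, grads, graph):
    # Analog of Clique Averaging on a random topology: average the
    # local gradient with those of the neighbors. Since neighborhoods
    # differ, each node obtains a different average gradient.
    neigh = list(graph.neighbors(node)) + [node]
    return np.mean([grads[v] for v in neigh], axis=0)
\end{verbatim}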
The results for MNIST and CIFAR10 are shown in
Figure~\ref{fig:d-cliques-comparison-to-non-clustered-topologies}. For MNIST,
a purely random topology has higher variance and lower convergence speed than
D-Cliques (with or without Clique Averaging), while a random topology with
class representativity performs similarly to D-Cliques without Clique
Averaging. However, and perhaps surprisingly, a random topology with unbiased
gradient performs slightly worse than the same topology without it. In any
case, D-Cliques with
Clique Averaging outperforms all random topologies, showing that the clique
structure has a small but noticeable effect on the average accuracy and
significantly reduces the variance across nodes in this setup.
\begin{figure}[t]
\centering
\caption{\label{fig:d-cliques-comparison-to-non-clustered-topologies} Comparison to Non-Clustered Topologies}
\end{figure}
On the harder CIFAR10 dataset with a deep convolutional network, the
differences are much more dramatic:
D-Cliques with Clique Averaging and momentum turns out to be critical for fast
convergence.
Crucially, all random topologies fail to converge to a good solution. This
confirms that our clique structure is important to reduce variance
across nodes and improve convergence. The difference with the previous
experiment seems to be due both to the use of a higher capacity model and to
the intrinsic characteristics of the datasets. We refer
to the appendix for results on MNIST with LeNet.
% We have tried to use LeNet on
% MNIST to see if the difference between MNIST and CIFAR10 could be attributed to the capacity difference between the Linear and Convolutional networks, whose optimization may benefit from clustering (see Appendix). The difference is less dramatic than for CIFAR10, so it must be that the dataset also has an impact. The exact nature of it is still an open question.
instrumental in obtaining good performance, one may wonder whether
intra-clique full connectivity is actually necessary.
%AMK: check sentence above: justify
Figure~\ref{fig:d-cliques-intra-connectivity} shows the convergence speed of
a D-Cliques topology where cliques have been sparsified by randomly
removing 1 or 5 edges per clique (out of 45). Strikingly, both for MNIST and
CIFAR10, removing just a single edge from the cliques has a
significant effect on the
convergence speed. On CIFAR10, it even entirely negates the
benefits of D-Cliques.
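For reference, the sparsification used in this experiment can be sketched as
follows; this is a minimal illustration assuming a \texttt{networkx} graph and
cliques given as lists of node identifiers.
\begin{verbatim}
import random

def sparsify_cliques(graph, cliques, k, seed=0):
    # Randomly remove k intra-clique edges in each clique
    # (out of c(c-1)/2 = 45 for cliques of size c = 10).
    rng = random.Random(seed)
    for clique in cliques:
        edges = [(u, v) for i, u in enumerate(clique)
                 for v in clique[i + 1:]]
        for u, v in rng.sample(edges, k):
            graph.remove_edge(u, v)
    return graph
\end{verbatim}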
Overall, these results show that achieving fast convergence on non-IID
data with sparse topologies requires a very careful design, as we have
proposed with D-Cliques.
\end{figure}
%\section{Scaling with Different Inter-Clique Topologies}
\subsection{Scaling up D-Cliques with Sparser Inter-Clique Topologies}
%with Different Inter-Clique Topologies}
\label{section:interclique-topologies}
So far, we have used a fully-connected inter-clique topology for D-Cliques,
which has the advantage of bounding the
average shortest path to $2$ between any pair of nodes. This choice requires
$\frac{n}{c}(\frac{n}{c} - 1)$ inter-clique edges, which scales quadratically
in the number of nodes. This can become significant at larger scales when $n$ is
large compared to $c$.
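For concreteness, with $n = 1000$ nodes and cliques of size $c = 10$ as in the
experiments below, this amounts to
\begin{equation*}
\frac{n}{c}\Big(\frac{n}{c} - 1\Big) = 100 \times 99 = 9900
\end{equation*}
inter-clique edges, i.e., $9.9$ per node on average on top of the $c - 1 = 9$
intra-clique edges maintained by each node.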
In this last series of experiments, we evaluate the effect of choosing sparser
inter-clique topologies on the convergence speed for a larger network of 1000
nodes. We compare the scalability and convergence speed of several
D-Cliques variants, which all start from the same clique structure and thus
use $O(nc)$ intra-clique edges.
The inter-clique topology with (almost) the fewest possible edges is a
\textit{ring}, which uses $\frac{n}{c}$ inter-clique edges and therefore
scales linearly in $n$.
We also consider another topology that scales linearly and achieves a
logarithmic
bound on the
average shortest number of hops between two nodes. In this hierarchical scheme
that we call \textit{fractal}, cliques are
assembled in
larger groups of $c$ cliques that are connected internally with one edge per
pair of cliques, but with only one edge between pairs of larger groups. The
topology is built recursively such that $c$ groups will themselves form a
larger group at the next level up. This results in $O(nc)$ edges in total if
edges are evenly distributed, and therefore also scales linearly in the number
of nodes.
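A possible construction of this fractal scheme is sketched below. Since the
description above leaves the choice of edge endpoints open, this illustrative
Python sketch (with our own function names) draws them at random within each
group.
\begin{verbatim}
import itertools
import random

def fractal_interclique(graph, groups, c, rng=None):
    # `groups` initially contains the cliques (lists of node ids).
    # At each level, form super-groups of c groups, add one edge
    # between every pair of groups inside a super-group, then
    # recurse so that pairs of higher-level groups are joined by
    # a single edge.
    rng = rng or random.Random(0)
    if len(groups) <= 1:
        return graph
    supers = [groups[i:i + c] for i in range(0, len(groups), c)]
    for super_group in supers:
        for g1, g2 in itertools.combinations(super_group, 2):
            graph.add_edge(rng.choice(g1), rng.choice(g2))
    # Merge each super-group into a single pool for the next level.
    return fractal_interclique(graph, [sum(s, []) for s in supers],
                               c, rng)
\end{verbatim}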
Finally, we propose to connect cliques according to a
small-world-like topology~\cite{watts2000small} applied on top of a
ring~\cite{stoica2003chord}. In this scheme, cliques are first arranged in a
ring. Then each clique adds symmetric edges, both clockwise and
counter-clockwise on the ring, with the $ns$ closest cliques in sets of
cliques that are exponentially bigger the further they are on the ring (see
Algorithm~\ref{Algorithm:Smallworld} in the appendix for
details on the construction). This ensures good connectivity with other
cliques that are close on the ring, while still keeping the average shortest
path small. This scheme uses $2(ns)\log(\frac{n}{c})$ inter-clique edges and
therefore grows in the order of $O(n + \log(n))$ with the number of nodes.
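While the precise construction is given by
Algorithm~\ref{Algorithm:Smallworld}, the following illustrative Python sketch
(again with our own function names, and under our interpretation of the
distance bands) conveys the idea.
\begin{verbatim}
import math
import random

def smallworld_interclique(graph, cliques, ns, rng=None):
    # Cliques are arranged in a ring; each clique connects to ns
    # cliques in every distance band [2^k, 2^(k+1)), clockwise and
    # counter-clockwise, so that bands farther on the ring are
    # exponentially bigger but receive the same number of edges.
    rng = rng or random.Random(0)
    m = len(cliques)
    for i in range(m):
        for k in range(max(1, int(math.log2(m)))):
            band = list(range(2 ** k, min(2 ** (k + 1), m)))
            for direction in (1, -1):
                for d in rng.sample(band, min(ns, len(band))):
                    j = (i + direction * d) % m
                    graph.add_edge(rng.choice(cliques[i]),
                                   rng.choice(cliques[j]))
    return graph
\end{verbatim}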
Figure~\ref{fig:d-cliques-cifar10-convolutional} shows the convergence speed
of all the above schemes on MNIST and CIFAR10, compared to the ideal baseline
of a single IID node performing the same number of updates per epoch
(representing the fastest convergence speed achievable if topology had no
impact). The ring
topology converges but is much slower, while our fractal scheme helps
significantly. The sweet spot appears to be the small-world
topology, as the convergence speed is almost the same as with a
fully-connected inter-clique topology but with 22\% fewer edges
(14.5 edges on average instead of 18.9). Note that we can expect bigger
gains at larger scales. Nonetheless, we stress that even the
fully-connected topology offers
significant benefits with 1000 nodes, as it represents a 98\% reduction in the
number of edges compared to fully connecting individual nodes (18.9 edges on
average instead of 999) and a 96\% reduction in the number of messages (37.8
messages per round per node on average instead of 999). Overall, these results
show that D-Cliques can gracefully scale with the number of nodes.
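As a quick sanity check on these figures, $18.9 / 999 \approx 0.019$ and
$37.8 / 999 \approx 0.038$, which indeed correspond to reductions of roughly
$98\%$ in the number of edges and $96\%$ in the number of messages.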
\begin{figure}[t]
\centering
\label{section:related-work}
In this section, we review some related work on dealing with non-IID data in
federated learning, and on the role of topology in fully decentralized
algorithms.
\paragraph{Dealing with non-IID data in server-based FL.}
Non-IID data is not much of an issue in server-based FL if
clients send their parameters to the server after each gradient update.
Problems arise when one seeks to reduce
the number of communication rounds by allowing each participant to perform
multiple local updates, as in the popular FedAvg algorithm
\cite{mcmahan2016communication}. Indeed, non-IID data can prevent
such algorithms from
converging to a good solution \cite{quagmire,scaffold}. This led to the
development of algorithms specifically designed to mitigate the impact
of non-IID data while performing
multiple local updates, using adaptive client sampling \cite{quagmire}, update
corrections \cite{scaffold} or regularization in the local objective
\cite{fedprox}. Another direction is to embrace the non-IID scenario by
cross-gradient
aggregation \cite{cross_gradient}, or multiple averaging steps
between updates (see \cite{consensus_distance} and references therein). These
algorithms
typically require significantly more communication and/or computation, and
have only been evaluated on small-scale networks with a few tens of
nodes.\footnote{We
also observed that \cite{tang18a} is subject to numerical
instabilities when run on topologies other than rings. When
the rows and columns of $W$ do not exactly
\paragraph{Impact of topology in fully decentralized FL.} It is well
known
that the choice of network topology can affect the
convergence of fully decentralized algorithms. In theoretical convergence
rates, this is typically accounted for by a dependence on the spectral gap of
the network; see for instance
\cite{Duchi2012a,Colin2016a,lian2017d-psgd,Nedic18}.
However, for IID data, practice contradicts these classic
results, as fully decentralized algorithms have been observed to converge
essentially as fast
on sparse topologies like rings or grids as they do on a fully connected
network \cite{lian2017d-psgd,Lian2018}. Recent work
\cite{neglia2020,consensus_distance} sheds light on this phenomenon with refined convergence analyses based on differences between gradients or parameters across nodes, which are typically
smaller in the IID case. However, these results do not give any clear insight
regarding the role of the topology in the non-IID case. We note that some work
has gone into designing efficient topologies to optimize the use of
network resources (see e.g., \cite{marfoq}), but the topology is chosen
independently of how data is distributed across nodes. In summary, the role
of topology in the non-IID data scenario is not well understood and we are not
aware of prior work focusing on this question. Our work is the first to show
that an
appropriate choice of data-dependent topology can effectively compensate for
non-IID data.