provides similar convergence speed as a fully-connected topology with a
significant reduction in the number of edges and messages. In a 1000-node
topology, D-Cliques requires 98\% fewer edges and 96\% fewer total messages,
with further possible gains using a small-world topology across cliques.
% Our study paves the way for tackling more general types of data non-IIDness
% through the design of appropriate topologies.
\keywords{Decentralized Learning \and Federated Learning \and Topology \and
Non-IID Data \and Stochastic Gradient Descent}
\section{Introduction}
Machine learning is currently shifting from a \emph{centralized}
paradigm, in which models are trained on data located on a single machine or
in a data center, to \emph{decentralized} ones.
Indeed, the latter paradigm closely matches the natural data distribution
in the numerous use cases where data is collected and processed by several
independent parties (hospitals, companies, personal devices, etc.).
Federated Learning (FL) allows a set
of participants to collaboratively train machine learning models
on their joint
data while keeping it where it has been produced. Not only does this avoid
the costs of moving data, but it also mitigates privacy and confidentiality concerns~\cite{kairouz2019advances}.
Yet, such a data distribution introduces new challenges for learning systems, as
local datasets
reflect the usage and production patterns specific to each participant: they are
\emph{not} independent and identically distributed
(non-IID). More specifically, the relative frequency of different classes of examples may significantly vary
across local datasets \cite{quagmire}.
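As a minimal illustration of such label skew (a toy example of ours, not the
exact partitioning scheme used in our experiments), the following Python
snippet builds local datasets in which each node only holds examples of a
single class:
\begin{verbatim}
# Toy label-skew partition (illustration only); assumes that the number
# of nodes is at least the number of classes.
import numpy as np

def single_class_partition(labels, n_nodes, seed=0):
    rng = np.random.default_rng(seed)
    classes = np.unique(labels)
    partition = [[] for _ in range(n_nodes)]
    for k, c in enumerate(classes):
        # Nodes assigned to class c share its examples evenly.
        owners = [i for i in range(n_nodes) if i % len(classes) == k]
        idx = rng.permutation(np.where(labels == c)[0])
        for owner, chunk in zip(owners, np.array_split(idx, len(owners))):
            partition[owner] = chunk.tolist()
    return partition
\end{verbatim}
Under such a partition, the class proportions of every local dataset differ
maximally from the global ones.
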
A key challenge is therefore to design algorithms that
can efficiently deal with such non-IID data
\cite{kairouz2019advances,fedprox,scaffold,quagmire}.
Federated learning algorithms can be classified into two categories depending
on the underlying network topology they run on. In server-based FL, the
network is organized according to a star topology: a central server orchestrates the training process by
iteratively aggregating model updates received from the participants
(\emph{clients}) and sending
them back the aggregated model \cite{mcmahan2016communication}. In contrast,
fully decentralized FL algorithms operate over an arbitrary graph topology
where participants communicate only with their direct neighbors
in the graph. A classic example of such algorithms is Decentralized
SGD (D-SGD) \cite{lian2017d-psgd}, in which participants alternate between
local SGD updates and model averaging with neighboring nodes.
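In our notation (introduced here only for illustration and not necessarily
matching that of \cite{lian2017d-psgd}), one round of D-SGD at node $i$ can
be sketched as a local gradient step followed by a weighted average over
neighbors:
% Illustrative sketch of one D-SGD round (requires amsmath for align*).
\begin{align*}
x_i^{(t+\frac{1}{2})} &= x_i^{(t)} - \gamma \nabla F\big(x_i^{(t)}; \xi_i^{(t)}\big), \\
x_i^{(t+1)} &= \sum_{j \in N_i \cup \{i\}} W_{ij}\, x_j^{(t+\frac{1}{2})},
\end{align*}
where $x_i^{(t)}$ is the local model of node $i$ at step $t$, $\gamma$ the
learning rate, $\xi_i^{(t)}$ a mini-batch drawn from node $i$'s local data,
$N_i$ the set of neighbors of $i$, and $W$ a mixing matrix whose nonzero
entries correspond to the edges of the topology.
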
Fully decentralized algorithms generally scale better to the large number of participants seen in ``cross-device''
applications \cite{kairouz2019advances}. Indeed, while a central
server may quickly become a bottleneck as the number of participants increases, the topology used in fully decentralized algorithms can remain sparse
enough that all participants only need to communicate with a small number of other participants, i.e. nodes have a small (constant or logarithmic) degree
\cite{lian2017d-psgd}. For IID data, recent work has shown both empirically
\cite{lian2017d-psgd,Lian2018} and theoretically \cite{neglia2020} that sparse
topologies like rings or grids do not significantly affect the convergence
speed compared to using denser topologies.
% We also note that full decentralization can also provide benefits in terms of
% privacy protection \cite{amp_dec}.
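To make the notion of sparsity concrete (using our own notation, with $n$
the number of nodes and assuming a torus for the grid), these standard
topologies have the following per-node degree and total number of edges:
\[
\text{ring: } 2 \text{ and } n, \qquad
\text{grid: } 4 \text{ and } 2n, \qquad
\text{fully-connected: } n-1 \text{ and } \tfrac{n(n-1)}{2}.
\]
The number of messages exchanged per round thus grows linearly with $n$ for
rings and grids, but quadratically for a fully-connected topology.
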
\begin{figure}[t]
\centering
% From directory results/mnist
\begin{subfigure}[b]{0.31\textwidth}
\centering
\includegraphics[width=\textwidth]{figures/ring-IID-vs-non-IID}
\caption{\label{fig:ring-IID-vs-non-IID} Ring}
\end{subfigure}
\quad
% From directory results/mnist
\begin{subfigure}[b]{0.31\textwidth}
\centering
\includegraphics[width=\textwidth]{figures/grid-IID-vs-non-IID}
\caption{\label{fig:grid-IID-vs-non-IID} Grid}
\end{subfigure}
\quad
% From directory results/mnist
\begin{subfigure}[b]{0.31\textwidth}
\centering
\includegraphics[width=\textwidth]{figures/fully-connected-IID-vs-non-IID}
\caption{\label{fig:fully-connected-IID-vs-non-IID} Fully-connected}
\end{subfigure}
\caption{IID vs non-IID convergence speed of decentralized SGD for
logistic regression on
MNIST for different topologies. Bold lines show the average test
accuracy across nodes
while thin lines show the minimum
and maximum accuracy of individual nodes. While the effect of topology
is negligible for IID data, it is very significant in the
non-IID case. When fully-connected, both cases converge similarly. See
Section~\ref{section:experimental-settings} for details on
the experimental setup.}
\label{fig:iid-vs-non-iid-problem}
\end{figure}
In contrast to the IID case, however, our experiments demonstrate that \emph{the
impact of topology is extremely significant for non-IID data}. This phenomenon is illustrated
in Figure~\ref{fig:iid-vs-non-iid-problem}: we observe that a ring or
a grid topology clearly jeopardizes the convergence speed when the relative
class frequencies of local distributions differ from those of the global
distribution, i.e. when nodes have a \textit{local class bias}. We stress that, unlike in centralized FL
\cite{kairouz2019advances,scaffold,quagmire}, this
happens even when nodes perform a single local update before averaging the
model with their neighbors. In this paper, we address the following question:
% \textit{Are there regular topologies, i.e. where all nodes have similar or the same number of neighbours, with less connections than a fully-connected graph that retain a similar convergence speed and non-IID behaviour?}
%\textit{Are there sparse topologies with similar convergence speed as the fully connected graph under a large number of participants with local class bias?}
\textit{Can we design sparse topologies whose convergence
speed is similar to that of a fully-connected graph for
a large number of participants with local class bias?}
%AMK: do we talk about local class bias or noniidness?
%Indeed, as we show with the following contributions:
In this paper, we make the following contributions:
(1) We propose D-Cliques, a sparse topology in which nodes are organized in
interconnected cliques, i.e. locally fully-connected sets of nodes, such that
the joint data distribution of each clique is representative of the global
(IID) distribution; (2) We propose Clique Averaging, a modified version of
the standard D-SGD algorithm which decouples gradient averaging, used for
optimizing local models, from distributed averaging, used to ensure all models
converge, thereby reducing the bias introduced by inter-clique connections;
(3) We show how Clique Averaging can be used to implement unbiased momentum
that would otherwise be detrimental in a non-IID setting; (4) We demonstrate
through an extensive experimental study that our approach removes the effect
of the local class bias both for the MNIST~\cite{mnistWebsite} and
CIFAR10~\cite{krizhevsky2009learning} datasets, for training a linear model and a deep
convolutional network; (5) Finally, we demonstrate the scalability of our
approach by considering networks of up to 1000 nodes, in contrast to most
previous work on fully decentralized learning that considers only a few tens
of participants~\cite{tang18a,neglia2020,momentum_noniid,cross_gradient,consensus_distance}.
%we show that these results hold up to 1000 participants, in contrast to most previous work on fully decentralized algorithms that considers only a few tens of participants \cite{tang18a,more_refs}.
For instance, our results show that using D-Cliques in a 1000-node network
requires 98\% fewer edges ($18.9$ vs $999$ edges per participant on average), thus yielding a 96\% reduction in the total number of required messages (37.8 messages per round per node on average instead of 999) to obtain a similar convergence speed as a fully-connected topology. Furthermore, an additional 22\% improvement
% (14.5 edges per node on average instead of 18.9)
is possible when using a small-world inter-clique topology, with further potential gains at larger scales because of its linear-logarithmic scaling.
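To give a first intuition of how contribution (1) translates into a
topology, the following simplified Python sketch (an illustration of ours,
assuming each node holds data from a single majority class; the actual
construction is described in Section~\ref{section:d-cliques}) groups nodes
into cliques that cover as many distinct classes as possible:
\begin{verbatim}
# Simplified, illustrative grouping of nodes into class-diverse cliques;
# not the exact construction described later in the paper.
from collections import defaultdict

def build_cliques(node_majority_class, clique_size=10):
    by_class = defaultdict(list)
    for node, c in node_majority_class.items():
        by_class[c].append(node)
    cliques, current = [], []
    # Round-robin over classes: repeatedly take one node of each class.
    while any(by_class.values()):
        for c in sorted(by_class):
            if by_class[c]:
                current.append(by_class[c].pop())
                if len(current) == clique_size:
                    cliques.append(current)
                    current = []
    if current:
        cliques.append(current)  # last clique may be smaller
    return cliques
\end{verbatim}
For example, with 1000 nodes, 10 classes and cliques of size 10, this yields
100 cliques of 10 fully-connected nodes; if, in addition, every pair of
cliques is connected by a single edge, each node has $9 + 99/10 = 18.9$
edges on average, consistent with the figure quoted above.
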
The rest of this paper is organized as follows. We first present the problem
statement and our methodology (Section~\ref{section:problem}). The D-Cliques
design is presented in Section~\ref{section:d-cliques} along with an
empirical illustration of its benefits. In
Section~\ref{section:clique-averaging-momentum}, we
show how to further reduce bias with Clique Averaging and how to use it to
implement momentum. We present the results of our extensive experimental
study in Section~\ref{section:non-clustered}. We review some related work in
Section~\ref{section:related-work}, and conclude with promising directions
for future work in Section~\ref{section:conclusion}.
%When % then explain how to construct D-Cliques and show their benefits (Section~\ref{section:d-cliques}). We show how to further reduce bias with Clique Averaging (Section~\ref{section:clique-averaging}). We then show how to use Clique Averaging to implement momentum (Section~\ref{section:momentum}). Having shown the effectiveness of D-Cliques, we evaluate the importance of clustering (Section~\ref{section:non-clustered}), and full intra-clique connections (Section~\ref{section:intra-clique-connectivity}). Having established the design, we then study how best to scale it (Section~\ref{section:interclique-topologies}). We conclude with a survey of related work (Section~\ref{section:related-work}) and a brief summary of the paper (Section~\ref{section:conclusion}).
%\footnotetext{This is different from the accuracy of the average model across nodes that is sometimes used once training is completed.}
\section{Problem Statement}
\section{Optimizing with Clique Averaging}
\label{section:clique-averaging-momentum}
In this section, we present Clique Averaging, a feature that further reduces the bias introduced by non-IID data.
%AMK: check