% If your build breaks (sometimes temporarily if a hyperlink spans a page)
...
topologies like rings or grids do not significantly affect the convergence
speed compared to using denser topologies.
\begin{figure*}[t]
\centering
% From directory results/mnist
...
\label{fig:iid-vs-non-iid-problem}
\end{figure*}
In contrast to the IID case, however, our experiments demonstrate that \emph{the impact of topology is extremely significant for non-IID data}. This phenomenon is illustrated
in Figure~\ref{fig:iid-vs-non-iid-problem}: we observe that a ring or
a grid topology clearly jeopardizes the convergence speed when local
distributions do not have a relative frequency of classes similar to the global
distribution, i.e. when they exhibit \textit{local class bias}. We stress
that, unlike in centralized FL
\cite{kairouz2019advances,scaffold,quagmire}, this
happens even when nodes perform a single local update before averaging the
model with their neighbors. In this paper, we address the following question:
\textit{Can we design sparse topologies with a convergence
speed similar to that obtained in a fully connected network, for
a large number of participants with local class bias?}
Specifically, we make the following contributions:
(1) We propose D-Cliques, a sparse topology in which nodes are organized in
interconnected cliques, i.e. locally fully-connected sets of nodes, such that
the joint data distribution of each clique is close to that of the global
(IID) distribution; (2) We propose Greedy Swap, an algorithm for efficiently
constructing such cliques in the presence of the data heterogeneity
previously studied in the context of Federated
Learning~\cite{mcmahan2016communication};
(3) We propose Clique Averaging, a modified version of
the standard D-SGD algorithm which decouples gradient averaging, used for
optimizing local models, from distributed averaging, used to ensure all models
converge, thereby reducing the bias introduced by inter-clique connections;
(4) We show how Clique Averaging can be used to implement unbiased momentum
that would otherwise be detrimental in the non-IID setting; (5) We demonstrate
through an extensive experimental study that our approach removes the effect
of local class bias on the MNIST~\cite{mnistWebsite} and
CIFAR10~\cite{krizhevsky2009learning} datasets, for training a linear model and a deep
convolutional network; (6) Finally, we demonstrate the scalability of our
approach by considering up to 1000-node networks, in contrast to most
previous work on fully decentralized learning that considers only a few tens
of nodes
...
% (14.5 edges per node on average instead of 18.9)
is possible when using a small-world inter-clique topology, with further potential gains at larger scales because of its quasilinear scaling, $O(n \log(n))$, in the number of nodes $n$.
The rest of this paper is organized as follows. We first present the problem
statement and our methodology (Section~\ref{section:problem}). The D-Cliques
design is presented in Section~\ref{section:d-cliques}, along with an
empirical illustration of its benefits. In
Section~\ref{section:clique-averaging-momentum}, we
show how to further reduce bias with Clique Averaging and how to use it to
implement momentum. We present the results of our extensive experimental
study in Section~\ref{section:non-clustered}. We review related work in
Section~\ref{section:related-work}, and conclude with promising directions
for future work in Section~\ref{section:conclusion}.
\section{Problem Statement}
\label{section:problem}
We consider a set $N = \{1, \dots, n\}$ of $n$ nodes seeking to
collaboratively solve a classification task over a set $L$ of classes. Each node has access to a local dataset that
follows its own local distribution $D_i$. The goal is to find a global model
$x$ that performs well on the union of the local distributions by minimizing
the average training loss.
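One standard way to write this objective, following the D-SGD formulation of~\cite{lian2017d-psgd} (which we assume here), is
\begin{equation}
\min_{x} \frac{1}{n} \sum_{i \in N} \mathbb{E}_{s_i \sim D_i} \left[ F(x; s_i) \right],
\end{equation}
where $F(x; s_i)$ is the loss of model $x$ on a sample $s_i$ drawn from $D_i$.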
...
must be doubly
stochastic ($\sum_{j \in N} W_{ij} = 1$ and $\sum_{j \in N} W_{ji} = 1$) and
symmetric, i.e. $W_{ij} = W_{ji}$~\cite{lian2017d-psgd}.
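For instance, one standard choice satisfying both constraints (given here as an illustration; the weights actually used in this paper are not specified above) is the Metropolis-Hastings rule
\begin{equation}
W_{ij} =
\begin{cases}
\frac{1}{1 + \max(d_i, d_j)} & \text{if}~\{i,j\} \in E \\
1 - \sum_{k \neq i} W_{ik} & \text{if}~i = j \\
0 & \text{otherwise}
\end{cases}
\end{equation}
where $E$ is the set of edges of the topology and $d_i$ the degree of node $i$; the resulting $W$ is symmetric by construction and therefore doubly stochastic.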
\begin{algorithm}[t]
   \caption{D-SGD, Node $i$}
   \label{Algorithm:D-PSGD}
   \begin{algorithmic}[1]
      \STATE \textbf{Require:} initial model parameters $x_i^{(0)}$,
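      % The lines below complete the environment with the standard D-SGD
      % steps (cf.~lian2017d-psgd); they are an assumed reconstruction, as
      % the original remainder of this algorithm is not shown.
      \STATE ~~learning rate $\gamma$, mixing weights $W$, number of steps $K$
      \FOR{$k = 1, \dots, K$}
         \STATE $s_i^{(k)} \leftarrow \text{mini-batch sampled from}~D_i$
         \STATE $x_i^{(k-\frac{1}{2})} \leftarrow x_i^{(k-1)} - \gamma \nabla F(x_i^{(k-1)}; s_i^{(k)})$
         \STATE $x_i^{(k)} \leftarrow \sum_{j \in N} W_{ji} x_j^{(k-\frac{1}{2})}$
      \ENDFOR
   \end{algorithmic}
\end{algorithm}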
In D-Cliques, we address the issues of non-IIDness by carefully designing a
network topology composed of \textit{cliques} and \textit{inter-clique
connections}.
First, D-Cliques recover a balanced representation of classes, close to
that of the IID case, by constructing a topology such that each node $i \in N$ is
part of a \textit{clique} $C$ whose joint distribution $D_C = \bigcup_{i \in C} D_i$ is close to the global distribution $D = \bigcup_{i \in N} D_i$. We measure the closeness of $D_C$ to $D$ by its \textit{skew}, i.e. the sum, over all classes $l \in L$, of the absolute differences between the probabilities that a sample $(x,y)$ belongs to class $l$ under $D_C$ and under $D$:
\begin{equation}
\textit{skew}(C) = \sum_{l \in L} \left| p_{D_C}(y = l) - p_{D}(y = l) \right|
\end{equation}
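To make this measure concrete, the following Python sketch estimates the skew of a clique from empirical label frequencies; the data layout (plain lists of sample labels) is an illustrative assumption, not part of the construction itself.
\begin{verbatim}
from collections import Counter

def skew(clique_labels, global_labels, classes):
    # Sum over classes of |p_{D_C}(y=l) - p_D(y=l)|, estimated
    # from the label frequencies observed in the clique and globally.
    p_c = Counter(clique_labels)
    p_d = Counter(global_labels)
    n_c, n_d = len(clique_labels), len(global_labels)
    return sum(abs(p_c[l] / n_c - p_d[l] / n_d) for l in classes)
\end{verbatim}
For instance, \texttt{skew([0, 0, 1], [0, 1, 0, 1], classes=[0, 1])} returns $|2/3 - 1/2| + |1/3 - 1/2| = 1/3$.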
Second, to ensure global consensus and convergence,
...
\caption{\label{fig:d-clique-mnist-clique-avg} Effect of Clique Averaging on MNIST. Y-axis starts at 89.}
\end{figure}
As illustrated in Figure~\ref{fig:d-clique-mnist-clique-avg}, Clique Averaging
significantly reduces the variance of models across nodes and accelerates
convergence, reaching the same level as with a
fully-connected topology. Note that Clique Averaging induces a small
additional cost, as gradients
and models need to be sent in two separate rounds of messages. Nonetheless, compared to fully connecting all nodes, the total number of messages is reduced by $\approx 80\%$.
\subsection{Implementing Momentum with Clique Averaging}
\label{section:momentum}
...
even fails to converge. Not using momentum actually gives a faster
convergence, but there is a significant gap compared to the case of a single
IID node with momentum.
\begin{figure}[t]
\centering
% To regenerate figure, from results/cifar10
...
\caption{\label{fig:cifar10-momentum} Non-IID Effect of Momentum on CIFAR10 with LeNet}
\end{figure}
We show here that Clique Averaging (Section~\ref{section:clique-averaging})
allows us to compute an unbiased momentum from the
unbiased average gradient $g_i^{(k)}$ of Algorithm~\ref{Algorithm:Clique-Unbiased-D-PSGD}:
\begin{equation}
v_i^{(k)} \leftarrow m v_i^{(k-1)} + g_i^{(k)}
\end{equation}
It then suffices to modify the original gradient step to use momentum:
\begin{equation}
x_i^{(k-\frac{1}{2})} \leftarrow x_i^{(k-1)} - \gamma v_i^{(k)}
\end{equation}
This simple modification, enabled by Clique Averaging, restores the benefits of momentum and closes the gap
with the centralized setting.
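To illustrate the decoupling, the following toy Python sketch applies momentum to the clique-averaged gradient before the model-averaging step. For brevity it assumes a single clique containing all nodes, a uniform mixing matrix and a least-squares loss; it sketches the idea rather than our actual implementation.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
n, d, gamma, m = 4, 3, 0.1, 0.9
X = [rng.normal(size=(8, d)) for _ in range(n)]  # local features
y = [Xi @ np.ones(d) for Xi in X]                # local targets
models = [np.zeros(d) for _ in range(n)]
v = [np.zeros(d) for _ in range(n)]              # momentum buffers
W = np.full((n, n), 1.0 / n)                     # doubly stochastic mixing
clique = {i: range(n) for i in range(n)}         # toy: one global clique

def local_grad(j):
    # Gradient of the mean squared error on node j's local data.
    return X[j].T @ (X[j] @ models[j] - y[j]) / len(y[j])

for k in range(100):
    # Clique Averaging: average gradients within the clique...
    g = [np.mean([local_grad(j) for j in clique[i]], axis=0)
         for i in range(n)]
    for i in range(n):
        v[i] = m * v[i] + g[i]          # unbiased momentum update
    half = [models[i] - gamma * v[i] for i in range(n)]
    # ...then distributed averaging of models with the mixing matrix W.
    models = [sum(W[j, i] * half[j] for j in range(n)) for i in range(n)]
\end{verbatim}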
\section{Comparative Evaluation and Extensions}
\label{section:non-clustered}
In this section, we first compare D-Cliques to alternative topologies to
confirm the relevance of our main design choices. Then,
we evaluate some extensions of D-Cliques to further reduce the number of
inter-clique connections so as to gracefully scale with the number of
nodes.
\subsection{Comparing D-Cliques to Other Sparse Topologies}
We demonstrate the advantages of D-Cliques over alternative sparse topologies
...
\caption{\label{fig:d-cliques-cifar10-convolutional} D-Cliques Convergence Speed with 1000 nodes, non-IID, Constant Updates per Epoch, with Different Inter-Clique Topologies.}
\end{figure*}
\subsection{Cost of Constructing Cliques}
\label{section:cost-cliques}
\dots\todo{EL: Add plots showing convergence speed in terms of skew vs iteration number, as well as absolute computation time}
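Pending the plots announced above, the following Python sketch gives one plausible instantiation of the Greedy Swap construction mentioned in the introduction: starting from a random partition into cliques, random inter-clique node swaps are kept whenever they decrease the sum of clique skews. The swap schedule and stopping rule are our own assumptions, and \texttt{skew} refers to the earlier sketch.
\begin{verbatim}
import random

def greedy_swap(node_labels, clique_size, classes, steps=1000, seed=1):
    # node_labels: dict node -> list of sample labels.
    rng = random.Random(seed)
    nodes = list(node_labels)
    rng.shuffle(nodes)
    cliques = [nodes[i:i + clique_size]
               for i in range(0, len(nodes), clique_size)]
    global_labels = [l for i in nodes for l in node_labels[i]]

    def clique_skew(c):
        labels = [l for i in c for l in node_labels[i]]
        return skew(labels, global_labels, classes)

    for _ in range(steps):
        a, b = rng.sample(range(len(cliques)), 2)
        i = rng.randrange(len(cliques[a]))
        j = rng.randrange(len(cliques[b]))
        before = clique_skew(cliques[a]) + clique_skew(cliques[b])
        cliques[a][i], cliques[b][j] = cliques[b][j], cliques[a][i]
        if clique_skew(cliques[a]) + clique_skew(cliques[b]) > before:
            # The swap increased total skew: revert it.
            cliques[a][i], cliques[b][j] = cliques[b][j], cliques[a][i]
    return cliques
\end{verbatim}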
\section{Related Work}
\label{section:related-work}
...
\appendix
\section{Detailed Algorithms}
We present a more detailed and precise explanation of the algorithm used to establish a small-world
inter-clique topology (Algorithm~\ref{Algorithm:Smallworld}).
% Algorithm~\ref{Algorithm:D-Clique-Construction} shows the overall approach
% for constructing a D-Cliques topology in the non-IID case.\footnote{An IID
% version of D-Cliques, in which each node has an equal number of examples of
% all classes, can be implemented by picking $\#L$ nodes per clique at random.}
% It expects the following inputs: $L$, the set of all classes present in the global distribution $D = \bigcup_{i \in N} D_i$; $N$, the set of all nodes; a function $classes(S)$, which given a subset $S$ of nodes in $N$ returns the set of classes in their joint local distributions ($D_S = \bigcup_{i \in S} D_i$); a function $intraconnect(DC)$, which given $DC$, a set of cliques (set of sets of nodes), creates a set of edges ($\{\{i,j\}, \dots\}$) connecting all nodes within each clique to one another; a function $interconnect(DC)$, which given a set of cliques, creates a set of edges ($\{\{i,j\}, \dots\}$) connecting nodes belonging to different cliques; and a function $weights(E)$, which given a set of edges, returns the weight matrix $W$. Algorithm~\ref{Algorithm:D-Clique-Construction} returns both $W$, for use in D-SGD (Algorithms~\ref{Algorithm:D-PSGD} and~\ref{Algorithm:Clique-Unbiased-D-PSGD}), and $DC$, for use with Clique Averaging (Algorithm~\ref{Algorithm:Clique-Unbiased-D-PSGD}).
% \begin{algorithm}[h]
%    \caption{D-Cliques Construction}
...
Algorithm~\ref{Algorithm:Smallworld} instantiates the function
$interconnect(DC)$ with a
small-world inter-clique topology as described in Section~\ref{section:interclique-topologies}. It adds a
linear number of inter-clique edges by first arranging cliques on a ring. It then adds a logarithmic number of ``finger'' edges from each clique to cliques further away on the ring: the ring is partitioned into sets of cliques whose size grows exponentially with the distance along the ring, and a constant number of edges is added per set. ``Finger'' edges are added symmetrically on both sides of the ring, to the cliques in each set that are closest to the originating clique.
\begin{algorithm}[h]
   \caption{$\textit{smallworld}(DC)$: adds $O(\# N \log(\# N))$ edges}
   \label{Algorithm:Smallworld}
   \begin{algorithmic}[1]
      \STATE \textbf{Require:} set of cliques $DC$ (set of sets of nodes)
      \STATE ~~size of neighborhood $ns$ (default 2)
      \STATE ~~function $\textit{least\_edges}(S, E)$ that returns one of the nodes in $S$ with the least number of edges in $E$
      \STATE $E \leftarrow \emptyset$ \COMMENT{Set of edges}
      \STATE $L \leftarrow [C~\text{for}~C \in DC]$ \COMMENT{Arrange cliques in a list}
      \FOR{$i \in \{1,\dots,\#DC\}$} % For every clique
         % For sets of cliques exponentially further away from $i$