Commit ca75fd63 authored by Erick Lavoie

Reworked first sections

parent 1045e163
@@ -9,14 +9,16 @@
\usepackage[utf8]{inputenc}
\usepackage{amsmath}
\usepackage{amsfonts}
\usepackage{mathtools}
\usepackage{amssymb}
\usepackage{xcolor}
\usepackage{soul}
\usepackage{algorithm}
\usepackage[noend]{algorithmic}
%\usepackage[noend]{algpseudocode}
\usepackage{dsfont}
\usepackage{caption}
\usepackage{subcaption}
\usepackage{todonotes}
% hyperref makes hyperlinks in the resulting PDF.
% If your build breaks (sometimes temporarily if a hyperlink spans a page)
@@ -157,21 +159,7 @@ enough such that all participants need only to communicate with a small number of
topologies like rings or grids do not significantly affect the convergence
speed compared to using denser topologies.
\begin{figure*}[ht]
\centering
% From directory results/mnist
@@ -210,21 +198,39 @@ model with their neighbors. In this paper, we address the following question:
\label{fig:iid-vs-non-iid-problem}
\end{figure*}
In contrast to the IID case, however, our experiments demonstrate that \emph{the impact of topology is extremely significant for non-IID data}. This phenomenon is illustrated
in Figure~\ref{fig:iid-vs-non-iid-problem}: we observe that a ring or
a grid topology clearly jeopardizes the convergence speed when local
distributions do not have a relative frequency of classes similar to the global
distribution, i.e. when they exhibit \textit{local class bias}. We stress the fact
that, unlike in centralized FL
\cite{kairouz2019advances,scaffold,quagmire}, this
happens even when nodes perform a single local update before averaging the
model with their neighbors. In this paper, we address the following question:
\textit{Can we design sparse topologies with convergence
speed similar to the one obtained in a fully connected network, for
a large number of participants with local class bias?}
Specifically, we make the following contributions:
(1) We propose D-Cliques, a sparse topology in which nodes are organized in
interconnected cliques, i.e. locally fully-connected sets of nodes, such that
the joint data distribution of each clique is close to that of the global
(IID) distribution; (2) We propose Greedy Swap, an algorithm for constructing
such cliques efficiently under the class heterogeneity previously studied in the context of Federated Learning~\cite{mcmahan2016communication};
(3) We propose Clique Averaging, a modified version of
the standard D-SGD algorithm which decouples gradient averaging, used for
optimizing local models, from distributed averaging, used to ensure all models
converge, therefore reducing the bias introduced by inter-clique connections;
(4) We show how Clique Averaging can be used to implement unbiased momentum
that would otherwise be detrimental in the non-IID setting; (5) We
demonstrate
through an extensive experimental study that our approach removes the effect
of the local class bias on the MNIST~\cite{mnistWebsite} and
CIFAR10~\cite{krizhevsky2009learning} datasets, for training a linear model and a deep
convolutional network; (6) Finally, we demonstrate the scalability of our
approach by considering up to 1000-node networks, in contrast to most
previous work on fully decentralized learning that considers only a few tens
of nodes
@@ -237,23 +243,24 @@ thereby yielding a 96\% reduction in the total number of required messages
% (14.5 edges per node on average instead of 18.9)
is possible when using a small-world inter-clique topology, with further potential gains at larger scales because of its quasilinear scaling ($O(n \log(n))$) in $n$, the number of nodes.
The rest of this paper is organized as follows \dots \todo{EL: Complete once structure stabilizes}
%We first present the problem
%statement and our methodology (Section~\ref{section:problem}). The D-Cliques
%design is presented in Section~\ref{section:d-cliques}) along with an
%empirical illustration of its benefits. In
%Section~\ref{section:clique-averaging-momentum}, we
%show how to further reduce bias with Clique Averaging and how to use it to
%implement momentum. We present the results of our extensive experimental
%study in Section~\ref{section:non-clustered}. We review some related work in
% Section~\ref{section:related-work}, and conclude with promising directions
% for future work in Section~\ref{section:conclusion}.

\section{Problem Statement}
\label{section:problem}

We consider a set $N = \{1, \dots, n \}$ of $n$ nodes seeking to
collaboratively solve a classification task with a set of classes $L$. Each node has access to a local dataset that
follows its own local distribution $D_i$. The goal is to find a global model
$x$ that performs well on the union of the local distributions by minimizing
the average training loss:
@@ -294,100 +301,21 @@ must be doubly
stochastic ($\sum_{j \in N} W_{ij} = 1$ and $\sum_{j \in N} W_{ji} = 1$) and
symmetric, i.e. $W_{ij} = W_{ji}$~\cite{lian2017d-psgd}.
\begin{algorithm}[t]
   \caption{D-SGD, Node $i$}
   \label{Algorithm:D-PSGD}
   \begin{algorithmic}[1]
      \STATE \textbf{Require:} initial model parameters $x_i^{(0)}$,
      learning rate $\gamma$, mixing weights $W$, mini-batch size $m$,
      number of steps $K$
      \FOR{$k = 1,\ldots, K$}
        \STATE $s_i^{(k)} \gets \text{mini-batch sample of size $m$ drawn
        from~} D_i$
        \STATE $x_i^{(k-\frac{1}{2})} \gets x_i^{(k-1)} - \gamma \nabla F(x_i^{(k-1)}; s_i^{(k)})$
        \STATE $x_i^{(k)} \gets \sum_{j \in N} W_{ji}^{(k)} x_j^{(k-\frac{1}{2})}$
      \ENDFOR
   \end{algorithmic}
\end{algorithm}
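For illustration, one synchronous round of Algorithm~\ref{Algorithm:D-PSGD} over all nodes can be sketched in a few lines of Python. This is only a simulation-style sketch: the helpers \texttt{sample} and \texttt{grad}, as well as the dense representation of the mixing matrix $W$, are assumptions made for readability rather than part of the algorithm itself.
\begin{verbatim}
import numpy as np

def d_sgd_round(x, W, sample, grad, gamma):
    """One synchronous D-SGD round over all nodes (sketch of Algorithm D-PSGD).

    x:      array of shape (n, d), one model per node
    W:      doubly stochastic, symmetric mixing matrix of shape (n, n)
    sample: sample(i) -> mini-batch drawn from node i's local distribution D_i
    grad:   grad(x_i, batch) -> stochastic gradient of F at x_i on the batch
    gamma:  learning rate
    """
    n = x.shape[0]
    # Local SGD step on each node's own mini-batch.
    x_half = np.stack([x[i] - gamma * grad(x[i], sample(i)) for i in range(n)])
    # Average models with neighbors according to the mixing weights W.
    return W @ x_half
\end{verbatim}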
\section{D-Cliques: Creating Locally Representative Cliques}
\label{section:d-cliques}
@@ -433,17 +361,24 @@ impractical.
In D-Cliques, we address the issues of non-IIDness by carefully designing a
network topology composed of \textit{cliques} and \textit{inter-clique
connections}.

First, D-Cliques recover a balanced representation of classes, close to
that of the IID case, by constructing a topology such that each node $i \in N$ is
part of a \textit{clique} $C$ whose joint distribution $D_C = \bigcup_{i \in C} D_i$ is close to the global distribution $D = \bigcup_{i \in N} D_i$. We measure the closeness of $D_C$ to $D$ by its \textit{skew}, i.e. the sum, over all classes $l \in L$, of the absolute differences between the probability that a sample $(x,y)$ has class $l$ under $D_C$ and under $D$:
\begin{equation}
\label{eq:skew}
\begin{split}
\textit{skew}(C) =\
\sum\limits_{\substack{l \in L}} | p(y = l~|(x,y) \in D_C) - \\ p(y = l~|(x,y) \in D) |
\end{split}
\end{equation}
Second, to ensure a global consensus and convergence,
\textit{inter-clique connections}
are introduced by connecting a small number of node pairs that are
part of different cliques. In the following, we introduce up to one inter-clique
connection per node such that each clique has exactly one
edge with all other cliques; see Figure~\ref{fig:d-cliques-figure} for the
corresponding D-Cliques network in the case of $n=100$ nodes and $c=10$
classes. We will explore sparser inter-clique topologies in Section~\ref{section:interclique-topologies}.
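For illustration, the skew of Equation~\ref{eq:skew} can be computed directly from per-node class counts. The following Python sketch assumes that each node, and hence each clique, is summarized by a vector of per-class example counts; the names are illustrative only.
\begin{verbatim}
import numpy as np

def skew(clique_counts, global_counts):
    """Skew of a clique (Eq. skew): sum over classes of the absolute difference
    between the class proportions in the clique and in the global distribution.

    clique_counts: per-class example counts aggregated over the clique's nodes
    global_counts: per-class example counts over all nodes
    """
    p_clique = clique_counts / clique_counts.sum()
    p_global = global_counts / global_counts.sum()
    return np.abs(p_clique - p_global).sum()

# Example: 10 globally balanced classes, a clique holding only classes 0 and 1.
global_counts = np.full(10, 100)
clique_counts = np.array([50, 50, 0, 0, 0, 0, 0, 0, 0, 0])
print(skew(clique_counts, global_counts))  # 2*|0.5-0.1| + 8*|0-0.1| = 1.6
\end{verbatim}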
@@ -461,57 +396,61 @@ topology, namely:
\label{eq:metro}
\end{equation}
We construct D-Cliques by initializing cliques at random, using at most $M$
nodes per clique to limit intra-clique communication costs. We then
repeatedly pick two cliques at random and swap a pair of nodes between them
whenever the swap decreases the sum of their skews, keeping
the size of the cliques constant (see Algorithm~\ref{Algorithm:D-Clique-Construction}).
Since only swaps that decrease the skew are performed, this algorithm can be
seen as a form of randomized greedy algorithm. We note that it only requires
the knowledge of the local class distribution at each node. For the sake of
simplicity, we assume that D-Cliques are constructed from the global
knowledge of these distributions, which can easily be obtained by
decentralized averaging in a pre-processing step.
\begin{algorithm}[h]
   \caption{D-Cliques Construction: Greedy Swap}
   \label{Algorithm:D-Clique-Construction}
\begin{algorithmic}[1]
\STATE \textbf{Require:} Max clique size $M$, Max steps $K$,
\STATE Set of all nodes $N = \{ 1, 2, \dots, n \}$,
\STATE $\textit{skew}(S)$: skew of subset $S \subseteq N$ compared to the global distribution (Eq.~\ref{eq:skew}),
\STATE $\textit{intra}(DC)$: edges within cliques $C \in DC$,
\STATE $\textit{inter}(DC)$: edges between $C_1,C_2 \in DC$ (Sec.~\ref{section:interclique-topologies}),
\STATE $\textit{weights}(E)$: set weights to edges in $E$ (Eq.~\ref{eq:metro}).
\STATE ~~
\STATE $DC \leftarrow []$ \COMMENT{Empty list}
\WHILE {$N \neq \emptyset$}
\STATE $C \leftarrow$ sample $M$ nodes from $N$ at random
\STATE $N \leftarrow N \setminus C$; $DC.append(C)$
\ENDWHILE
\FOR{$k \in \{1, \dots, K\}$}
\STATE $C_1,C_2 \leftarrow$ sample 2 from $DC$ at random
\STATE $\textit{swaps} \leftarrow []$
\FOR{$n_1 \in C_1, n_2 \in C_2$}
           \STATE $s \leftarrow \textit{skew}(C_1) + \textit{skew}(C_2)$
\STATE $s' \leftarrow \textit{skew}(C_1-n_1+n_2) + \textit{skew}(C_2 -n_2+n_1)$
\IF {$s' < s$}
\STATE \textit{swaps}.append($(n_1, n_2)$)
\ENDIF
\ENDFOR
\IF {\#\textit{swaps} $> 0$}
\STATE $(n_1,n_2) \leftarrow$ sample 1 from $\textit{swaps}$ at random
          \STATE $C_1 \leftarrow C_1 - n_1 + n_2; C_2 \leftarrow C_2 - n_2 + n_1$
\ENDIF
\ENDFOR
      \RETURN $(\textit{weights}(\textit{intra}(DC) \cup \textit{inter}(DC)), DC)$
\end{algorithmic}
\end{algorithm}
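For concreteness, the construction can also be sketched in Python as follows, reusing the \texttt{skew} helper sketched earlier and assuming a list of per-node class-count vectors; names and data layout are illustrative rather than prescriptive.
\begin{verbatim}
import random
import numpy as np

def greedy_swap(node_counts, M, K, seed=0):
    """Randomized greedy D-Cliques construction (sketch of Greedy Swap).

    node_counts: list of per-class example count vectors, one per node
    M:           maximum clique size
    K:           number of improvement steps
    Returns a list of cliques, each a list of node indices.
    """
    rng = random.Random(seed)
    global_counts = np.sum(node_counts, axis=0)

    def clique_skew(clique):
        counts = np.sum([node_counts[i] for i in clique], axis=0)
        return skew(counts, global_counts)  # skew() as sketched above

    # Initialize cliques of (at most) M nodes at random.
    nodes = list(range(len(node_counts)))
    rng.shuffle(nodes)
    cliques = [nodes[i:i + M] for i in range(0, len(nodes), M)]

    for _ in range(K):
        c1, c2 = rng.sample(range(len(cliques)), 2)
        s = clique_skew(cliques[c1]) + clique_skew(cliques[c2])
        # Collect all single-node swaps between the two cliques that
        # decrease the sum of their skews.
        swaps = []
        for n1 in cliques[c1]:
            for n2 in cliques[c2]:
                new_c1 = [n for n in cliques[c1] if n != n1] + [n2]
                new_c2 = [n for n in cliques[c2] if n != n2] + [n1]
                if clique_skew(new_c1) + clique_skew(new_c2) < s:
                    swaps.append((n1, n2))
        if swaps:
            n1, n2 = rng.choice(swaps)
            cliques[c1] = [n for n in cliques[c1] if n != n1] + [n2]
            cliques[c2] = [n for n in cliques[c2] if n != n2] + [n1]
    return cliques
\end{verbatim}
In this sketch, as in Algorithm~\ref{Algorithm:D-Clique-Construction}, candidate swaps are recomputed from scratch at each step; see Section~\ref{section:cost-cliques} regarding the cost of construction.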
The key idea of D-Cliques is that because the clique-level distribution $D_C$
is representative of the global distribution $D$,
the local models of nodes across cliques remain rather close. Therefore, a
sparse inter-clique topology can be used, significantly reducing the total
number of edges without slowing down the convergence. Furthermore, the degree
of each node in the network remains low and even, making the D-Cliques
topology very well-suited to decentralized federated learning.
\section{Optimizing with Clique Averaging and Momentum}
\label{section:clique-averaging-momentum}
@@ -560,37 +499,23 @@ providing an equal representation to all classes. In contrast, all neighbors'
models, including those across inter-clique edges, participate in the model
averaging step as in the original version.
\begin{algorithm}[t]
   \caption{D-SGD with Clique Averaging, Node $i$}
   \label{Algorithm:Clique-Unbiased-D-PSGD}
   \begin{algorithmic}[1]
      \STATE \textbf{Require:} initial model parameters $x_i^{(0)}$, learning
      rate $\gamma$, mixing weights $W$, mini-batch size $m$, number of
      steps $K$
      \FOR{$k = 1,\ldots, K$}
        \STATE $s_i^{(k)} \gets \text{mini-batch sample of size $m$ drawn
        from~} D_i$
        \STATE $g_i^{(k)} \gets \frac{1}{|\textit{Clique}(i)|}\sum_{j \in \textit{Clique}(i)} \nabla F(x_j^{(k-1)}; s_j^{(k)})$
        \STATE $x_i^{(k-\frac{1}{2})} \gets x_i^{(k-1)} - \gamma g_i^{(k)}$
        \STATE $x_i^{(k)} \gets \sum_{j \in N} W_{ji}^{(k)} x_j^{(k-\frac{1}{2})}$
      \ENDFOR
   \end{algorithmic}
\end{algorithm}
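For illustration, the difference with plain D-SGD is confined to the gradient used in the local step. A minimal Python sketch of one round follows, reusing the simulation-style conventions of the earlier sketch; the \texttt{clique} mapping from a node to the members of its clique is an assumption of the sketch, not of the algorithm.
\begin{verbatim}
import numpy as np

def clique_avg_round(x, W, clique, sample, grad, gamma):
    """One round of D-SGD with Clique Averaging (sketch).

    Gradients are averaged only within each node's clique, while models are
    still averaged over all neighbors (including inter-clique edges) via W.
    """
    n = x.shape[0]
    # Each node computes its own stochastic gradient.
    g = np.stack([grad(x[i], sample(i)) for i in range(n)])
    # Clique Averaging: unbiased gradient averaged over the node's clique.
    g_clique = np.stack([g[list(clique(i))].mean(axis=0) for i in range(n)])
    x_half = x - gamma * g_clique
    # Model averaging over all neighbors, as in plain D-SGD.
    return W @ x_half
\end{verbatim}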
\subsection{Implementing Momentum with Clique Averaging}
\label{section:momentum}
@@ -607,6 +532,158 @@ even fails to converge. Not using momentum actually gives a faster
convergence, but there is a significant gap compared to the case of a single
IID node with momentum.
Clique Averaging (Section~\ref{section:clique-averaging})
allows us to compute an unbiased momentum from the
unbiased average gradient $g_i^{(k)}$ of Algorithm~\ref{Algorithm:Clique-Unbiased-D-PSGD}:
\begin{equation}
v_i^{(k)} \leftarrow m v_i^{(k-1)} + g_i^{(k)}
\end{equation}
It then suffices to modify the original gradient step to use momentum:
\begin{equation}
x_i^{(k-\frac{1}{2})} \leftarrow x_i^{(k-1)} - \gamma v_i^{(k)}
\end{equation}
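In the same sketch-style notation as before, this amounts to each node keeping a local momentum buffer $v_i$ fed by the clique-averaged gradient; the code below is illustrative only, with \texttt{m} the momentum parameter.
\begin{verbatim}
def momentum_step(x_i, v_i, g_i, gamma, m=0.9):
    """Per-node update: unbiased momentum on the clique-averaged gradient g_i."""
    v_i = m * v_i + g_i            # v_i^(k) = m * v_i^(k-1) + g_i^(k)
    x_half_i = x_i - gamma * v_i   # gradient step now uses the momentum buffer
    return x_half_i, v_i
\end{verbatim}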
\section{Comparative Evaluation and Extensions}
\label{section:non-clustered}
In this section, we first compare D-Cliques to alternative topologies to
confirm the relevance of our main design choices. Then,
we evaluate some extensions of D-Cliques to further reduce the number of
inter-clique connections so as to gracefully scale with the number of
nodes.
\subsection{Methodology}
\subsubsection{Non-IID assumptions.}
\label{section:non-iid-assumptions}
As demonstrated in Figure~\ref{fig:iid-vs-non-iid-problem}, lifting the
assumption of IID data significantly challenges the learning algorithm. In
this paper, we focus on an \textit{extreme case of local class bias}: we
consider that each node only has examples from a single class.
To isolate the effect of local class bias from other potentially compounding
factors, we make the following simplifying assumptions: (1) All classes are
equally represented in the global dataset; (2) All classes are represented on
the same number of nodes; (3) All nodes have the same number of examples.
We believe that these assumptions are reasonable in the context of our study
because: (1)
Global class
imbalance equally
affects the optimization process on a single node and is therefore not
specific to the decentralized setting; (2) Our results do not exploit specific
positions in the topology; (3) Imbalanced dataset sizes across nodes can be
addressed for instance by appropriately weighting the individual loss
functions. Our results can be extended to support additional compounding factors in future work.
\subsubsection{Experimental setup.}
\label{section:experimental-settings}
Our main goal is to provide a fair comparison of the convergence speed across
different topologies and algorithmic variations, in order to
show that our approach
can remove much of the effect of local class bias.
We experiment with two datasets: MNIST~\cite{mnistWebsite} and
CIFAR10~\cite{krizhevsky2009learning}, which both have $c=10$ classes.
For MNIST, we use 45k and 10k examples from the original 60k
training set for training and validation respectively. The remaining 5k
training examples were randomly removed to ensure all 10 classes are balanced
while ensuring that the dataset is evenly divisible across 100 and 1000 nodes.
We use all 10k examples of
the test set to measure prediction accuracy. For CIFAR10, classes are evenly
balanced: we use 45k/50k images of the original training set for training,
5k/50k for validation, and all 10k examples of the test set for measuring
prediction accuracy.
We
use a logistic regression classifier for MNIST, which
provides up to 92.5\% accuracy in the centralized setting.
For CIFAR10, we use a Group-Normalized variant of LeNet~\cite{quagmire}, a
deep convolutional network which achieves an accuracy of $72.3\%$ in the
centralized setting.
These models are thus reasonably accurate (which is sufficient to
study the effect of the topology) while being sufficiently fast to train in a
fully decentralized setting and simple enough to configure and analyze.
Regarding hyper-parameters, we jointly optimize the learning rate and
mini-batch size on the
validation set for 100 nodes, obtaining respectively $0.1$ and $128$ for
MNIST and $0.002$ and $20$ for CIFAR10.
For CIFAR10, we additionally use a momentum of $0.9$.
We evaluate 100- and 1000-node networks by creating multiple models in memory and simulating the exchange of messages between nodes.
To ignore the impact of distributed execution strategies and system
optimization techniques, we report the test accuracy of all nodes (min, max,
average) as a function of the number of times each example of the dataset has
been sampled by a node, i.e. an \textit{epoch}. This is equivalent to the classic case of a single node sampling the full distribution.
To further make results comparable across different numbers of nodes, we
scale the batch size inversely with the number of nodes,
e.g. on MNIST, 128 with 100 nodes vs. 13 with 1000 nodes. This
ensures the same number of model updates and averaging per epoch, which is
important to have a fair comparison.\footnote{Updating and averaging models
after every example can eliminate the impact of local class bias. However, the
resulting communication overhead is impractical.}
Finally, we compare our results against an ideal baseline: either a
fully-connected network topology with the same number of nodes or a single IID
node. In both cases, the topology has no effect on
the optimization. For a certain choice of number of nodes and
mini-batch size, both approaches are equivalent.
\subsection{Main Results}
\begin{figure}[t]
\centering
\begin{subfigure}[b]{0.20\textwidth}
\centering
\includegraphics[width=\textwidth]{../figures/fully-connected-cliques}
\caption{\label{fig:d-cliques-figure} D-Cliques (fully-connected
cliques)}
\end{subfigure}
\hfill
% To regenerate figure, from results/mnist
% python ../../../../Software/non-iid-topology-simulator/tools/plot_convergence.py fully-connected/all/2021-03-10-09:25:19-CET no-init-no-clique-avg/fully-connected-cliques/all/2021-03-12-11:12:49-CET --add-min-max --yaxis test-accuracy --ymin 80 --ymax 92.5 --labels '100 nodes non-IID fully-connected' '100 nodes non-IID d-cliques' --save-figure ../../figures/d-cliques-mnist-vs-fully-connected.png --legend 'lower right' --font-size 16 --linestyles 'solid' 'dashed'
\begin{subfigure}[b]{0.26\textwidth}
\centering
\includegraphics[width=\textwidth]{../figures/d-cliques-mnist-vs-fully-connected.png}
\caption{\label{fig:d-cliques-example-convergence-speed} Convergence Speed
on MNIST}
\end{subfigure}
\caption{\label{fig:d-cliques-example} D-Cliques topology and convergence
speed on MNIST.}
\end{figure}
Figure~\ref{fig:d-cliques-example-convergence-speed} illustrates the
performance of D-Cliques on MNIST with $n=100$ nodes. Observe that the
convergence speed is
very close
to that of a fully-connected topology, and significantly better than with
a ring or a grid (see Figure~\ref{fig:iid-vs-non-iid-problem}). With
100 nodes, it offers a reduction of $\approx90\%$ in the number of edges
compared to a fully-connected topology. Nonetheless, there is still
significant variance in the accuracy across nodes, which is due to the bias
introduced by inter-clique edges. We address this issue in the next section.
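As a sanity check of this reduction, with ten fully-connected cliques of ten nodes and one edge between every pair of cliques (as in Figure~\ref{fig:d-cliques-figure}), the number of undirected edges is
\[
\underbrace{10 \cdot \binom{10}{2}}_{\text{intra-clique}} + \underbrace{\binom{10}{2}}_{\text{inter-clique}} = 450 + 45 = 495,
\]
compared to $\binom{100}{2} = 4950$ edges for the fully-connected topology, i.e. a reduction of $90\%$.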
\subsection{Importance of Clique Averaging and Momentum}
% To regenerate figure, from results/mnist:
% python ../../../../Software/non-iid-topology-simulator/tools/plot_convergence.py fully-connected/all/2021-03-10-09:25:19-CET no-init-no-clique-avg/fully-connected-cliques/all/2021-03-12-11:12:49-CET no-init/fully-connected-cliques/all/2021-03-12-11:12:01-CET --add-min-max --yaxis test-accuracy --labels '100 nodes non-IID fully-connected' '100 nodes non-IID d-cliques w/o clique avg.' '100 nodes d-cliques non-IID w/ clique avg.' --legend 'lower right' --ymin 89 --ymax 92.5 --font-size 13 --save-figure ../../figures/d-clique-mnist-clique-avg.png --linestyles 'solid' 'dashed' 'dotted'
\begin{figure}[t]
\centering
\includegraphics[width=0.35\textwidth]{../figures/d-clique-mnist-clique-avg}
\caption{\label{fig:d-clique-mnist-clique-avg} Effect of Clique Averaging on MNIST. Y-axis starts at 89.}
\end{figure}
As illustrated in Figure~\ref{fig:d-clique-mnist-clique-avg}, Clique Averaging
significantly reduces the variance of models across nodes and accelerates
convergence to reach the same level as the one obtained with a
fully-connected topology. Note that Clique Averaging induces a small
additional cost, as gradients
and models need to be sent in two separate rounds of messages. Nonetheless, compared to fully connecting all nodes, the total number of messages is reduced by $\approx 80\%$.
\begin{figure}[t]
\centering
% To regenerate figure, from results/cifar10
@@ -627,31 +704,11 @@ IID node with momentum.
\caption{\label{fig:cifar10-momentum} Non-IID Effect of Momentum on CIFAR10 with LeNet}
\end{figure}
As shown in
Figure~\ref{fig:d-cliques-cifar10-momentum-non-iid-clique-avg-effect},
the use of Clique Averaging restores the benefits of momentum and closes the gap
with the centralized setting.
\subsection{Comparing D-Cliques to Other Sparse Topologies}
We demonstrate the advantages of D-Cliques over alternative sparse topologies
@@ -872,6 +929,11 @@ show that D-Cliques can nicely scale with the number of nodes.
\caption{\label{fig:d-cliques-cifar10-convolutional} D-Cliques Convergence Speed with 1000 nodes, non-IID, Constant Updates per Epoch, with Different Inter-Clique Topologies.}
\end{figure*}
\subsection{Cost of Constructing Cliques}
\label{section:cost-cliques}
\dots \todo{EL: Add plots showing convergence speed in terms of skew vs iteration number, as well as absolute computation time}
\section{Related Work}
\label{section:related-work}
@@ -991,18 +1053,16 @@ approximately recover the global distribution.
\appendix
\section{Detailed Algorithms}
We present a more detailed and precise explanation of the algorithm used to establish a small-world
inter-clique topology (Algorithm~\ref{Algorithm:Smallworld}).
% \subsection{D-Cliques Construction}
%
% Algorithm~\ref{Algorithm:D-Clique-Construction} shows the overall approach
% for constructing a D-Cliques topology in the non-IID case.\footnote{An IID
% version of D-Cliques, in which each node has an equal number of examples of
% all classes, can be implemented by picking $\#L$ nodes per clique at random.}
% It expects the following inputs: $L$, the set of all classes present in the global distribution $D = \bigcup_{i \in N} D_i$; $N$, the set of all nodes; a function $classes(S)$, which given a subset $S$ of nodes in $N$ returns the set of classes in their joint local distributions ($D_S = \bigcup_{i \in S} D_i$); a function $intraconnect(DC)$, which given $DC$, a set of cliques (set of set of nodes), creates a set of edges ($\{\{i,j\}, \dots \}$) connecting all nodes within each clique to one another; a function $interconnect(DC)$, which given a set of cliques, creates a set of edges ($\{\{i,j\}, \dots \}$) connecting nodes belonging to different cliques; and a function $weigths(E)$, which given a set of edges, returns the weighted matrix $W_{ij}$. Algorithm~\ref{Algorithm:D-Clique-Construction} returns both $W_{ij}$, for use in D-SGD (Algorithm~\ref{Algorithm:D-PSGD} and~\ref{Algorithm:Clique-Unbiased-D-PSGD}), and $DC$, for use with Clique Averaging (Algorithm~\ref{Algorithm:Clique-Unbiased-D-PSGD}).
% \begin{algorithm}[h]
% \caption{D-Cliques Construction}
@@ -1045,34 +1105,33 @@ Algorithm~\ref{Algorithm:Smallworld} instantiates the function
small-world inter-clique topology as described in Section~\ref{section:interclique-topologies}. It adds a
linear number of inter-clique edges by first arranging cliques on a ring. It then adds a logarithmic number of ``finger'' edges to other cliques on the ring, chosen so that a constant number of edges is added per set, for sets of cliques that are exponentially bigger the further away they are on the ring. ``Finger'' edges are added symmetrically on both sides of the ring to the cliques in each set that are closest to a given set.
\begin{algorithm}[h]
   \caption{$\textit{smallworld}(DC)$: adds $O(\# N \log(\# N))$ edges}
   \label{Algorithm:Smallworld}
   \begin{algorithmic}[1]
      \STATE \textbf{Require:} set of cliques $DC$ (set of set of nodes)
      \STATE ~~size of neighborhood $ns$ (default 2)
      \STATE ~~function $\textit{least\_edges}(S, E)$ that returns one of the nodes in $S$ with the least number of edges in $E$
      \STATE $E \leftarrow \emptyset$ \COMMENT{Set of Edges}
      \STATE $L \leftarrow [ C~\text{for}~C \in DC ]$ \COMMENT{Arrange cliques in a list}
      \FOR{$i \in \{1,\dots,\#DC\}$} % \COMMENT{For every clique}
        % \STATE ~\COMMENT{For sets of cliques exponentially further away from $i$}
        \FOR{$\textit{offset} \in \{ 2^x~\text{for}~x~\in \{ 0, \dots, \lceil \log_2(\#DC) \rceil \} \}$}
          % \STATE \COMMENT{Pick the $ns$ closest}
          \FOR{$k \in \{0,\dots,ns-1\}$}
            % \STATE \COMMENT{Add inter-clique connections in both directions}
            \STATE $n \leftarrow \textit{least\_edges}(L_i, E)$
            \STATE $m \leftarrow \textit{least\_edges}(L_{(i+\textit{offset}+k) \% \#DC}, E)$ %\COMMENT{clockwise in ring}
            \STATE $E \leftarrow E \cup \{ \{n,m\} \}$
            \STATE $n \leftarrow \textit{least\_edges}(L_i, E)$
            \STATE $m \leftarrow \textit{least\_edges}(L_{(i-\textit{offset}-k)\% \#DC} , E)$ %\COMMENT{counter-clockwise in ring}
            \STATE $E \leftarrow E \cup \{ \{n,m\} \}$
          \ENDFOR
        \ENDFOR
      \ENDFOR
      \RETURN E
   \end{algorithmic}
\end{algorithm}
Algorithm~\ref{Algorithm:Smallworld} expects a set of cliques $DC$, previously computed by
Algorithm~\ref{Algorithm:D-Clique-Construction}; a size of neighborhood $ns$,
......