% This is samplepaper.tex, a sample chapter demonstrating the
% LLNCS macro package for Springer Computer Science proceedings;
% Version 2.20 of 2017/10/04
%
\documentclass[runningheads]{llncs}
%
\usepackage[utf8]{inputenc}
\usepackage{amsmath}
\usepackage{amsfonts}
\usepackage{amssymb}
%\usepackage{amsthm}
\usepackage{graphicx}
\usepackage{xcolor}
\usepackage{soul}
\usepackage{hyperref}
\usepackage{algorithm}
\usepackage{algpseudocode}
\usepackage{dsfont}
\usepackage{caption}
\usepackage{subcaption}
% Used for displaying a sample figure. If possible, figure files should
% be included in EPS format.
%
% If you use the hyperref package, please uncomment the following line
% to display URLs in blue roman font according to Springer's eBook style:
% \renewcommand\UrlFont{\color{blue}\rmfamily}
\begin{document}
%
\title{D-Cliques: An Efficient Topology to Compensate for Non-IID Data in Decentralized Learning}
%
\titlerunning{D-Cliques}
% If the paper title is too long for the running head, you can set
% an abbreviated paper title here
%
\author{Aur\'elien Bellet\inst{1}\thanks{Authors in alphabetical order of last names, see Section 'Credits' for respective contributions.} \and
Anne-Marie Kermarrec\inst{2} \and
Erick Lavoie\inst{2}}
%
\authorrunning{A. Bellet, A-M. Kermarrec, E. Lavoie}
% First names are abbreviated in the running head.
% If there are more than two authors, 'et al.' is used.
%
\institute{Inria, Lille, France\\
\email{aurelien.bellet@inria.fr} \and
EPFL, Lausanne, Switzerland \\
\email{\{anne-marie.kermarrec,erick.lavoie\}@epfl.ch}\\
}
%
\maketitle % typeset the header of the contribution
%
\begin{abstract}
The abstract should briefly summarize the contents of the paper in
150--250 words.
\keywords{Decentralized Learning \and Federated Learning \and Topology \and Stochastic Gradient Descent}
\end{abstract}
%
%
%
\section{Introduction}
TODO: Short verbal introduction to D-PSGD
\begin{figure}
\centering
\begin{subfigure}[b]{0.3\textwidth}
\centering
\includegraphics[width=\textwidth]{figures/ring-IID-vs-non-IID}
\caption{\label{fig:ring-IID-vs-non-IID} Ring: (almost) minimal connectivity.}
\end{subfigure}
\hfill
\begin{subfigure}[b]{0.25\textwidth}
\centering
\includegraphics[width=\textwidth]{figures/grid-IID-vs-non-IID}
\caption{\label{fig:grid-IID-vs-non-IID} Grid: intermediate connectivity.}
\end{subfigure}
\hfill
\begin{subfigure}[b]{0.3\textwidth}
\centering
\includegraphics[width=\textwidth]{figures/fully-connected-IID-vs-non-IID}
\caption{\label{fig:fully-connected-IID-vs-non-IID} Fully-connected: maximal connectivity.}
\end{subfigure}
\caption{IID vs non-IID Convergence Speed. Thin lines are the minimum and maximum accuracy of individual nodes. Bold lines are the average accuracy across all nodes.\protect \footnotemark}
\label{fig:iid-vs-non-iid-problem}
\end{figure}
\footnotetext{This is different from the accuracy of the average model across nodes that is sometimes used once training is completed.}
\textit{Are there regular topologies, i.e. where all nodes have similar or the same number of neighbours, with fewer connections than a fully-connected graph, that retain a similar convergence speed and non-IID behaviour?}
\subsection{Bias in Gradient Averaging with Non-IID Data}
To build a preliminary intuition of the impact of non-IID data on convergence speed, consider the local neighbourhood of a single node in a grid similar to the one used to obtain the results of Figure~\ref{fig:grid-IID-vs-non-IID}, as illustrated in Figure~\ref{fig:grid-iid-vs-non-iid-neighbourhood}. Each node is represented as a circle whose color corresponds to one of the 10 classes in the dataset. In the IID setting (Figure~\ref{fig:grid-iid-neighbourhood}), each node has examples of all ten classes in equal proportions. In the (rather extreme) non-IID case (Figure~\ref{fig:grid-non-iid-neighbourhood}), each node has examples of only a single class and nodes are placed randomly in the grid; in neighbourhoods such as this one, nodes with examples of the same class may end up adjacent to each other.
\begin{figure}
\centering
\begin{subfigure}[b]{0.33\textwidth}
\centering
\includegraphics[width=\textwidth]{figures/grid-iid-neighbourhood}
\caption{\label{fig:grid-iid-neighbourhood} IID}
\end{subfigure}
\begin{subfigure}[b]{0.33\textwidth}
\centering
\includegraphics[width=\textwidth]{figures/grid-non-iid-neighbourhood}
\caption{\label{fig:grid-non-iid-neighbourhood} Non-IID}
\end{subfigure}
\caption{Neighbourhood in an IID and non-IID Grid.}
\label{fig:grid-iid-vs-non-iid-neighbourhood}
\end{figure}
For the sake of the argument, assume all nodes are initialized with the same model weights, which is not critical for quick convergence in an IID setting but makes the comparison easier. A single training step, from the point of view of the middle node of the illustrated neighbourhood, is equivalent to sampling a mini-batch five times larger from the union of the local distributions of the five illustrated nodes.
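To see why, assume for this example uniform mixing weights over the five illustrated nodes and a common starting point $x$. One decentralized step at the middle node then computes
\begin{equation*}
x' = \frac{1}{5}\sum_{j=1}^{5} \big(x - \gamma \nabla F(x; s_j)\big) = x - \gamma \cdot \frac{1}{5}\sum_{j=1}^{5} \nabla F(x; s_j),
\end{equation*}
where $\gamma$ is the learning rate, $F$ the loss and $s_j$ the mini-batch sampled by node $j$ (the notation is introduced formally in Section~\ref{section:problem}): exactly an SGD step whose gradient is estimated on the union of the five mini-batches.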
In the IID case, since gradients are computed from examples of all classes, the resulting average gradient points in a direction that lowers the loss for all classes. This is because the components of the gradient that would improve the loss only on a subset of the classes, to the detriment of others, are cancelled by similar but opposite components from the other classes; only the components that improve the loss for all classes remain. Some variance remains due to the differences between individual examples, but in practice its impact on convergence speed is small enough that parallelizing the computations is still beneficial.
However, in the (rather extreme) non-IID case illustrated, there are not enough nodes in the neighbourhood to remove the bias towards the classes represented. Even if all nodes start from the same model weights, they will diverge from one another according to the classes represented in their neighbourhood, more than they would in the IID case. Because the distributed averaging algorithm takes several steps to converge, this variance is never fully resolved and persists between steps.\footnote{It is possible, but impractical, to compensate for this effect by averaging multiple times before the next gradient computation. In effect, this trades connectivity (number of edges) for latency to obtain the same convergence speed, in number of gradients computed, as a fully-connected graph.} This additional variance biases subsequent gradient computations, as the gradients are computed further away from the global average, in addition to being computed from different examples. As shown in Figures~\ref{fig:ring-IID-vs-non-IID} and \ref{fig:grid-IID-vs-non-IID}, this significantly slows down convergence, to the point of making parallel optimization impractical.
\subsection{D-Cliques}
\begin{figure}[htbp]
\centering
\includegraphics[width=0.4\textwidth]{figures/fully-connected-cliques}
\caption{\label{fig:d-cliques-example} D-Cliques: Connected Cliques of Dissimilar Nodes, Locally Representative of the Global Distribution}
\end{figure}
If we relax the constraint of regularity, a trivial solution is a star topology, as used in most Federated Learning implementations (CITE), at the expense of high reliability and bandwidth requirements on the central node. We instead propose a regular topology built around \textit{cliques} of dissimilar nodes, locally representative of the global distribution and connected by few links, as illustrated in Figure~\ref{fig:d-cliques-example}. D-Cliques provide a convergence speed similar to that of a fully-connected topology while using far fewer edges ($O(nc + \frac{n^2}{c^2})$, where $n$ is the number of nodes and $c$ is the size of a clique\footnote{$O((\frac{n}{c})c^2 + (\frac{n}{c})^2)$, i.e. the number of cliques times the number of edges within a clique (quadratic in the clique size), plus the inter-clique edges (quadratic in the number of cliques).}), instead of the quadratic number of edges of a fully-connected graph ($O(n^2)$), with a corresponding reduction in bandwidth usage and in the number of messages required per round of training. In practice, for the networks of 100 nodes we have tested, this corresponds to a 90\% reduction in the number of edges. (TODO: Do analysis if the pattern is fractal with three levels at 1000 nodes: cliques, 10 cliques connected pairwise in a "region", and each "region" connected pairwise with other regions)
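As a concrete example, with $n=100$ nodes, cliques of size $c=10$ and a single inter-clique edge per pair of cliques (as in Figure~\ref{fig:d-cliques-example}), D-Cliques require
\begin{equation*}
\underbrace{\frac{n}{c}\binom{c}{2}}_{\text{intra-clique}} + \underbrace{\binom{n/c}{2}}_{\text{inter-clique}} = 10 \cdot 45 + 45 = 495 \text{ edges},
\end{equation*}
compared to $\binom{n}{2} = 4950$ for a fully-connected graph, i.e. a 90\% reduction.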
Because the data distribution within each clique is representative of the global distribution, we can recover optimization techniques that rely on an IID assumption even though the distributed setting is not IID. As one example, we show how momentum (CITE) can be used with D-Cliques to greatly improve the convergence speed of convolutional networks, as in a centralized IID setting, even though the technique is otherwise \textit{detrimental} in a more general non-IID setting.
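For reference, with classical heavy-ball momentum (one common variant; the exact form used in our experiments may differ), the local gradient step becomes
\begin{equation*}
v_i^{(k)} = m\, v_i^{(k-1)} + \nabla F(x_i^{(k-1)}; s_i^{(k)}), \qquad x_i^{(k-\frac{1}{2})} = x_i^{(k-1)} - \gamma\, v_i^{(k)},
\end{equation*}
where $m$ is the momentum coefficient and the remaining notation follows Algorithm~\ref{Algorithm:D-PSGD}.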
In summary, we make the following contributions:
\begin{itemize}
\item we propose the D-Cliques topology, which removes the impact of non-IID data on convergence speed, similarly to a fully-connected topology, while requiring significantly fewer edges and messages;
\item we show how to leverage D-Cliques to implement momentum in a distributed non-IID setting, where it would otherwise be detrimental to the convergence speed of convolutional networks.
\end{itemize}
The rest of this paper is organized as follows. \dots
\section{Related Work}
D2: numerically unstable when the rows and columns of $W$ do not exactly sum to $1$, as the small differences are amplified in a positive feedback loop. More work is therefore required on the algorithm to make it usable with a wider variety of topologies. In comparison, D-Cliques do not modify the SGD algorithm and instead simply remove some neighbour contributions that would otherwise bias the direction of the gradient. D-Cliques with D-PSGD are therefore as tolerant to ill-conditioned $W$ matrices as regular D-PSGD in an IID setting.
\section{Problem Statement}
\label{section:problem}
A set of $n$ nodes $N = \{1, \dots, n \}$ communicates with their neighbours defined by the mixing matrix $W$ in which $W_{ij}$ defines the \textit{weight} of the outgoing connection from node $i$ to $j$. $W_{ij} = 0$ means that there is no connection from node $i$ to $j$ and $W_{ij} > 0$ means there is a connection.
Training data is sampled from a global distribution $D$ unknown to the nodes. Each node has access to an arbitrary partition of the samples that follows its own local distribution $D_i$. Nodes cooperate to reach consensus on a global model $M$ that performs well on $D$ by minimizing the average training loss on local models:
\begin{equation}
\min_{x_i, i = 1, \dots, n} \frac{1}{n}\sum_{i=1}^{n} \mathds{E}_{s_i \sim D_i} F_i(x_i;s_i)
\label{eq:dist-optimization-problem}
\end{equation}
such that $M= x_i = x_j, \forall i,j \in N$, where $x_i$ are the parameters of
node $i$'s local model, $s_i$ is a sample of $D_i$, $F_i$ is the loss function
on node $i$, and $\mathds{E}_{s_i \sim D_i} F_i(x_i;s_i)$ denotes the
expected value of $F_i$ on a random sample $s_i$ drawn from $D_i$.
\subsection{Non-IID Data}
Removing the assumption of \textit{independent and identically distributed} (IID) data opens a wide range of potential practical difficulties. While non-IID simply means that a local dataset is a biased sample of the global distribution $D$, the difficulty of the learning problem depends on additional factors that compound with that bias. For example, an imbalance in the number of examples per class in the global distribution compounds with the position, in the topology, of the nodes holding examples of the rarest classes. Additionally, if two local datasets have different numbers of examples, the examples in the smaller dataset will be visited more often than those in the larger one, potentially skewing the optimization process to perform better on the examples seen more often.
To focus our study while still retaining the core aspects of the problem, we make the following assumptions: (1) all classes are equally represented in the global dataset, by randomly removing examples from the larger classes if necessary; (2) all classes are represented on the same number of nodes; (3) all nodes have the same number of examples. Within these assumptions, we consider the hardest possible case, in which each node has examples of only a single class. For the following experiments, we use the MNIST (CITE) and CIFAR10 (CITE) datasets.
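As an illustration (not necessarily the exact procedure used in our experiments; the function name is ours), the following sketch realizes this partition:
\begin{verbatim}
import numpy as np

def single_class_partition(labels, n_nodes, seed=0):
    """Assign each node examples of a single class, with the same number
    of examples per node; surplus examples are dropped (assumptions 1-3).
    Returns a list of index arrays, one per node."""
    rng = np.random.default_rng(seed)
    classes = np.unique(labels)
    assert n_nodes % len(classes) == 0, "same number of nodes per class"
    nodes_per_class = n_nodes // len(classes)
    per_node = min(np.sum(labels == c) for c in classes) // nodes_per_class
    partition = []
    for c in classes:
        idx = rng.permutation(np.where(labels == c)[0])
        for k in range(nodes_per_class):
            partition.append(idx[k * per_node:(k + 1) * per_node])
    return partition
\end{verbatim}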
\subsection{Learning Algorithm}
We use Decentralized Parallel Stochastic Gradient Descent (D-PSGD)~\cite{lian2017d-psgd}, illustrated in Algorithm~\ref{Algorithm:D-PSGD}. A single step consists of sampling the local distribution $D_i$, computing and applying a stochastic gradient descent (SGD) step with respect to that sample, and averaging the resulting model with those of the node's neighbours. Both outgoing and incoming weights of $W$ must sum to 1, i.e. $W$ is doubly stochastic ($\sum_{j \in N} W_{ij} = 1$ and $\sum_{j \in N} W_{ji} = 1$), and communication is symmetric, i.e. $W_{ij} = W_{ji}$.
\begin{algorithm}[h]
\caption{D-PSGD, Node $i$}
\label{Algorithm:D-PSGD}
\begin{algorithmic}[1]
\State \textbf{Require} initial model parameters $x_i^{(0)}$, learning rate $\gamma$, mixing weights $W$, number of steps $K$, loss function $F$
\For{$k = 1,\ldots, K$}
\State $s_i^{(k)} \gets \textit{sample from~} D_i$
\State $x_i^{(k-\frac{1}{2})} \gets x_i^{(k-1)} - \gamma \nabla F(x_i^{(k-1)}; s_i^{(k)})$
\State $x_i^{(k)} \gets \sum_{j \in N} W_{ji}^{(k)} x_j^{(k-\frac{1}{2})}$
\EndFor
\end{algorithmic}
\end{algorithm}
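To make the update rule concrete, the following is a minimal NumPy sketch of one D-PSGD round (gradient step followed by neighbourhood averaging). It is an illustration only: the helper names are ours, the model is a least-squares linear model rather than the classifier used in our experiments, and the Metropolis-Hastings weights shown are one standard way of obtaining a symmetric, doubly stochastic $W$ from an arbitrary undirected topology.
\begin{verbatim}
import numpy as np

def metropolis_hastings_weights(adjacency):
    """Symmetric, doubly stochastic mixing matrix W built from an
    undirected adjacency matrix using Metropolis-Hastings weights."""
    n = adjacency.shape[0]
    degrees = adjacency.sum(axis=1)
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j and adjacency[i, j]:
                W[i, j] = 1.0 / (1 + max(degrees[i], degrees[j]))
        W[i, i] = 1.0 - W[i].sum()   # remaining mass stays on the node itself
    return W

def d_psgd_step(models, W, data, labels, gamma=0.1):
    """One D-PSGD round: local SGD step on a least-squares linear model,
    then averaging with the neighbours (the two update lines of Algorithm 1)."""
    n = len(models)
    halfway = []
    for i in range(n):
        X, y = data[i], labels[i]                  # local mini-batch of node i
        grad = X.T @ (X @ models[i] - y) / len(y)  # gradient of (1/2m)||Xw - y||^2
        halfway.append(models[i] - gamma * grad)   # local SGD step
    # mixing step: x_i <- sum_j W_ji x_j
    return [sum(W[j, i] * halfway[j] for j in range(n)) for i in range(n)]
\end{verbatim}
For instance, a fully-connected graph over $n$ nodes corresponds to \texttt{adjacency = np.ones((n, n)) - np.eye(n)}.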
D-PSGD can be used with a variety of models, including deep learning networks. To remove the impact of particular architectural choices on our results, we use a linear classifier (CITE). This model provides up to 92.5\% accuracy when fully converged on MNIST (CITE), about 7\% less than state-of-the-art deep learning networks (CITE).
%\subsection{Clustering}
%
%From the perspective of one node, \textit{clustering} intuitively represents how many connections exist between its immediate neighbours. A high level of clustering means that neighbours have many edges between each other. The highest level is a \textit{clique}, where all nodes in the neighbourhood are connected to one another. Formally, the level of clustering, between $0$ and $1$, is the ratio of $\frac{\textit{nb edges between neighbours}}{\textit{nb possible edges}}$~\cite{watts2000small}.
%
\section{D-Cliques}
Three Main ideas:
\begin{itemize}
\item Create cliques such that the clique distribution is representative of the global distribution
\item Connect cliques, based on the level of ``redundancy'' in the datasets
\item Decouple gradient averaging from weight averaging
\end{itemize}
\subsection{Creating Representative Cliques}
The degree of \textit{skew} of the local distributions $D_i$, i.e. how much each local distribution deviates from the global distribution, influences the minimal size of cliques.
The global distribution of classes, for classification tasks, can be computed from the distribution of class examples on the nodes, using Distributed Averaging (CITE). Given the global distribution of classes, neighbours within cliques can be chosen using a PeerSampling (CITE) service. Both services can be implemented such that they converge in a number of steps logarithmic in the number of nodes; it is therefore possible to obtain this information in a scalable way.
In the rest of this paper, we assume these services are available and show that the approach provides a useful convergence speed after the cliques have been formed.
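As an illustration, the following centralized sketch builds such cliques under the single-class-per-node setting of Section~\ref{section:problem}; in our setting the assignment would instead be computed in a decentralized way using the services above, and the function name is ours.
\begin{verbatim}
import random
from collections import defaultdict

def build_representative_cliques(node_classes, clique_size=10, seed=1):
    """Group nodes into cliques holding one node per class, assuming one
    class per node and the same number of nodes per class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for node, cls in node_classes.items():
        by_class[cls].append(node)
    for nodes in by_class.values():
        rng.shuffle(nodes)
    classes = sorted(by_class)
    assert len(classes) == clique_size, "clique size must equal the class count here"
    # clique k takes the k-th remaining node of every class
    return [[by_class[c][k] for c in classes]
            for k in range(len(node_classes) // clique_size)]
\end{verbatim}
For instance, with 100 nodes holding one class each out of 10 classes, this yields 10 cliques, each containing exactly one node per class.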
\subsection{Connecting Cliques}
The \textit{"redundancy"} of the data, i.e. how much each additional example in the training set contributes to the final accuracy, influences the minimum number of connections required between cliques to reach a given convergence speed. It needs to be evaluated empirically on a learning task. In effect, redundancy is the best parallelization factor as the more redundant the dataset is, the less nodes need to communicate. For the following arguments, $n$ is the number of nodes and $c$ is the size of a clique.
For highly redundant datasets, it may be sufficient to arrange cliques in a ring. This is not specific to D-Cliques: the same holds for IID nodes, but it is nonetheless worth keeping in mind for D-Cliques as well. In this case, the number of edges is $O(nc + \frac{n}{c})$ and therefore linear in the number of nodes $n$.
For cases with limited redundancy, nodes can be arranged such that they are at most 2 hops away from any other node in the network, to quickly propagate updates. In effect, this is equivalent to fully connecting cliques (instead of nodes). In this case, the number of edges is $O(nc + \frac{n^2}{c^2})$: still quadratic in the number of nodes, but with a strong reduction in the number of edges when $c$ is large relative to $n$ (e.g. $c \geq \frac{n}{100}$).
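As an illustration of the fully-connected-cliques case (the function name is ours), the topology can be materialized as an adjacency matrix as follows; combined with the previous sketches, \texttt{metropolis\_hastings\_weights(d\_cliques\_adjacency(cliques))} then yields the corresponding mixing matrix $W$.
\begin{verbatim}
import numpy as np

def d_cliques_adjacency(cliques):
    """Adjacency matrix for fully-connected cliques: all pairs of nodes
    within a clique are connected, and every pair of cliques is connected
    by a single edge. Nodes are assumed to be numbered 0..n-1. For
    simplicity, inter-clique edges all attach to the first member of each
    clique; in practice they would be spread across members to keep node
    degrees similar."""
    n = sum(len(c) for c in cliques)
    A = np.zeros((n, n), dtype=int)
    for clique in cliques:
        for a in clique:
            for b in clique:
                if a != b:
                    A[a, b] = 1
    for i, ci in enumerate(cliques):
        for cj in cliques[i + 1:]:
            a, b = ci[0], cj[0]
            A[a, b] = A[b, a] = 1
    return A
\end{verbatim}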
In between, there might be enough redundancy in the dataset to arrange cliques in a fractal/hierarchical pattern such that the maximum number of hops between nodes grows logarithmically with $n$. TODO: Complexity argument.
\subsection{Decoupling Gradient Averaging from Weight Averaging}
Inter-clique connections create sources of bias in regular D-PSGD with Metropolis-Hastings weights (CITE):
\begin{itemize}
\item Non-uniform weights in the neighbourhoods of nodes not connected to other cliques
\item Non-uniform class representation in the neighbourhoods of nodes connected to other cliques
\end{itemize}
TODO: Figure illustrating problem
We solve this problem by decoupling gradient averaging from weight averaging, sending each in separate rounds of messages.
TODO: New (minor) algorithm version of D-PSGD
\section{Applications}
\subsection{MNIST and Linear Model}
\begin{figure}[htbp]
\centering
\includegraphics[width=0.7\textwidth]{figures/10-cliques-validation-accuracy}
\caption{\label{fig:d-cliques-mnist-linear} D-Cliques with Linear Model on MNIST.}
\end{figure}
TODO: Update figure to use decoupled gradient averaging (will probably reduce variance and accelerate convergence speed)
\subsection{CIFAR10 and Convolutional Model}
\begin{figure}[htbp]
\centering
\begin{subfigure}[b]{0.48\textwidth}
\centering
\includegraphics[width=\textwidth]{figures/d-cliques-cifar10-vs-1-node-training-loss}
\caption{\label{fig:d-cliques-cifar10-training-loss} Training Loss}
\end{subfigure}
\begin{subfigure}[b]{0.48\textwidth}
\centering
\includegraphics[width=\textwidth]{figures/d-cliques-cifar10-vs-1-node-validation-accuracy}
\caption{\label{fig:d-cliques-cifar10-validation-accuracy} Validation Accuracy}
\end{subfigure}
\caption{\label{fig:d-cliques-cifar10-convolutional} D-Cliques with Convolutional Network on CIFAR10.}
\end{figure}
\section{Evaluation}
\subsection{Effect of Scaling}
\subsection{Comparison to Similar Topologies}
\begin{itemize}
\item Uniform Diverse Neighbourhood with No Clustering
\item Random network
\item Random Small-World Graph
\end{itemize}
\subsection{Relaxing Clique Connectivity}
\section{Conclusion}
\section{Credits}
%
% ---- Bibliography ----
%
% BibTeX users should specify bibliography style 'splncs04'.
% References will then be sorted and formatted in the correct style.
%
\bibliographystyle{splncs04}
\bibliography{main}
\end{document}