\documentclass[runningheads]{llncs}
%
\usepackage[utf8]{inputenc}
\usepackage{amsmath}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{graphicx}
\usepackage{xcolor}
\usepackage{soul}
\usepackage{hyperref}
\usepackage{algorithm}
\usepackage[noend]{algpseudocode}
\usepackage{dsfont}
\usepackage{caption}
\usepackage{subcaption}

% Used for displaying a sample figure. If possible, figure files should
% be included in EPS format.
%
% If you use the hyperref package, please uncomment the following line
% to display URLs in blue roman font according to Springer's eBook style:
% \renewcommand\UrlFont{\color{blue}\rmfamily}

\begin{document}
\title{D-Cliques: Compensating NonIIDness in Decentralized Federated Learning
with Topology}
%
\titlerunning{D-Cliques}
% If the paper title is too long for the running head, you can set
% an abbreviated paper title here
%
\author{Aur\'elien Bellet\inst{1} \and
Anne-Marie Kermarrec\inst{2} \and
Erick Lavoie\inst{2}\thanks{Authors in alphabetical order of last names.}}
\authorrunning{A. Bellet, A-M. Kermarrec, E. Lavoie}
%% First names are abbreviated in the running head.
%% If there are more than two authors, 'et al.' is used.
%%
\institute{Inria, Lille, France\\
\email{aurelien.bellet@inria.fr} \and
EPFL, Lausanne, Switzerland \\
\email{\{anne-marie.kermarrec,erick.lavoie\}@epfl.ch}\\
}
%
\maketitle              % typeset the header of the contribution
%
\begin{abstract}
The convergence speed of machine learning models trained with Federated
Learning is significantly affected by non-independent and identically
distributed (non-IID) data partitions, even more so in a fully decentralized
setting without a central server. In this paper, we show that the impact of
\textit{local class bias}, an important type of data non-IIDness, can be
significantly reduced by carefully designing
the underlying communication topology. We present D-Cliques, a novel topology
that reduces gradient bias by grouping nodes in interconnected cliques such
that the local joint distribution in a clique is representative of the global
class distribution. We also show how to adapt the updates of decentralized SGD
to obtain unbiased gradients and implement an effective momentum with
D-Cliques. Our empirical evaluation on MNIST and CIFAR10 demonstrates that our approach
provides a convergence speed similar to that of a fully-connected topology with a
significant reduction in the number of edges and messages. In a 1000-node
topology, D-Cliques requires 98\% fewer edges and 96\% fewer total messages,
with further possible gains using a small-world topology across cliques.

\keywords{Decentralized Learning \and Federated Learning \and Topology \and
Non-IID Data \and Stochastic Gradient Descent}
\end{abstract}
%
%
%
\section{Introduction}

Machine learning is currently shifting from a \emph{centralized}
paradigm, in which models are trained on data located on a single machine or
in a data center, to \emph{decentralized} ones.
Indeed, the latter paradigm closely matches the natural data distribution
in the numerous use-cases where data is collected and processed by several
independent
parties (hospitals, companies, personal devices...).
Federated Learning (FL) allows a set
of participants to collaboratively train machine learning models
on their data while keeping it where it has been produced. Not only does this avoid
the costs of moving data, but it also mitigates privacy and confidentiality concerns~\cite{kairouz2019advances}.
Yet, working with natural data distributions introduces new challenges for
learning systems, as
local datasets
reflect the usage and production patterns specific to each participant: they are
\emph{not} independent and identically distributed
(non-IID). More specifically, the relative frequency of different classes of examples may significantly vary
across local datasets~\cite{kairouz2019advances,quagmire}.
Therefore, one of the key challenges in FL is to design algorithms that
can efficiently deal with such non-IID data distributions~\cite{kairouz2019advances,fedprox,scaffold,quagmire}.

Federated learning algorithms can be classified into two categories depending
on the underlying network topology they run on. In server-based FL, the
network is organized according to a star topology: a central server orchestrates the training process by
iteratively aggregating model updates received from the participants
(\emph{clients}) and sending back the aggregated model \cite{mcmahan2016communication}. In contrast,
fully decentralized FL algorithms operate over an arbitrary network topology
where participants communicate only with their direct neighbors
in the network. A classic example of such algorithms is Decentralized
SGD (D-SGD) \cite{lian2017d-psgd}, in which participants alternate between
local SGD updates and model averaging with neighboring nodes.

In this paper, we focus on fully decentralized algorithms as they can
generally scale better to the large number of participants seen in ``cross-device''
applications~\cite{kairouz2019advances}. Indeed, while a central
server may quickly become a bottleneck as the number of participants increases, the topology used in fully decentralized algorithms can remain sparse
enough that each participant only needs to communicate with a small number of other participants, i.e. nodes have small (constant or logarithmic) degree~\cite{lian2017d-psgd}. For IID data, recent work has shown both empirically~\cite{lian2017d-psgd,Lian2018} and theoretically~\cite{neglia2020} that sparse
topologies like rings or grids do not significantly affect the convergence
speed compared to using denser topologies.
In contrast to the IID case, however, our experiments demonstrate that \emph{the impact of topology is extremely significant for non-IID data}. This phenomenon is illustrated
in Figure~\ref{fig:iid-vs-non-iid-problem}: we observe that a ring or
a grid topology clearly jeopardizes the convergence speed as local
distributions do not have relative class frequencies similar to those of the global
distribution, i.e. they exhibit \textit{local class bias}. We stress the fact
that, unlike in centralized FL~\cite{kairouz2019advances,scaffold,quagmire}, this
happens even when nodes perform a single local update before averaging the
model with their neighbors. In this paper, we address the following question:

\textit{Can we design sparse topologies with  convergence
  speed similar to the one obtained in a  fully connected network under
  a large number of participants with local class bias?}

\begin{figure}[t]
     \centering
     
     % From directory results/mnist
     % python ../../../../Software/non-iid-topology-simulator/tools/plot_convergence.py ring/iid/all/2021-03-30-16:07:06-CEST ring/non-iid/all/2021-03-30-16:07:03-CEST --add-min-max --legend 'lower right' --yaxis test-accuracy --labels '100 nodes IID' '100 nodes non-IID' --save-figure ../../figures/ring-IID-vs-non-IID.png --font-size 20 --linestyles 'solid' 'dashed'
     \begin{subfigure}[b]{0.31\textwidth}
         \centering
         \includegraphics[width=\textwidth]{figures/ring-IID-vs-non-IID}
\caption{\label{fig:ring-IID-vs-non-IID} Ring}
     \end{subfigure}
     \quad
    % From directory results/mnist
     % python ../../../../Software/non-iid-topology-simulator/tools/plot_convergence.py grid/iid/all/2021-03-30-16:07:01-CEST grid/non-iid/all/2021-03-30-16:06:59-CEST --add-min-max --legend 'lower right' --yaxis test-accuracy --labels '100 nodes IID' '100 nodes non-IID' --save-figure ../../figures/grid-IID-vs-non-IID.png --font-size 20 --linestyles 'solid' 'dashed'
     \begin{subfigure}[b]{0.31\textwidth}
         \centering
         \includegraphics[width=\textwidth]{figures/grid-IID-vs-non-IID}
\caption{\label{fig:grid-IID-vs-non-IID} Grid}
     \end{subfigure}
     \quad
         % From directory results/mnist
     % python ../../../../Software/non-iid-topology-simulator/tools/plot_convergence.py fully-connected/iid/all/2021-03-30-16:07:20-CEST fully-connected/all/2021-03-10-09:25:19-CET  --add-min-max --legend 'lower right' --yaxis test-accuracy --labels '100 nodes IID' '100 nodes non-IID' --save-figure ../../figures/fully-connected-IID-vs-non-IID.png --font-size 20 --linestyles 'solid' 'dashed'
     \begin{subfigure}[b]{0.31\textwidth}
         \centering
         \includegraphics[width=\textwidth]{figures/fully-connected-IID-vs-non-IID}
\caption{\label{fig:fully-connected-IID-vs-non-IID} Fully-connected}
     \end{subfigure}
        \caption{IID vs non-IID convergence speed of decentralized SGD for
        logistic regression on
        MNIST for different topologies. Bold lines show the average test
        accuracy across nodes
        while thin lines show the minimum
        and maximum accuracy of individual nodes. While the effect of topology
        is negligible for IID data, it is very significant in the
        non-IID case. When fully-connected, both cases converge similarly. See
        Section~\ref{section:experimental-settings} for details on
        the experimental setup.}
        \label{fig:iid-vs-non-iid-problem}
\end{figure}

Specifically, we make the following contributions:
(1) We propose D-Cliques, a sparse topology in which nodes are organized in
interconnected cliques, i.e. locally fully-connected sets of nodes, such that
the joint data distribution of each clique is representative of the global
(IID) distribution; (2) We propose Clique Averaging, a modified version of
the standard D-SGD algorithm which decouples gradient averaging, used for
optimizing local models, from distributed averaging, used to ensure all models
converge, thereby reducing the bias introduced by inter-clique connections;
(3) We show how Clique Averaging can be used to implement unbiased momentum
that would otherwise be detrimental in the non-IID setting; (4) We
demonstrate
through an extensive experimental study that our approach removes the effect
of the local class bias on the MNIST~\cite{mnistWebsite} and CIFAR10~\cite{krizhevsky2009learning} datasets, for training a linear model and a deep
convolutional network; (5) Finally, we demonstrate the scalability of our
approach by considering up to 1000-node networks, in contrast to most
previous work on fully decentralized learning that considers only a few tens
of nodes~\cite{tang18a,neglia2020,momentum_noniid,cross_gradient,consensus_distance}.

For instance, our results show that using D-Cliques in a 1000-node network
requires 98\% fewer edges ($18.9$ vs $999$ edges per participant on average),
thereby yielding a 96\% reduction in the total number of required messages
($37.8$ messages per round per node on average instead of $999$), to obtain a convergence speed similar to that of a fully-connected topology. Furthermore, an additional 22\% improvement
% (14.5 edges per node on average instead of 18.9)
is possible when using a small-world inter-clique topology, with further potential gains at larger scales because of its quasilinear scaling ($O(n \log(n))$) in $n$, the number of nodes.

The rest of this paper is organized as follows. We first present the problem
statement and our methodology (Section~\ref{section:problem}). The D-Cliques
design is presented in Section~\ref{section:d-cliques} along with an
empirical illustration of its benefits. In
Section~\ref{section:clique-averaging-momentum}, we
show how to further reduce bias with Clique Averaging and how to use it to
implement momentum. We present the results of our extensive experimental
study in Section~\ref{section:non-clustered}. We review some related work in
Section~\ref{section:related-work}, and conclude with promising directions
for future work in Section~\ref{section:conclusion}.

\section{Problem Statement}

\label{section:problem}

We consider a set $N = \{1, \dots, n \}$ of $n$ nodes seeking to
collaboratively solve a classification task with $c$ classes. Each node $i$ has access to a local dataset that
follows its own local distribution $D_i$. The goal is to find a global model
$x$ that performs well on the union of the local distributions by minimizing
the average training loss:
\begin{equation}
\min_{x} \frac{1}{n}\sum_{i=1}^{n} \mathds{E}_{s_i \sim D_i} [F_i(x;s_i)],
\label{eq:dist-optimization-problem}
\end{equation}
where $s_i$ is a data example drawn from $D_i$ and $F_i$ is the loss function
on node $i$. Therefore, $\mathds{E}_{s_i \sim D_i} [F_i(x;s_i)]$ denotes the
expected loss of model $x$ on a random example $s_i$ drawn from $D_i$.

To collaboratively solve Problem \eqref{eq:dist-optimization-problem}, each
node can exchange messages with its neighbors in an undirected network graph
$G(N,E)$ where $\{i,j\}\in E$ denotes an edge (communication channel)
between nodes $i$ and $j$.

\subsection{Training Algorithm}

In this work, we use the popular Decentralized Stochastic
Gradient Descent algorithm, also known as D-SGD~\cite{lian2017d-psgd}. As
shown in Algorithm~\ref{Algorithm:D-PSGD},
a single iteration of D-SGD at node $i$ consists of sampling a mini-batch
from its local distribution
$D_i$, updating its local model $x_i$ by taking a stochastic gradient descent
(SGD) step according to the mini-batch, and performing a weighted average of
its local model with those of its
neighbors.
This weighted average is defined by a
mixing matrix $W$, in which $W_{ij}$ corresponds to the weight of
the outgoing connection from node $i$ to $j$ and $W_{ij} = 0$ for $\{i,j\}\notin E$. To ensure that the local models converge on average to a stationary
point of Problem~\eqref{eq:dist-optimization-problem}, $W$
must be doubly
stochastic ($\sum_{j \in N} W_{ij} = 1$ and $\sum_{j \in N} W_{ji} = 1$) and
symmetric, i.e. $W_{ij} = W_{ji}$~\cite{lian2017d-psgd}.

\begin{algorithm}[t]
   \caption{D-SGD, Node $i$}
   \label{Algorithm:D-PSGD}
   \begin{algorithmic}[1]
        \State \textbf{Require:} initial model parameters $x_i^{(0)}$,
        learning rate $\gamma$, mixing weights $W$, mini-batch size $m$,
        number of steps $K$
        \For{$k = 1,\ldots, K$}
          \State $s_i^{(k)} \gets \text{mini-batch sample of size $m$ drawn
          from~} D_i$
          \State $x_i^{(k-\frac{1}{2})} \gets x_i^{(k-1)} - \gamma \nabla F_i(x_i^{(k-1)}; s_i^{(k)})$
          \State $x_i^{(k)} \gets \sum_{j \in N} W_{ji}^{(k)} x_j^{(k-\frac{1}{2})}$
        \EndFor
   \end{algorithmic}
\end{algorithm}
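A short derivation, which is not part of the original analysis, may help clarify why the row and column sums of $W$ matter. Write $\bar{x}^{(k)} = \frac{1}{n}\sum_{i \in N} x_i^{(k)}$ for the average model (a notation introduced here only for illustration) and assume that all nodes execute Algorithm~\ref{Algorithm:D-PSGD} synchronously. Since $\sum_{i \in N} W_{ji}^{(k)} = 1$ for every $j$, the averaging step leaves the average model unchanged:
\begin{align*}
\bar{x}^{(k)}
  &= \frac{1}{n}\sum_{i \in N}\sum_{j \in N} W_{ji}^{(k)} x_j^{(k-\frac{1}{2})}
   = \frac{1}{n}\sum_{j \in N}\Big(\sum_{i \in N} W_{ji}^{(k)}\Big) x_j^{(k-\frac{1}{2})}
   = \frac{1}{n}\sum_{j \in N} x_j^{(k-\frac{1}{2})} \\
  &= \bar{x}^{(k-1)} - \frac{\gamma}{n}\sum_{j \in N} \nabla F_j(x_j^{(k-1)}; s_j^{(k)}).
\end{align*}
Under these assumptions, the average model follows an SGD-like step on the objective of Problem~\eqref{eq:dist-optimization-problem}, except that each gradient is evaluated at the local model $x_j^{(k-1)}$ rather than at $\bar{x}^{(k-1)}$. The closer the local models stay to their average, the closer this is to a centralized SGD step.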

\subsection{Methodology}

\subsubsection{Non-IID assumptions.}
\label{section:non-iid-assumptions}
As demonstrated in Figure~\ref{fig:iid-vs-non-iid-problem}, lifting the
assumption of IID data significantly challenges the learning algorithm. In
this paper, we focus on an \textit{extreme case of local class bias}: we
consider that each node only has examples from a single class.
To isolate the effect of local class bias from other potentially compounding
factors, we make the following simplifying assumptions: (1) All classes are
equally represented in the global dataset; (2) All classes are represented on
the same number of nodes; (3) All nodes have the same number of examples.

We believe that these assumptions are reasonable in the context of our study
because: (1)
Global class
imbalance equally
affects the optimization process on a single node and is therefore not
specific to the decentralized setting; (2) Our results do not exploit specific
positions in the topology;  (3) Imbalanced dataset sizes across nodes can be
addressed for instance by appropriately weighting the individual loss
functions. Our results can be extended to support additional compounding factors in future work.

\subsubsection{Experimental setup.}
\label{section:experimental-settings}

Our main goal is to provide a fair comparison of the convergence speed across
different topologies and algorithmic variations, in order to
show that our approach
can remove much of the effect of local class bias.

We experiment with two datasets: MNIST~\cite{mnistWebsite} and
CIFAR10~\cite{krizhevsky2009learning}, which both have $c=10$ classes.
For MNIST, we use 45k and 10k examples from the original 60k
training set for training and validation respectively. The remaining 5k
training examples were randomly removed to ensure all 10 classes are balanced
while ensuring that the dataset is evenly divisible across 100 and 1000 nodes.
We use all 10k examples of
the test set to measure prediction accuracy. For CIFAR10, classes are evenly
balanced: we use 45k/50k images of the original training set for training,
5k/50k for validation, and all 10k examples of the test set for measuring
prediction accuracy.

We
use a logistic regression classifier for MNIST, which
provides up to 92.5\% accuracy in the centralized setting.
For CIFAR10, we use a Group-Normalized variant of LeNet~\cite{quagmire}, a
deep convolutional network which achieves an accuracy of $72.3\%$ in the
centralized setting.
These models are thus reasonably accurate (which is sufficient to
study the effect of the topology) while being sufficiently fast to train in a
fully decentralized setting and simple enough to configure and analyze.
Regarding hyper-parameters, we jointly optimize the learning rate and
mini-batch size on the
validation set for 100 nodes, obtaining respectively $0.1$ and $128$ for
MNIST and $0.002$ and $20$ for CIFAR10.
For CIFAR10, we additionally use a momentum of $0.9$.

We evaluate 100- and 1000-node networks by creating multiple models in memory and simulating the exchange of messages between nodes.
To ignore the impact of distributed execution strategies and system
optimization techniques, we report the test accuracy of all nodes (min, max,
average) as a function of the number of times each example of the dataset has
been sampled by a node, i.e. an \textit{epoch}. This is equivalent to the classic case of a single node sampling the full distribution.
To further make results comparable across different numbers of nodes, we lower
the batch size proportionally to the number of nodes added, and vice versa,
e.g. on MNIST, 128 with 100 nodes vs. 13 with 1000 nodes. This
ensures the same number of model updates and averaging steps per epoch, which is
important to have a fair comparison.\footnote{Updating and averaging models
after every example can eliminate the impact of local class bias. However, the
resulting communication overhead is impractical.}
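
As a rough check of this scaling, using only the figures given above (45k MNIST training examples split evenly across nodes) and the shorthand $u$, introduced here for illustration, for the number of local updates per node and per epoch:
\begin{align*}
n = 100: &\quad u = \frac{45000/100}{128} = \frac{450}{128} \approx 3.5,\\
n = 1000: &\quad u = \frac{45000/1000}{13} = \frac{45}{13} \approx 3.5.
\end{align*}
The batch size of 13 is consistent with $128/10 = 12.8$ rounded to the nearest integer, so the match between the two settings is approximate rather than exact.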

Finally, we compare our results against an ideal baseline: either a
fully-connected network topology with the same number of nodes or a single IID
node. In both cases, the topology has no effect on
the optimization. For a certain choice of number of nodes and
mini-batch size, both approaches are equivalent. 
\section{D-Cliques: Creating Locally Representative Cliques}
\label{section:d-cliques}
In this section, we present the design of D-Cliques. To give an intuition of our approach, let us consider the neighborhood of a single node in a grid similar to that of Figure~\ref{fig:grid-IID-vs-non-IID}, shown in Figure~\ref{fig:grid-iid-vs-non-iid-neighbourhood}.
The colors of a node represent the different classes present in its local
dataset. In the IID setting (Figure~\ref{fig:grid-iid-neighbourhood}), each
node has examples of all classes in equal proportions. In the non-IID setting
(Figure~\ref{fig:grid-non-iid-neighbourhood}), each node has examples of only
a single class and nodes are distributed randomly in the grid.

A single training step, from the point of view of the center node, is equivalent to sampling a mini-batch five times larger from the union of the local distributions of all illustrated nodes.
In the IID case, since gradients are computed from examples of all classes,
the resulting averaged gradient points in a direction that tends to reduce
the loss across all classes. In contrast, in the non-IID case, only a subset
of classes is
represented in the immediate neighborhood of the node, thus the gradients will
be biased towards these classes.
Importantly, as the distributed averaging algorithm takes several steps to
converge, this variance persists across iterations as the locally computed
gradients are far from the global average.\footnote{It is possible, but
very costly, to mitigate this by performing a sufficiently large number of
averaging steps between each gradient step.} This can significantly slow down
convergence to the point of making decentralized optimization
impractical.
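
To make the ``five times larger'' mini-batch concrete, consider, as an illustrative assumption, a center node $i$ whose four grid neighbors also have degree 4, with uniform averaging weights of $\frac{1}{5}$ (the weights actually used in the grid experiments may differ). Writing $N_i$ for the set made of $i$ and its four neighbors (a notation used only here), one iteration of Algorithm~\ref{Algorithm:D-PSGD} gives
\begin{equation*}
x_i^{(k)} = \frac{1}{5}\sum_{j \in N_i} x_j^{(k-1)} - \frac{\gamma}{5}\sum_{j \in N_i} \nabla F_j(x_j^{(k-1)}; s_j^{(k)}).
\end{equation*}
The second term averages five mini-batch gradients, one per node in $N_i$. In the extreme non-IID case studied in this paper, where each node holds examples of a single class, these five gradients cover at most 5 of the $c=10$ classes, whatever the mini-batch size.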
\begin{figure}[t]
     \begin{subfigure}[b]{0.25\textwidth}
         \centering
         \includegraphics[width=\textwidth]{figures/grid-iid-neighbourhood}
\caption{\label{fig:grid-iid-neighbourhood} IID}
     \end{subfigure}
     \begin{subfigure}[b]{0.25\textwidth}
         \centering
         \includegraphics[width=\textwidth]{figures/grid-non-iid-neighbourhood}
\caption{\label{fig:grid-non-iid-neighbourhood}  Non-IID}
     \end{subfigure}
        \caption{Neighborhood in an IID and non-IID grid.}
        \label{fig:grid-iid-vs-non-iid-neighbourhood}
\end{figure}

In D-Cliques, we address the issues of non-IIDness by carefully designing a
network topology composed of \textit{cliques} and \textit{inter-clique
connections}:
\begin{itemize}
 \item D-Cliques recover a balanced representation of classes, similar to
 that of the IID case, by constructing a topology such that each node is
 part of a \textit{clique} with neighbors representing all classes.
 \item To ensure a global consensus and convergence,
 \textit{inter-clique connections}
 are introduced by connecting a small number of node pairs that are
 part of different cliques.
\end{itemize}
In the following, we introduce up to one inter-clique connection per node such that each clique has exactly one
edge with every other clique; see Figure~\ref{fig:d-cliques-figure} for the
corresponding D-Cliques network in the case of $n=100$ nodes and $c=10$
classes. We will explore sparser inter-clique topologies in Section~\ref{section:interclique-topologies}.

The mixing matrix $W$ required by D-SGD is obtained from standard
Metropolis-Hastings weights~\cite{xiao2004fast} computed from the above
topology, namely:
\begin{equation}
  W_{ij} = \begin{cases}
    \frac{1}{\max(\text{degree}(i), \text{degree}(j)) + 1} & \text{if}~i \neq
    j \text{ and } \{i,j\}\in E,\\
   1 - \sum_{l \neq i} W_{il} & \text{if}~i = j, \\
   0 & \text{otherwise}.
  \end{cases}
  \label{eq:metro}
\end{equation}
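As a sanity check that is not spelled out in the original text, the weights of Equation~\eqref{eq:metro} satisfy the requirements on $W$ (doubly stochastic and symmetric) stated for D-SGD. Symmetry holds because $\max(\text{degree}(i), \text{degree}(j))$ does not depend on the order of $i$ and $j$. Each row sums to one by definition of the diagonal entry, and so does each column by symmetry. Finally, since each of the $\text{degree}(i)$ neighbors of $i$ receives a weight of at most $\frac{1}{\text{degree}(i)+1}$, the diagonal entry remains positive:
\begin{equation*}
W_{ii} = 1 - \sum_{l \neq i} W_{il} \;\geq\; 1 - \frac{\text{degree}(i)}{\text{degree}(i)+1} = \frac{1}{\text{degree}(i)+1} > 0.
\end{equation*}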

We refer to Algorithm~\ref{Algorithm:D-Clique-Construction} in the appendix
for a formal account of D-Cliques construction. We note that it only requires
the knowledge of the local class distribution at each node. For the sake of
simplicity, we assume that D-Cliques is constructed from the global
knowledge of these distributions, which can easily be obtained by
decentralized averaging in a pre-processing step. 

The key idea of D-Cliques is that because the clique-level distribution $D_{
\textit{clique}} = \sum_{i
\in \textit{clique}} D_i$ is representative of the global distribution,
the local models of nodes across cliques remain rather close. Therefore, a
sparse inter-clique topology can be used, significantly reducing the total
number of edges without slowing down the convergence. Furthermore, the degree
of each node in the network remains low and even, making the D-Cliques
topology very well-suited to decentralized federated learning. 
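
For instance, consider the extreme local class bias studied in this paper and assume, for illustration, a clique $C$ made of exactly one node per class, with all nodes holding the same number of examples. The notations $p_i(y)$, for the proportion of class $y$ at node $i$, and $y_i$, for the single class held by node $i$, are introduced here only for this example. The normalized clique-level class proportions are then
\begin{equation*}
p_C(y) = \frac{1}{|C|}\sum_{i \in C} p_i(y) = \frac{1}{c}\sum_{i \in C} \mathds{1}[y_i = y] = \frac{1}{c} \quad \text{for every class } y,
\end{equation*}
which matches the global proportions when all classes are equally represented. With $c=10$, each class accounts for $10\%$ of the examples in every clique.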

\begin{figure}[t]
    \begin{subfigure}[b]{0.45\textwidth}
    \centering
    \includegraphics[width=\textwidth]{figures/fully-connected-cliques}
    \caption{\label{fig:d-cliques-figure} D-Cliques (fully-connected
    cliques)}
    \end{subfigure}
    \hfill
    % To regenerate figure, from results/mnist
    % python ../../../../Software/non-iid-topology-simulator/tools/plot_convergence.py fully-connected/all/2021-03-10-09:25:19-CET no-init-no-clique-avg/fully-connected-cliques/all/2021-03-12-11:12:49-CET --add-min-max --yaxis test-accuracy --ymin 80 --ymax 92.5 --labels '100 nodes non-IID fully-connected' '100 nodes non-IID d-cliques' --save-figure ../../figures/d-cliques-mnist-vs-fully-connected.png --legend 'lower right' --font-size 16 --linestyles 'solid' 'dashed'
    \begin{subfigure}[b]{0.54\textwidth}
    \centering
    \includegraphics[width=\textwidth]{figures/d-cliques-mnist-vs-fully-connected.png}
    \caption{\label{fig:d-cliques-example-convergence-speed} Convergence Speed
    on MNIST}
    \end{subfigure}
    
aurelien.bellet's avatar
aurelien.bellet committed
\caption{\label{fig:d-cliques-example} D-Cliques topology and convergence
speed on MNIST.}
aurelien.bellet's avatar
aurelien.bellet committed
Figure~\ref{fig:d-cliques-example-convergence-speed} illustrates the
performance of D-Cliques on MNIST with $n=100$ nodes. Observe that the
convergence speed is very close
to that of a fully-connected topology, and significantly better than with
a ring or a grid (see Figure~\ref{fig:iid-vs-non-iid-problem}). With
100 nodes, D-Cliques offers a reduction of $\approx 90\%$ in the number of edges
compared to a fully-connected topology. Nonetheless, there is still
significant variance in the accuracy across nodes, which is due to the bias
introduced by inter-clique edges. We address this issue in the next section.
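
As a back-of-the-envelope check of this reduction, assume (as a
simplification) that the $n=100$ nodes form cliques of size $c=10$ and that
every pair of cliques is connected by exactly one edge. Each node then has
$c-1$ intra-clique edges, and the $\frac{n}{c}-1$ inter-clique edges of each
clique are shared among its $c$ nodes, so that the average number of edges
per node is
\begin{equation*}
(c-1) + \frac{1}{c}\left(\frac{n}{c}-1\right) = 9 + \frac{9}{10} = 9.9,
\end{equation*}
compared to $n-1=99$ in a fully-connected topology, i.e., a reduction of
$1 - \frac{9.9}{99} = 90\%$.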
\section{Optimizing with Clique Averaging and Momentum}
\label{section:clique-averaging-momentum}
In this section, we present Clique Averaging. This feature, when added to D-SGD,
removes the bias caused by the inter-clique edges of
D-Cliques. We also show how it can be used to successfully implement momentum
for non-IID data.

\subsection{Clique Averaging: Debiasing Gradients from Inter-Clique Edges}
\label{section:clique-averaging}

While limiting the number of inter-clique connections reduces the
number of messages traveling on the network, it also introduces its own
bias.
Figure~\ref{fig:connected-cliques-bias} illustrates the problem in the
simple case of two cliques connected by one inter-clique edge (here,
between the green node of the left clique and the pink node of the right
clique). Let us focus on node A. With weights computed as in \eqref{eq:metro},
node A's self-weight is $\frac{12}{110}$, the weight between A and the green
node connected to B is $\frac{10}{110}$, and
all other neighbors of A have a weight of $\frac{11}{110}$. Therefore, the
gradient at A is biased towards its own class (pink) and against the green
class. A similar bias holds for all other nodes
without inter-clique edges with respect to their respective classes. For node
B, all its edge weights (including its self-weight) are equal to
$\frac{1}{11}$. However, the green class is represented twice (once as a clique
neighbor and once from the inter-clique edge), while all other classes are
represented only once. This biases the gradient toward the green class. The
combined effect of these two sources of bias is to increase the variance
of the local models across nodes.
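
These values can be recovered by hand under two assumptions that we state
explicitly: that \eqref{eq:metro} takes the standard Metropolis-Hastings form
$W_{ij} = \frac{1}{\max(d_i, d_j)+1}$ for neighbors $i \neq j$, with
self-weight $W_{ii} = 1 - \sum_{j \neq i} W_{ij}$, where $d_i$ denotes the
degree of node $i$, and that both cliques contain 10 nodes. Node A has
degree $9$, its green neighbor has degree $10$ (because of the inter-clique
edge), and its $8$ other neighbors have degree $9$, which gives
\begin{align*}
W_{A,\mathrm{green}} &= \frac{1}{\max(9,10)+1} = \frac{1}{11} = \frac{10}{110},\\
W_{A,j} &= \frac{1}{\max(9,9)+1} = \frac{1}{10} = \frac{11}{110}
\quad \text{for the 8 other neighbors } j,\\
W_{A,A} &= 1 - \frac{10}{110} - 8 \cdot \frac{11}{110} = \frac{12}{110}.
\end{align*}
Node B has degree $10$ and all its neighbors have degree at most $10$, so
each of its $10$ edges has weight $\frac{1}{11}$ and its self-weight is
$1 - \frac{10}{11} = \frac{1}{11}$.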
\begin{figure}[t]
         \centering
         \includegraphics[width=0.5\textwidth]{figures/connected-cliques-bias}
\caption{\label{fig:connected-cliques-bias} Illustrating the bias induced by
inter-clique connections (see main text).}
\end{figure}

We address this problem by adding \emph{Clique Averaging} to D-SGD
(Algorithm~\ref{Algorithm:Clique-Unbiased-D-PSGD}), which essentially
decouples gradient averaging from model averaging. The idea is to use only the
gradients of
neighbors within the same clique to compute the average gradient,
providing an equal representation to all classes. In contrast, all neighbors'
models, including those across inter-clique edges, participate in the model
averaging step as in the original version.

\begin{algorithm}[t]
   \caption{D-SGD with Clique Averaging at node $i$}
   \label{Algorithm:Clique-Unbiased-D-PSGD}
   \begin{algorithmic}[1]
        \State \textbf{Require} initial model parameters $x_i^{(0)}$, learning
        rate $\gamma$, mixing weights $W$, mini-batch size $m$, number of
        steps $K$
        \For{$k = 1,\ldots, K$}
          \State $s_i^{(k)} \gets \text{mini-batch sample of size $m$ drawn
          from~} D_i$
          \State $g_i^{(k)} \gets \frac{1}{|\textit{Clique}(i)|}\sum_{j \in \textit{Clique}(i)}  \nabla F(x_j^{(k-1)}; s_j^{(k)})$
          \State $x_i^{(k-\frac{1}{2})} \gets x_i^{(k-1)} - \gamma g_i^{(k)}$ 
          \State $x_i^{(k)} \gets \sum_{j \in N} W_{ji}^{(k)} x_j^{(k-\frac{1}{2})}$
        \EndFor
   \end{algorithmic}
\end{algorithm}

% To regenerate figure, from results/mnist:
% python ../../../../Software/non-iid-topology-simulator/tools/plot_convergence.py fully-connected/all/2021-03-10-09:25:19-CET no-init-no-clique-avg/fully-connected-cliques/all/2021-03-12-11:12:49-CET  no-init/fully-connected-cliques/all/2021-03-12-11:12:01-CET --add-min-max --yaxis test-accuracy --labels '100 nodes non-IID fully-connected' '100 nodes non-IID d-cliques w/o clique avg.' '100 nodes d-cliques non-IID w/ clique avg.' --legend 'lower right' --ymin 89 --ymax 92.5 --font-size 13 --save-figure ../../figures/d-clique-mnist-clique-avg.png --linestyles 'solid' 'dashed' 'dotted'
\begin{figure}[t]
         \centering
         \includegraphics[width=0.55\textwidth]{figures/d-clique-mnist-clique-avg}
\caption{\label{fig:d-clique-mnist-clique-avg} Effect of Clique Averaging on MNIST. Y-axis starts at 89.}
\end{figure}

As illustrated in Figure~\ref{fig:d-clique-mnist-clique-avg}, Clique Averaging
significantly reduces the variance of models across nodes and accelerates
convergence to the level obtained with a
fully-connected topology. Note that Clique Averaging induces a small
additional cost, as gradients
and models need to be sent in two separate rounds of messages. Nonetheless,
compared to fully connecting all nodes, the total number of messages is
reduced by $\approx 80\%$.
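
This figure can be checked with a short calculation. With $n=100$ nodes,
D-Cliques has an average of 9.9 edges per node (a value that corresponds to
cliques of size 10 pairwise connected by a single edge), against 99 in a
fully-connected topology. Counting two messages per edge and per round (one
gradient and one model), which is our reading of the accounting and an upper
bound since gradients are only needed on intra-clique edges, we obtain
\begin{equation*}
1 - \frac{2 \times 9.9}{99} = 1 - \frac{19.8}{99} = 80\%.
\end{equation*}
If gradients are sent on the 9 intra-clique edges only, a node sends
$9.9 + 9 = 18.9$ messages per round on average, and the reduction is
$1 - \frac{18.9}{99} \approx 81\%$.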

\subsection{Implementing Momentum with Clique Averaging}
\label{section:momentum}
Efficiently training high-capacity models usually requires additional
optimization techniques. In particular, momentum~\cite{pmlr-v28-sutskever13}
increases the magnitude of the components of the gradient that are shared
between several consecutive steps, and is critical for deep convolutional networks like
LeNet~\cite{lecun1998gradient,quagmire} to converge quickly. However, a direct
application of momentum in a non-IID setting can actually be very detrimental.
As illustrated in Figure~\ref{fig:d-cliques-cifar10-momentum-non-iid-effect}
for the case of LeNet on CIFAR10 with 100 nodes, D-Cliques with momentum
even fails to converge. Not using momentum actually gives faster
convergence, but there is a significant gap compared to the case of a single
IID node with momentum.
\begin{figure}[t]
    \centering 
    % To regenerate figure, from results/cifar10
    % python ../../../../Software/non-iid-topology-simulator/tools/plot_convergence.py 1-node-iid/all/2021-03-10-13:52:58-CET  no-init-no-clique-avg/fully-connected-cliques/all/2021-03-13-18:34:35-CET no-init-no-clique-avg-no-momentum/fully-connected-cliques/all/2021-03-26-13:47:35-CET/ --legend 'upper right' --add-min-max --labels '1-node IID w/ momentum'  '100 nodes non-IID d-cliques w/ momentum' '100 nodes non-IID d-cliques w/o momentum'  --font-size 14 --yaxis test-accuracy --save-figure ../../figures/d-cliques-cifar10-momentum-non-iid-effect.png --ymax 100 --linestyles 'solid' 'dashed' 'dotted'         
    \begin{subfigure}[b]{0.45\textwidth}
    \centering
    \includegraphics[width=\textwidth]{figures/d-cliques-cifar10-momentum-non-iid-effect}
    \caption{\label{fig:d-cliques-cifar10-momentum-non-iid-effect} Without Clique Averaging }
    \end{subfigure}
    \hfill
    % To regenerate figure, from results/cifar10
    % python ../../../../Software/non-iid-topology-simulator/tools/plot_convergence.py 1-node-iid/all/2021-03-10-13:52:58-CET no-init/fully-connected-cliques/all/2021-03-13-18:32:55-CET --legend 'upper right' --add-min-max --labels '1-node IID w/ momentum' '100 nodes non-IID d-clique w/ momentum' --font-size 14 --yaxis test-accuracy --save-figure ../../figures/d-cliques-cifar10-momentum-non-iid-clique-avg-effect.png --ymax 100 --linestyles 'solid' 'dashed' 'dotted' 
    \begin{subfigure}[b]{0.45\textwidth}
    \centering
    \includegraphics[width=\textwidth]{figures/d-cliques-cifar10-momentum-non-iid-clique-avg-effect}
    \caption{\label{fig:d-cliques-cifar10-momentum-non-iid-clique-avg-effect} With Clique Averaging}
    \end{subfigure}
\caption{\label{fig:cifar10-momentum} Non-IID Effect of Momentum on CIFAR10 with LeNet}
\end{figure}
We show here that Clique Averaging (Section~\ref{section:clique-averaging})
allows us to compute an unbiased momentum from the
unbiased average gradient $g_i^{(k)}$ of Algorithm~\ref{Algorithm:Clique-Unbiased-D-PSGD}:
\begin{equation}
v_i^{(k)} \leftarrow m v_i^{(k-1)} +  g_i^{(k)} 
\end{equation}
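where $m$ denotes the momentum coefficient (not to be confused with the
mini-batch size of Algorithm~\ref{Algorithm:Clique-Unbiased-D-PSGD}).
It then suffices to modify the original gradient step to use momentum: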
\begin{equation}
x_i^{(k-\frac{1}{2})} \leftarrow x_i^{(k-1)} - \gamma v_i^{(k)} 
\end{equation}
As shown in
Figure~\ref{fig:d-cliques-cifar10-momentum-non-iid-clique-avg-effect}, this
simple modification restores the benefits of momentum and closes the gap
with the centralized setting.
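
To see why an unbiased average gradient is what matters here, one can unroll
the momentum recursion. Assuming the initialization $v_i^{(0)} = 0$ (which is
not specified above) and a momentum coefficient $0 \leq m < 1$, we get
\begin{equation*}
v_i^{(k)} = \sum_{l=1}^{k} m^{k-l} g_i^{(l)},
\end{equation*}
that is, a geometrically weighted sum of the past clique-averaged gradients.
If every $g_i^{(l)}$ gives equal weight to all classes, then so does
$v_i^{(k)}$. Conversely, if each gradient carried a bias that stayed constant
across steps, that bias would be multiplied by
$\sum_{l=1}^{k} m^{k-l} = \frac{1-m^k}{1-m}$, i.e., by up to $\frac{1}{1-m}$,
instead of being averaged out.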
\section{Comparative Evaluation and Extensions}
\label{section:non-clustered}

In this section, we first compare D-Cliques to alternative topologies to
confirm the relevance of our main design choices. Then,
we evaluate some extensions of D-Cliques to further reduce the number of
inter-clique connections so as to scale gracefully with the number of
nodes.
\subsection{Comparing D-Cliques to Other Sparse Topologies}
We demonstrate the advantages of D-Cliques over alternative sparse topologies
that have a similar number of edges. First, we consider topologies in which
the neighbors of each node are selected at random (hence without any clique
structure).
Specifically, for $n=100$ nodes, we
construct a random topology such that each node has exactly 10 edges, which is
similar to the average of 9.9 edges per node of our D-Cliques topology
(Figure~\ref{fig:d-cliques-figure}). To better understand the role of
the clique structure beyond merely ensuring class representativity among
neighbors,
we also compare to a random topology similar to the one described above except
that edges are
chosen such that each node has neighbors of all possible classes. Finally, we
also implement an analog of Clique Averaging for these random topologies,
where all nodes de-bias their gradient based on the class distribution of
their neighbors. In the latter case, since nodes do not form a clique, each
node obtains a different average gradient.

The results for MNIST and CIFAR10 are shown in
Figure~\ref{fig:d-cliques-comparison-to-non-clustered-topologies}. For MNIST,
a purely random topology has higher variance and lower convergence speed than
D-Cliques (with or without Clique Averaging), while a random topology with
class representativity performs similarly to D-Cliques without Clique
Averaging. However, and perhaps surprisingly, a random topology with unbiased
gradient performs slightly worse than without it. In any case, D-Cliques with
Clique Averaging outperforms all random topologies, showing that the clique
structure has a small but noticeable effect on the average accuracy and
significantly reduces the variance across nodes in this setup.
\begin{figure}[t]
         \begin{subfigure}[b]{0.48\textwidth}
% To regenerate the figure, from directory results/mnist
% python ../../../../Software/non-iid-topology-simulator/tools/plot_convergence.py fully-connected-cliques/all/2021-03-10-10:19:44-CET no-init-no-clique-avg/fully-connected-cliques/all/2021-03-12-11:12:49-CET  random-10/all/2021-07-23-11:59:56-CEST  random-10-diverse/all/2021-03-17-20:28:35-CET --labels 'd-clique (fcc)' 'd-clique (fcc) no clique avg.' '10 random edges' '10 random edges (all classes represented)' --add-min-max --legend 'lower right' --ymin 80 --ymax 92.5 --yaxis test-accuracy --save-figure ../../figures/d-cliques-mnist-linear-comparison-to-non-clustered-topologies.png --font-size 13 --linestyles 'solid' 'dashed' 'dotted' 'dashdot'
         \centering
         \includegraphics[width=\textwidth]{figures/d-cliques-mnist-linear-comparison-to-non-clustered-topologies}
                  \caption{MNIST with Linear Model}
         \end{subfigure}
                 \hfill                      
% To regenerate the figure, from directory results/cifar10
% python ../../../../Software/non-iid-topology-simulator/tools/plot_convergence.py no-init/fully-connected-cliques/all/2021-03-13-18:32:55-CET no-init-no-clique-avg/fully-connected-cliques/all/2021-03-13-18:34:35-CET random-10/all/2021-07-23-14:33:48-CEST  random-10-diverse/all/2021-03-17-20:30:41-CET random-10-diverse-unbiased-gradient/all/2021-03-17-20:31:14-CET --labels 'd-clique (fcc) clique avg.' 'd-clique (fcc) no clique avg.' '10 random edges' '10 random edges (all classes repr.)' '10 random (all classes repr.) with unbiased grad.' --add-min-max --legend 'upper left' --yaxis test-accuracy --save-figure ../../figures/d-cliques-cifar10-linear-comparison-to-non-clustered-topologies.png --ymax 119 --font-size 13  --linestyles 'solid' 'dashed' 'dotted' 'dashdot' 'solid' --markers '' '' '' '' 'o'
        \begin{subfigure}[b]{0.48\textwidth}
        \centering
         \includegraphics[width=\textwidth]{figures/d-cliques-cifar10-linear-comparison-to-non-clustered-topologies}
         \caption{CIFAR10 with LeNet}
     \end{subfigure} 
 \caption{\label{fig:d-cliques-comparison-to-non-clustered-topologies} Comparison to Non-Clustered Topologies} 
\end{figure}
On the harder CIFAR10 dataset with a deep convolutional network, the
differences are much more dramatic:
D-Cliques with Clique Averaging and momentum turns out to be critical for fast
convergence.
Crucially, all random topologies fail to converge to a good solution. This
confirms that our clique structure is important to reduce variance
across nodes and improve convergence. The difference with the previous
experiment seems to be due both to the use of a higher-capacity model and to
the intrinsic characteristics of the datasets.

While the previous experiments suggest that our clique structure is
instrumental in obtaining good performance, one may wonder whether
intra-clique full connectivity is actually necessary.
Figure~\ref{fig:d-cliques-intra-connectivity} shows the convergence speed of
a D-Cliques topology where cliques have been sparsified by randomly
removing 1 or 5 undirected edges per clique (out of 45). Strikingly, both for MNIST and
CIFAR10, removing just a single edge from each clique has a
significant effect on the
convergence speed. On CIFAR10, it even entirely negates the
benefits of D-Cliques.

Overall, these results show that achieving fast convergence on non-IID
data with sparse topologies requires a very careful design, as we have
proposed with D-Cliques.

\begin{figure}[t]
     \centering

\begin{subfigure}[htbp]{0.48\textwidth}
     \centering   
% To regenerate the figure, from directory results/mnist
% python ../../../../Software/non-iid-topology-simulator/tools/plot_convergence.py no-init/fully-connected-cliques/all/2021-03-12-11:12:01-CET rm-1-edge/all/2021-03-18-17:28:27-CET rm-1-edge-unbiased-grad/all/2021-03-18-17:28:47-CET --add-min-max --ymin 85 --ymax 92.5 --legend 'lower right' --yaxis test-accuracy --labels 'fcc, clique grad.' 'fcc -1 edge/clique, no clique grad.' 'fcc -1 edge/clique, clique grad.' --save-figure ../../figures/d-cliques-mnist-clique-clustering-fcc-minus-1-edge.png  --font-size 13  --linestyle 'solid' 'dashed' 'dotted' 
         \includegraphics[width=\textwidth]{figures/d-cliques-mnist-clique-clustering-fcc-minus-1-edge}     
\caption{\label{fig:d-cliques-mnist-clique-clustering-minus-1-edge} MNIST (-1 edge/clique)}
\end{subfigure}
\hfill
\begin{subfigure}[htbp]{0.48\textwidth}
     \centering
% To regenerate the figure, from directory results/cifar10
% python ../../../../Software/non-iid-topology-simulator/tools/plot_convergence.py no-init/fully-connected-cliques/all/2021-03-13-18:32:55-CET rm-1-edge/all/2021-03-18-17:29:58-CET rm-1-edge-unbiased-grad/all/2021-03-18-17:30:17-CET --add-min-max --ymax 80 --legend 'upper left' --yaxis test-accuracy --labels 'fcc, clique grad.' 'fcc -1 edge/clique, no clique grad.' 'fcc -1 edge/clique, clique grad.' --save-figure ../../figures/d-cliques-cifar10-clique-clustering-fcc-minus-1-edge.png --font-size 13 --linestyle 'solid' 'dashed' 'dotted'
         \includegraphics[width=\textwidth]{figures/d-cliques-cifar10-clique-clustering-fcc-minus-1-edge}
\caption{\label{fig:d-cliques-cifar10-clique-clustering-minus-1-edge} CIFAR10 (-1 edge/clique)}
\end{subfigure}

\begin{subfigure}[htbp]{0.48\textwidth}
     \centering  
% To regenerate the figure, from directory results/mnist
% python ../../../../Software/non-iid-topology-simulator/tools/plot_convergence.py no-init/fully-connected-cliques/all/2021-03-12-11:12:01-CET rm-5-edges/all/2021-03-18-17:29:10-CET rm-5-edges-unbiased-grad/all/2021-03-18-17:29:36-CET --add-min-max --ymin 85 --ymax 92.5 --legend 'lower right' --yaxis test-accuracy --labels 'fcc, clique grad.' 'fcc -5 edges/clique, no clique grad.' 'fcc -5 edges/clique, clique grad.' --save-figure ../../figures/d-cliques-mnist-clique-clustering-fcc-minus-5-edges.png  --font-size 13 --linestyle 'solid' 'dashed' 'dotted'   
         \includegraphics[width=\textwidth]{figures/d-cliques-mnist-clique-clustering-fcc-minus-5-edges}     
\caption{\label{fig:d-cliques-mnist-clique-clustering-minus-5-edges} MNIST (-5 edges/clique)}
\end{subfigure}
\hfill
\begin{subfigure}[htbp]{0.48\textwidth}
     \centering
% To regenerate the figure, from directory results/cifar10
% python ../../../../Software/non-iid-topology-simulator/tools/plot_convergence.py no-init/fully-connected-cliques/all/2021-03-13-18:32:55-CET rm-5-edges/all/2021-03-18-17:30:38-CET rm-5-edges-unbiased-grad/all/2021-03-18-17:31:04-CET --add-min-max --ymax 80 --legend 'upper left' --yaxis test-accuracy --labels 'fcc, clique grad.' 'fcc -5 edges/clique, no clique grad.'  'fcc -5 edges/clique, clique grad.' --save-figure ../../figures/d-cliques-cifar10-clique-clustering-fcc-minus-5-edges.png --font-size 13 --linestyle 'solid' 'dashed' 'dotted'
         \includegraphics[width=\textwidth]{figures/d-cliques-cifar10-clique-clustering-fcc-minus-5-edges}
\caption{\label{fig:d-cliques-cifar10-clique-clustering-minus-5-edges} CIFAR10 (-5 edges/clique)}
\end{subfigure}

\caption{\label{fig:d-cliques-intra-connectivity} Importance of Intra-Clique Full-Connectivity}
\end{figure}

\subsection{Scaling up D-Cliques with Sparser Inter-Clique Topologies}
\label{section:interclique-topologies}

So far, we have used a fully-connected inter-clique topology for D-Cliques,
which has the advantage of bounding the
\textit{path length}\footnote{The \textit{path length} is the number of edges on a shortest path between two nodes.} to $3$ between any pair of nodes. This choice requires
$\frac{n}{c}(\frac{n}{c} - 1)$ inter-clique edges, which scales quadratically
in the number of nodes $n$ for a given clique size $c$.\footnote{We consider \textit{directed} edges in the analysis: the number of undirected edges is half as large, which does not affect the asymptotic behavior.} This can become significant at larger scales when $n$ is
large compared to $c$.
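
For concreteness, take cliques of size $c=10$ (the value assumed for this
illustration). The number of directed inter-clique edges then grows as
\begin{equation*}
n=100:\ \frac{n}{c}\left(\frac{n}{c}-1\right) = 10 \times 9 = 90,
\qquad
n=1000:\ 100 \times 99 = 9900,
\end{equation*}
so a tenfold increase in $n$ multiplies the number of inter-clique edges by
$110$. By comparison, the number of directed intra-clique edges, $n(c-1)$,
only grows from $900$ to $9000$.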

In this last series of experiments, we evaluate the effect of choosing sparser
inter-clique topologies on the convergence speed for a larger network of 1000
nodes. We compare the scalability and convergence speed of several
D-Cliques variants, which all use $O(nc)$ edges
to create cliques as a starting point.

We first measure the convergence speed of inter-clique topologies whose number
of edges scales linearly with the number of nodes. Among those, the
\textit{ring} uses (almost) the fewest possible edges: it
uses $\frac{2n}{c}$ inter-clique edges but its average path length between nodes
also scales linearly.
We also consider another topology, which we call \textit{fractal}, that provides a
logarithmic
bound on the average path length. In this hierarchical scheme,
cliques are assembled in larger groups of $c$ cliques that are connected internally with one edge per
pair of cliques, but with only one edge between pairs of larger groups. The
topology is built recursively such that $c$ groups will themselves form a
larger group at the next level up. This results in at most $c$ edges per node
if edges are evenly distributed: i.e., each group within the same level adds
at most $c-1$ edges to other groups, leaving one node per group with $c-1$
edges that can receive an additional edge to connect with other groups at the next level.
Since nodes have at most $c$ edges, $n$ nodes have at most $nc$ edges, therefore
the number of edges in this fractal scheme indeed scales linearly in the number of nodes.

Second, we look at another scheme
in which the number of edges scales in a near, but not quite, linear fashion.
We propose to connect cliques according to a
small-world-like topology~\cite{watts2000small} applied on top of a
ring~\cite{stoica2003chord}. In this scheme, cliques are first arranged in a
ring. Then each clique adds symmetric edges, both clockwise and
counter-clockwise on the ring, with the $m$ closest cliques in sets of
cliques that are exponentially bigger the further they are on the ring (see
Algorithm~\ref{Algorithm:Smallworld} in the appendix for
details on the construction). This ensures a good connectivity with other
cliques that are close on the ring, while still keeping the average
path length small. This scheme uses $2m\,\frac{n}{c}\log(\frac{n}{c})$ inter-clique edges and
therefore grows as $O(n\log(n))$ with the number of nodes.

Figure~\ref{fig:d-cliques-cifar10-convolutional} shows the convergence
speed of all the above schemes on MNIST and CIFAR10, compared to the ideal
baseline
of a
single IID node performing the same number of updates per epoch (representing
the fastest convergence speed achievable if the topology had no impact). Among the linear schemes, the ring
topology converges but is much slower than our fractal scheme. Among the super-linear schemes, the small-world
topology has a convergence speed that is almost the same as with a
fully-connected inter-clique topology but with 22\% fewer edges
(14.5 edges on average instead of 18.9).

While the small-world inter-clique topology shows promising scaling behavior, the
fully-connected topology still offers
significant benefits with 1000 nodes, as it represents a 98\% reduction in the
number of edges compared to fully connecting individual nodes (18.9 edges on
average instead of 999) and a 96\% reduction in the number of messages (37.8
messages per round per node on average instead of 999). We refer to
Appendix~\ref{app:scaling} for additional results comparing the convergence
speed across different numbers of nodes. Overall, these results
show that D-Cliques scales well with the number of nodes.
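
The reductions reported above for 1000 nodes follow directly from the average
numbers of edges and messages per node:
\begin{equation*}
1 - \frac{18.9}{999} \approx 98.1\%,
\qquad
1 - \frac{37.8}{999} \approx 96.2\%.
\end{equation*}
Here, assuming cliques of size $c=10$ that are pairwise connected by one
edge, $18.9 = 9 + 9.9$ decomposes into $c-1 = 9$ intra-clique edges and
$\frac{1}{c}(\frac{n}{c}-1) = 9.9$ inter-clique edges per node on average,
while $37.8 = 2 \times 18.9$ corresponds to two messages (one gradient and one
model) per edge.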
\begin{figure}[t]
     \centering
       % To regenerate the figure, from directory results/mnist
 % python ../../../../Software/non-iid-topology-simulator/tools/plot_convergence.py 1-node-iid/all/2021-03-10-09:20:03-CET ../scaling/1000/mnist/fractal-cliques/all/2021-03-14-17:41:59-CET ../scaling/1000/mnist/clique-ring/all/2021-03-13-18:22:36-CET     --add-min-max --yaxis test-accuracy --legend 'lower right' --ymin 84 --ymax 92.5 --labels '1 node IID' 'd-cliques (fractal)' 'd-cliques (ring)'  --save-figure ../../figures/d-cliques-mnist-1000-nodes-comparison-linear.png --font-size 13 --linestyles 'solid' 'dashed' 'dotted'
     \begin{subfigure}[b]{0.48\textwidth}
         \centering
            \includegraphics[width=\textwidth]{figures/d-cliques-mnist-1000-nodes-comparison-linear}
             \caption{\label{fig:d-cliques-mnist-1000-nodes-comparison-linear} MNIST with Linear Model: Linear Inter-clique Topologies.}
     \end{subfigure}
     \hfill
     % To regenerate the figure, from directory results/cifar10
% python ../../../../Software/non-iid-topology-simulator/tools/plot_convergence.py 1-node-iid/all/2021-03-10-13:52:58-CET ../scaling/1000/cifar10/fractal-cliques/all/2021-03-14-17:42:46-CET ../scaling/1000/cifar10/clique-ring/all/2021-03-14-09:55:24-CET  --add-min-max --yaxis test-accuracy --labels '1-node IID' 'd-cliques (fractal)' 'd-cliques (ring)' --legend 'lower right' --save-figure ../../figures/d-cliques-cifar10-1000-vs-1-node-test-accuracy-linear.png --font-size 13 --linestyles 'solid' 'dashed' 'dotted'
     \begin{subfigure}[b]{0.48\textwidth}
         \centering
         \includegraphics[width=\textwidth]{figures/d-cliques-cifar10-1000-vs-1-node-test-accuracy-linear}
\caption{\label{fig:d-cliques-cifar10-1000-vs-1-node-test-accuracy-linear}  CIFAR10 with LeNet Model: Linear Inter-clique Topologies.}
     \end{subfigure}
    
     
 % To regenerate the figure, from directory results/mnist
 % python ../../../../Software/non-iid-topology-simulator/tools/plot_convergence.py 1-node-iid/all/2021-03-10-09:20:03-CET ../scaling/1000/mnist/fully-connected-cliques/all/2021-03-14-17:56:26-CET ../scaling/1000/mnist/smallworld-logn-cliques/all/2021-03-23-21:45:39-CET --add-min-max --yaxis test-accuracy --legend 'lower right' --ymin 84 --ymax 92.5 --labels '1 node IID'  'd-cliques (fully-connected cliques)' 'd-cliques (smallworld)'  --save-figure ../../figures/d-cliques-mnist-1000-nodes-comparison-super-linear.png --font-size 13 --linestyles 'solid' 'dashed' 'dotted'
     \begin{subfigure}[b]{0.48\textwidth}
         \centering
            \includegraphics[width=\textwidth]{figures/d-cliques-mnist-1000-nodes-comparison-super-linear}
             \caption{\label{fig:d-cliques-mnist-1000-nodes-comparison-super-linear} MNIST with Linear Model: Superlinear Inter-clique Topologies.}
     \end{subfigure}
     \hfill
     % To regenerate the figure, from directory results/cifar10
% python ../../../../Software/non-iid-topology-simulator/tools/plot_convergence.py 1-node-iid/all/2021-03-10-13:52:58-CET ../scaling/1000/cifar10/fully-connected-cliques/all/2021-03-14-17:41:20-CET ../scaling/1000/cifar10/smallworld-logn-cliques/all/2021-03-23-22:13:57-CET  --add-min-max --yaxis test-accuracy --labels '1-node IID' 'd-cliques (fully-connected cliques)' 'd-cliques (smallworld)' --legend 'lower right' --save-figure ../../figures/d-cliques-cifar10-1000-vs-1-node-test-accuracy-super-linear.png --font-size 13 --linestyles 'solid' 'dashed' 'dotted'
     \begin{subfigure}[b]{0.48\textwidth}
         \centering
         \includegraphics[width=\textwidth]{figures/d-cliques-cifar10-1000-vs-1-node-test-accuracy-super-linear}
\caption{\label{fig:d-cliques-cifar10-1000-vs-1-node-test-accuracy-super-linear}  CIFAR10 with LeNet Model: Superlinear Inter-clique Topologies.}
     \end{subfigure}
\caption{\label{fig:d-cliques-cifar10-convolutional} D-Cliques Convergence Speed with 1000 nodes, non-IID, Constant Updates per Epoch, with Different Inter-Clique Topologies.}
\end{figure}

\section{Related Work}
\label{section:related-work}
In this section, we review some related work on dealing with non-IID data in
federated learning, and on the role of topology in fully decentralized
algorithms.

\paragraph{Dealing with non-IID data in server-based FL.}
aurelien.bellet's avatar
aurelien.bellet committed
Non-IID data is not much of an issue in server-based FL if
clients send their parameters to the server after each gradient update.
Problems arise when one seeks to reduce
aurelien.bellet's avatar
aurelien.bellet committed
the number of communication rounds by allowing each participant to perform
multiple local updates, as in the popular FedAvg algorithm 
aurelien.bellet's avatar
aurelien.bellet committed
\cite{mcmahan2016communication}. Indeed, non-IID data can prevent
such algorithms from
converging to a good solution \cite{quagmire,scaffold}. This led to the design
of algorithms that are specifically designed to mitigate the impact
of non-IID data while performing
aurelien.bellet's avatar
aurelien.bellet committed
multiple local updates, using adaptive client sampling \cite{quagmire}, update
aurelien.bellet's avatar
aurelien.bellet committed
corrections \cite{scaffold} or regularization in the local objective 
\cite{fedprox}. Another direction is to embrace the non-IID scenario by
learning personalized models for each client 
\cite{smith2017federated,perso_fl_mean,maml,moreau}.
We note that recent work explores rings of server-based topologies 
\cite{tornado}, but the focus there is not on dealing with non-IID data but
on making server-based FL more scalable to a large number of clients.

\paragraph{Dealing with non-IID data in fully decentralized FL.}
Non-IID data is known to negatively impact the convergence speed
of fully decentralized FL algorithms in practice \cite{jelasity}. Aside from approaches that aim to learn personalized models \cite{Vanhaesebrouck2017a,Zantedeschi2020a}, this
motivated the design of algorithms with modified updates based on variance
reduction \cite{tang18a}, momentum correction \cite{momentum_noniid},
cross-gradient
aggregation \cite{cross_gradient}, or multiple averaging steps
between updates (see \cite{consensus_distance} and references therein). These
algorithms
typically require significantly more communication and/or computation, and
have only been evaluated on small-scale networks with a few tens of
nodes.\footnote{We
also observed that \cite{tang18a} is subject to numerical
instabilities when run on topologies other than rings. When
the rows and columns of $W$ do not exactly
sum to $1$ (due to finite precision), these small differences get amplified by
the proposed updates and make the algorithm diverge.}
In contrast, D-Cliques focuses on the design of a sparse topology which is
able to compensate for the effect of non-IID data and scales to large
networks. We do not modify the simple
and efficient D-SGD
algorithm \cite{lian2017d-psgd} beyond removing some neighbor
contributions
that otherwise bias the gradient direction.
\paragraph{Impact of topology in fully decentralized FL.} It is well
known
that the choice of network topology can affect the
convergence of fully decentralized algorithms. In theoretical convergence
rates, this is typically accounted
for by a dependence on the spectral gap of
the network, see for instance 
\cite{Duchi2012a,Colin2016a,lian2017d-psgd,Nedic18}.
However, for IID data, practice contradicts these classic
results as fully decentralized algorithms have been observed to converge
essentially as fast
on sparse topologies like rings or grids as they do on a fully connected
network \cite{lian2017d-psgd,Lian2018}. Recent work 
\cite{neglia2020,consensus_distance} sheds light on this phenomenon with refined convergence analyses based on differences between gradients or parameters across nodes, which are typically
smaller in the IID case. However, these results do not give any clear insight
regarding the role of the topology in the non-IID case. We note that some work
has gone into designing efficient topologies to optimize the use of
network resources (see, e.g., \cite{marfoq}), but the topology is chosen
independently of how data is distributed across nodes. In summary, the role
of topology in the non-IID data scenario is not well understood and we are not
aware of prior work focusing on this question. Our work is the first
to show that an
appropriate choice of data-dependent topology can effectively compensate for
non-IID data.

\section{Conclusion}
\label{section:conclusion}
We proposed D-Cliques, a sparse topology that recovers the convergence
speed of a fully connected network in the presence of local class bias.
D-Cliques is based on assembling subsets of nodes into cliques such
that the clique-level class distribution is representative of the global
distribution, thereby locally recovering IIDness. Cliques are joined in a
sparse inter-clique topology so that
they quickly converge to the same model. We proposed Clique
Averaging to remove the non-IID bias in gradient computation by
averaging gradients only with other nodes within the clique. Clique Averaging
can in turn be used to implement unbiased momentum to recover the convergence
speed usually only possible with IID mini-batches. Through our experiments, we
showed that the clique structure of D-Cliques is critical in obtaining these
results and that a small-world inter-clique topology with only $O(n \log (n))$ 
edges achieves the best compromise between
convergence speed and scalability with the number of nodes.

D-Cliques thus appears to be very promising to reduce bandwidth
usage on FL servers and to implement fully decentralized alternatives in a
wider range of applications where global coordination is impossible or costly.
For instance, the presence and relative frequency of classes in each node
could be computed using PushSum~\cite{kempe2003gossip}, and the topology could
be constructed in a decentralized and adaptive way with
PeerSampling~\cite{jelasity2007gossip}. This will be investigated in future work.
We also believe that our ideas can be useful to deal
with more general types of data non-IIDness beyond the important case of
local class bias that we studied in this paper. An important example is
covariate shift or feature distribution skew \cite{kairouz2019advances}, for
which local density estimates could be used as a basis to construct cliques that
approximately recover the global distribution.
\section{Acknowledgments}
\label{section:acknowledgement}

This research was partially supported by French grants ANR-16-CE23-0016 
(Project PAMELA) and ANR-20-CE23-0015 (Project PRIDE), and by the European
Union's Horizon 2020 Research and Innovation Program under Grant Agreement No.
825081 COMPRISE.

%
% ---- Bibliography ----
%
% BibTeX users should specify bibliography style 'splncs04'.
% References will then be sorted and formatted in the correct style.
%
 \bibliographystyle{splncs04}
 \bibliography{main}
 \section{Detailed Algorithms}
 We present a more detailed and precise explanation of the two main algorithms
 of the paper: one for constructing D-Cliques
 (Algorithm~\ref{Algorithm:D-Clique-Construction}) and one for establishing a small-world
 inter-clique topology (Algorithm~\ref{Algorithm:Smallworld}).
 \subsection{D-Cliques Construction}
 Algorithm~\ref{Algorithm:D-Clique-Construction} shows the overall approach
 for constructing a D-Cliques topology in the non-IID case.\footnote{An IID
 version of D-Cliques, in which each node has an equal number of examples of
 all classes, can be implemented by picking $\#L$ nodes per clique at random.}
 It expects the following inputs: $L$, the set of all classes present in the global distribution $D = \bigcup_{i \in N} D_i$; $N$, the set of all nodes; a function $\textit{classes}(S)$, which, given a subset $S$ of nodes in $N$, returns the set of classes in their joint local distributions ($D_S = \bigcup_{i \in S} D_i$); a function $\textit{intraconnect}(DC)$, which, given a set of cliques $DC$ (a set of sets of nodes), creates a set of edges ($\{\{i,j\}, \dots \}$) connecting all nodes within each clique to one another; a function $\textit{interconnect}(DC)$, which, given a set of cliques, creates a set of edges ($\{\{i,j\}, \dots \}$) connecting nodes belonging to different cliques; and a function $\textit{weights}(E)$, which, given a set of edges, returns the weighted matrix $W_{ij}$.  Algorithm~\ref{Algorithm:D-Clique-Construction} returns both $W_{ij}$, for use in D-SGD (Algorithms~\ref{Algorithm:D-PSGD} and~\ref{Algorithm:Clique-Unbiased-D-PSGD}), and $DC$, for use with Clique Averaging (Algorithm~\ref{Algorithm:Clique-Unbiased-D-PSGD}).
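
 As an illustration, consider a hypothetical toy instance (chosen only for exposition, not one of the evaluated configurations) with $L = \{a, b\}$ and $N = \{1, 2, 3, 4\}$, where $\textit{classes}(\{1\}) = \textit{classes}(\{3\}) = \{a\}$ and $\textit{classes}(\{2\}) = \textit{classes}(\{4\}) = \{b\}$. One possible outcome of the construction is
 \[
   DC = \{\{1, 2\}, \{3, 4\}\}, \qquad \textit{intraconnect}(DC) = \{\{1, 2\}, \{3, 4\}\},
 \]
 since $\textit{classes}(\{1, 2\}) = \textit{classes}(\{3, 4\}) = L$. If we further assume, purely for the sake of the example, that $\textit{interconnect}(DC)$ returns the single edge $\{\{2, 3\}\}$, the resulting topology is the path $1 - 2 - 3 - 4$, and $\textit{weights}$ is applied to these three edges.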
 
   \begin{algorithm}[h]
   \caption{D-Cliques Construction}
   \label{Algorithm:D-Clique-Construction}
   \begin{algorithmic}[1]
        \State \textbf{Require:} set of classes globally present $L$, 
        \State~~ set of all nodes $N = \{ 1, 2, \dots, n \}$,
        \State~~ fn $\textit{classes}(S)$ that returns the classes present in a subset of nodes $S$,
        \State~~ fn $\textit{intraconnect}(DC)$ that returns edges intraconnecting cliques of $DC$,
        \State~~ fn $\textit{interconnect}(DC)$ that returns edges interconnecting cliques of $DC$ (Sec.~\ref{section:interclique-topologies}),
         \State~~ fn $\textit{weights}(E)$ that assigns weights to edges in $E$ 
        \State $R \leftarrow \{ n~\text{for}~n \in N \}$ \Comment{Remaining nodes}
        \State $DC \leftarrow \emptyset$ \Comment{D-Cliques}
        \State $\textit{C} \leftarrow \emptyset$ \Comment{Current Clique}
        \While{$R \neq \emptyset$}
		\State $n \leftarrow \text{pick}~1~\text{from}~\{ m \in R \mid \textit{classes}(\{m\}) \nsubseteq \textit{classes}(\textit{C}) \}$
		\State $R \leftarrow R \setminus \{ n \}$
		\State $C \leftarrow C \cup \{ n \}$
		\If{$\textit{classes}(C) = L$}
		    \State $DC \leftarrow DC \cup \{ C \}$