Commit 860eb39f authored by Erick Lavoie

Removed editing comments and contribution statement

parent 154c51d8
@@ -4,7 +4,6 @@
\usepackage{amsmath}
\usepackage{amsfonts}
\usepackage{amssymb}
-%\usepackage{amsthm}
\usepackage{graphicx}
\usepackage{xcolor}
\usepackage{soul}
@@ -25,8 +24,7 @@
% \renewcommand\UrlFont{\color{blue}\rmfamily}
\begin{document}
%
-%\title{D-Cliques: Topology can compensate NonIIDness in Decentralized Federated Learning}
\title{D-Cliques: Compensating NonIIDness in Decentralized Federated Learning
with Topology}
%
@@ -34,9 +32,9 @@ with Topology}
% If the paper title is too long for the running head, you can set
% an abbreviated paper title here
%
-\author{Aur\'elien Bellet\inst{1}\thanks{Authors in alphabetical order of last names, see Section~\ref{section:contribution-statement} for statement of respective contributions.} \and
+\author{Aur\'elien Bellet\inst{1} \and
Anne-Marie Kermarrec\inst{2} \and
-Erick Lavoie\inst{2}}
+Erick Lavoie\inst{2}\thanks{Authors in alphabetical order of last names.}}
%%
\authorrunning{A. Bellet, A-M. Kermarrec, E. Lavoie}
%% First names are abbreviated in the running head.
@@ -67,8 +65,6 @@ provides similar convergence speed as a fully-connected topology with a
significant reduction in the number of edges and messages. In a 1000-node
topology, D-Cliques requires 98\% fewer edges and 96\% fewer total messages,
with further possible gains using a small-world topology across cliques.
-% Our study paves the way for tackling more general types of data non-IIDness
-% through the design of appropriate topologies.
\keywords{Decentralized Learning \and Federated Learning \and Topology \and
Non-IID Data \and Stochastic Gradient Descent}
@@ -78,11 +74,6 @@ Non-IID Data \and Stochastic Gradient Descent}
%
\section{Introduction}
-% 1/ Decentralized FL approaches can be more scalable than Centralized FL approach when the number of nodes is large
-% 2/ It is well known the topology can affect convergence of decentralized algorithms, as shown by classic convergence analysis. However the effect of topology has been observed to be often quite small in practice. This is because most of these results were obtained for iid data.
-% 3/ In this paper, we show that the effect of topology is very significant for non-iid data. Unlike for centralized FL approaches, this happens even when nodes perform a single local update before averaging. We propose an approach to design a sparse data-aware topology which recovers the convergence speed of a centralized approach.
-% 4/ An originality of our approach is to work at the topology level without changing the original efficient and simple D-SGD algorithm. Other work to mitigate the effect of non-iid on decentralized algorithms are based on performing modified updates (eg with variance reduction) or multiple averaging steps.
Machine learning is currently shifting from a \emph{centralized}
paradigm, in which models are trained on data located on a single machine or
in a data center, to \emph{decentralized} ones.
@@ -126,8 +117,6 @@ enough such that all participants need only to communicate with a small number o
\cite{lian2017d-psgd,Lian2018} and theoretically \cite{neglia2020} that sparse
topologies like rings or grids do not significantly affect the convergence
speed compared to using denser topologies.
-% We also note that full decentralization can also provide benefits in terms of
-% privacy protection \cite{amp_dec}.
In contrast to the IID case however, our experiments demonstrate that \emph{the impact of topology is extremely significant for non-IID data}. This phenomenon is illustrated
in Figure~\ref{fig:iid-vs-non-iid-problem}: we observe that a ring or
@@ -139,14 +128,9 @@ that, unlike in centralized FL
happens even when nodes perform a single local update before averaging the
model with their neighbors. In this paper, we address the following question:
-% \textit{Are there regular topologies, i.e. where all nodes have similar or the same number of neighbours, with less connections than a fully-connected graph that retain a similar convergence speed and non-IID behaviour?}
-%\textit{Are there sparse topologies with similar convergence speed as the fully connected graph under a large number of participants with local class bias?}
\textit{Can we design sparse topologies with convergence
speed similar to the one obtained in a fully connected network under
a large number of participants with local class bias?}
-%AMK: do we talk about local class bias or noniidness?
\begin{figure}[t]
\centering
@@ -187,7 +171,6 @@ model with their neighbors. In this paper, we address the following question:
\label{fig:iid-vs-non-iid-problem}
\end{figure}
-%Indeed, as we show with the following contributions:
Specifically, we make the following contributions:
(1) We propose D-Cliques, a sparse topology in which nodes are organized in
interconnected cliques, i.e. locally fully-connected sets of nodes, such that
@@ -207,7 +190,6 @@ approach by considering up to 1000-node networks, in contrast to most
previous work on fully decentralized learning that considers only a few tens
of nodes
\cite{tang18a,neglia2020,momentum_noniid,cross_gradient,consensus_distance}.
-%we show that these results hold up to 1000 participants, in contrast to most previous work on fully decentralized algorithms that considers only a few tens of participants \cite{tang18a,more_refs}.
For instance, our results show that using D-Cliques in a 1000-node network
requires 98\% fewer edges ($18.9$ vs $999$ edges per participant on average),
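The $18.9$ average can be recovered with a quick count, assuming (as the construction described later in the paper suggests) 100 fully-connected cliques of 10 nodes and one inter-clique edge between every pair of cliques; this is a sketch, not a figure taken from the paper:
% Sketch of the average degree; clique size 10 and pairwise
% inter-clique edges are assumptions made explicit in the lead-in.
\begin{equation*}
  \underbrace{(10-1)}_{\text{intra-clique}}
  + \underbrace{\tfrac{2}{1000}\tbinom{100}{2}}_{\text{inter-clique}}
  = 9 + 9.9 = 18.9 \ \text{edges per node, vs $999$ when fully connected.}
\end{equation*}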
@@ -226,21 +208,13 @@ implement momentum. We present the results of our extensive experimental
study in Section~\ref{section:non-clustered}. We review some related work in
Section~\ref{section:related-work}, and conclude with promising directions
for future work in Section~\ref{section:conclusion}.
-%When % then explain how to construct D-Cliques and show their benefits (Section~\ref{section:d-cliques}). We show how to further reduce bias with Clique Averaging (Section~\ref{section:clique-averaging}). We then show how to use Clique Averaging to implement momentum (Section~\ref{section:momentum}). Having shown the effectiveness of D-Cliques, we evaluate the importance of clustering (Section~\ref{section:non-clustered}), and full intra-clique connections (Section~\ref{section:intra-clique-connectivity}). Having established the design, we then study how best to scale it (Section~\ref{section:interclique-topologies}). We conclude with a survey of related work (Section~\ref{section:related-work}) and a brief summary of the paper (Section~\ref{section:conclusion}).
-%\footnotetext{This is different from the accuracy of the average model across nodes that is sometimes used once training is completed.}
\section{Problem Statement}
\label{section:problem}
We consider a set $N = \{1, \dots, n \}$ of $n$ nodes seeking to
-collaboratively solve a classification task with $c$ classes.
-% where each node can communicate with its neighbours according to the mixing matrix $W$ in which $W_{ij}$ defines the \textit{weight} of the outgoing connection from node $i$ to $j$. $W_{ij} = 0$ means that there is no connection from node $i$ to $j$ and $W_{ij} > 0$ means there is a connection.
-%AMK:explain the weight
-%Training data is sampled from a global distribution $D$ unknown to the nodes.
-%AMK:Removed the sentence above
-Each node has access to a local dataset that
+collaboratively solve a classification task with $c$ classes. Each node has access to a local dataset that
follows its own local distribution $D_i$. The goal is to find a global model
$x$ that performs well on the union of the local distributions by minimizing
the average training loss:
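The displayed objective itself sits in unchanged context elided from this hunk; a formulation consistent with the surrounding definitions (a sketch, with $F$ denoting the loss on a single sample and $s_i$ a sample drawn from $D_i$, both assumed notation) is:
% Sketch of the global objective; the exact equation is elided
% from this diff, so F and s_i are assumed notation.
\begin{equation*}
  \min_{x} \; \frac{1}{n} \sum_{i=1}^{n}
  \mathbb{E}_{s_i \sim D_i} \left[ F(x; s_i) \right]
\end{equation*}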
@@ -259,12 +233,10 @@ $G(N,E)$ where $\{i,j\}\in E$ denotes an edge (communication channel)
between nodes $i$ and $j$.
\subsection{Training Algorithm}
-%AMK: if we need space this could be a paragraph
In this work, we use the popular Decentralized Stochastic
Gradient Descent algorithm, aka D-SGD~\cite{lian2017d-psgd}. As
shown in Algorithm~\ref{Algorithm:D-PSGD},
-%AMK: can we say why: most popular, most efficient ?
a single iteration of D-SGD at node $i$ consists of sampling a mini-batch
from its local distribution
$D_i$, updating its local model $x_i$ by taking a stochastic gradient descent
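The step itself continues outside this hunk; a sketch of the full D-SGD iteration, assuming step size $\gamma$ and the mixing weights $W_{ij}$ referenced below (notation consistent with the rest of the paper, but assumed here), is:
% Sketch of one D-SGD iteration at node i: a local SGD step
% followed by weighted averaging with the neighbors' models.
\begin{align*}
  x_i^{(t+\frac{1}{2})} &= x_i^{(t)} - \gamma \nabla F\big(x_i^{(t)}; s_i^{(t)}\big),
  \qquad s_i^{(t)} \sim D_i \\
  x_i^{(t+1)} &= \sum\nolimits_{j \in N} W_{ij} \, x_j^{(t+\frac{1}{2})}
\end{align*}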
@@ -307,13 +279,7 @@ symmetric, i.e. $W_{ij} = W_{ji}$~\cite{lian2017d-psgd}.
As demonstrated in Figure~\ref{fig:iid-vs-non-iid-problem}, lifting the
assumption of IID data significantly challenges the learning algorithm. In
this paper, we focus on an \textit{extreme case of local class bias}: we
-consider that each node only has examples
-%examples
-from a single class.
-% Our results should generalize to lesser, and more
-% frequent, cases.
-%AMK: a bit weak can't we say our results generalize....
-%: e.g., if some classes are globally less represented, the position of the nodes with the rarest classes will be significant; and if two local datasets have different number of examples, the examples in the smaller dataset may be visited more often than those in a larger dataset, skewing the optimization process.
+consider that each node only has examples from a single class.
To isolate the effect of local class bias from other potentially compounding
factors, we make the following simplifying assumptions: (1) All classes are
@@ -328,14 +294,10 @@ affects the optimization process on a single node and is therefore not
specific to the decentralized setting; (2) Our results do not exploit specific
positions in the topology; (3) Imbalanced dataset sizes across nodes can be
addressed for instance by appropriately weighting the individual loss
-functions.
-% with less examples could
-% simply skip some rounds until the nodes with more examples catch up.
-Our results can be extended to support additional compounding factors in future work.
+functions. Our results can be extended to support additional compounding factors in future work.
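One natural instantiation of such weighting (a sketch; $m_i$, the size of node $i$'s local dataset, is assumed notation) is to scale each local loss by the node's share of the data, so that the objective matches the empirical risk over the union of all local datasets:
% Sketch: size-proportional weights recover the empirical risk
% over the union of the local datasets; m_i is assumed notation.
\begin{equation*}
  \min_{x} \; \sum_{i=1}^{n} \frac{m_i}{\sum_{j=1}^{n} m_j} \,
  \mathbb{E}_{s_i \sim D_i} \left[ F(x; s_i) \right]
\end{equation*}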
\subsubsection{Experimental setup.}
\label{section:experimental-settings}
-%AMK: I would have put this in the eval section, because I would not have mixed design and eval.
Our main goal is to provide a fair comparison of the convergence speed across
different topologies and algorithmic variations, in order to
@@ -357,12 +319,9 @@ prediction accuracy.
We
use a logistic regression classifier for MNIST, which
provides up to 92.5\% accuracy in the centralized setting.
-% compared to
-% $99\%$ for the state-of-the-art~\cite{mnistWebsite}.
For CIFAR10, we use a Group-Normalized variant of LeNet~\cite{quagmire}, a
deep convolutional network which achieves an accuracy of $72.3\%$ in the
centralized setting.
-% compared to the 99\% achieved by start-of-the-art.
These models are thus reasonably accurate (which is sufficient to
study the effect of the topology) while being sufficiently fast to train in a
fully decentralized setting and simple enough to configure and analyze.
@@ -389,24 +348,12 @@ Finally, we compare our results against an ideal baseline: either a
fully-connected network topology with the same number of nodes or a single IID
node. In both cases, the topology has no effect on
the optimization. For a certain choice of number of nodes and
-mini-batch size, both approaches are equivalent. %ensure a single
-% model is optimized, which therefore removes the effect of the topology. While, both approaches compute an equivalent gradient with the same expectation, we favored using a single IID node for CIFAR10 for the sake of training speed.
+mini-batch size, both approaches are equivalent.
\section{D-Cliques: Creating Locally Representative Cliques}
\label{section:d-cliques}
In this section, we present the design of D-Cliques. To give an intuition of our approach, let us consider the neighborhood of a single node in a grid similar to that of Figure~\ref{fig:grid-IID-vs-non-IID}, represented in Figure~\ref{fig:grid-iid-vs-non-iid-neighbourhood}.
-% where each color represents a class of data.
The colors of a node represent the different classes present in its local
dataset. In the IID setting (Figure~\ref{fig:grid-iid-neighbourhood}), each
node has examples of all classes in equal proportions. In the non-IID setting
@@ -420,7 +367,7 @@ the resulting averaged gradient points in a direction that tends to reduce
the loss across all classes. In contrast, in the non-IID case, only a subset
of classes are
represented in the immediate neighborhood of the node, thus the gradients will
-be biased towards these classes. % more than in the IID case.
+be biased towards these classes.
Importantly, as the distributed averaging algorithm takes several steps to
converge, this variance persists across iterations as the locally computed
gradients are far from the global average.\footnote{It is possible, but
@@ -429,9 +376,6 @@ averaging steps between each gradient step.} This can significantly slow down
convergence speed to the point of making decentralized optimization
impractical.
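To make the bias described above explicit, one can write the direction a node effectively follows after a single gradient-plus-averaging step (a sketch in the notation of Section~\ref{section:problem}, not an equation taken from the paper):
% Sketch: the neighborhood-weighted gradient at node i; in the
% non-IID case its expectation only covers the classes held by
% i and its neighbors.
\begin{equation*}
  \bar{g}_i = \sum\nolimits_{j} W_{ij} \, \nabla F\big(x_j; s_j\big),
  \qquad s_j \sim D_j
\end{equation*}
Since $W_{ij} > 0$ only for neighbors $j$, the expectation of $\bar{g}_i$ reflects only the classes present in the immediate neighborhood.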
-%For an intuition on the effect of local class bias, examine the neighbourhood of a single node in a grid similar to that of Figure~\ref{fig:grid-IID-vs-non-IID}. As illustrated in Figure~\ref{fig:grid-iid-vs-non-iid-neighbourhood}, the color of a node, represented as a circle, corresponds to a different class. In the IID setting (Figure~\ref{fig:grid-iid-neighbourhood}), each node has examples of all classes in equal proportions. In the non-IID setting (Figure~\ref{fig:grid-non-iid-neighbourhood}), each node has examples of only a single class and nodes are distributed randomly in the grid. A single training step, from the point of view of the middle node, is equivalent to sampling a mini-batch five times larger from the union of the local distributions of all illustrated nodes.
\begin{figure}[t]
\centering
\begin{subfigure}[b]{0.25\textwidth}
@@ -495,9 +439,6 @@ number of edges without slowing down the convergence. Furthermore, the degree
of each node in the network remains low and even, making the D-Cliques
topology very well-suited to decentralized federated learning.
-%We centrally generate the topology, which is then tested in a custom simulator. We expect our approach should be straightforward to adapt for a decentralized execution: the presence and relative frequency of global classes could be computed using PushSum~\cite{kempe2003gossip}, and neighbours could be selected with PeerSampling~\cite{jelasity2007gossip}.
\begin{figure}[t]
\centering
@@ -532,13 +473,6 @@ compared to a fully-connected topology. Nonetheless, there is still
significant variance in the accuracy across nodes, which is due to the bias
introduced by inter-clique edges. We address this issue in the next section.
-%The degree of \textit{skew} of local distributions $D_i$, i.e. how much the local distribution deviates from the global distribution on each node, influences the minimal size of cliques.
-%
-%The global distribution of classes, for classification tasks, can be computed from the distribution of class examples on the nodes, with Distributed Averaging (CITE). Given the global distribution of classes, neighbours within cliques can be chosen based on a PeerSampling (CITE) service. Both services can be implemented such that they converge in a logarithmic number of steps compared to the number of nodes. It is therefore possible to obtain this information in a scalable way.
-%
-% In the rest of this paper, we assume these services are available and show that the approach provides a useful convergence speed after the cliques have been formed.
\section{Optimizing with Clique Averaging and Momentum}
\label{section:clique-averaging-momentum}
@@ -673,19 +607,14 @@ with the centralized setting.
\section{Comparative Evaluation and Extensions}
\label{section:non-clustered}
-%AMK: add what is in there
In this section, we first compare D-Cliques to alternative topologies to
confirm the relevance of our main design choices. Then,
we evaluate some extensions of D-Cliques to further reduce the number of
inter-clique connections so as to gracefully scale with the number of
nodes.
-\subsection{Comparing D-Cliques to Other Sparse Topologies} %Non-Clustered
-% Topologies}
-%\label{section:non-clustered}
+\subsection{Comparing D-Cliques to Other Sparse Topologies}
-%We now show, in this section and the next, that the particular structure of D-Cliques is necessary. \label{section:non-clustered}
We demonstrate the advantages of D-Cliques over alternative sparse topologies
that have a similar number of edges. First, we consider topologies in which
the neighbors of each node are selected at random (hence without any clique
@@ -744,15 +673,10 @@ confirms that our clique structure is important to reduce variance
across nodes and improve the convergence. The difference from the previous
experiment seems to be due both to the use of a higher-capacity model and to
the intrinsic characteristics of the datasets.
-% We refer
-% to the appendix for results on MNIST with LeNet.
-% We have tried to use LeNet on
-% MNIST to see if the difference between MNIST and CIFAR10 could be attributed to the capacity difference between the Linear and Convolutional networks, whose optimization may benefit from clustering (see Appendix). The difference is less dramatic than for CIFAR10, so it must be that the dataset also has an impact. The exact nature of it is still an open question.
While the previous experiments suggest that our clique structure is
instrumental in obtaining good performance, one may wonder whether
intra-clique full connectivity is actually necessary.
-%AMK: check sentence above: justify
Figure~\ref{fig:d-cliques-intra-connectivity} shows the convergence speed of
a D-Cliques topology where cliques have been sparsified by randomly
removing 1 or 5 edges per clique (out of 45). Strikingly, both for MNIST and
@@ -785,12 +709,9 @@ proposed with D-Cliques.
\end{subfigure}
\caption{\label{fig:d-cliques-intra-connectivity} Importance of Intra-Clique Full-Connectivity}
-%AMK: how many nodes?
\end{figure}
-%\section{Scaling with Different Inter-Clique Topologies}
\subsection{Scaling up D-Cliques with Sparser Inter-Clique Topologies}
-%with Different Inter-Clique Topologies}
\label{section:interclique-topologies}
@@ -920,13 +841,6 @@ instabilities when run on topologies other than rings. When
the rows and columns of $W$ do not exactly
sum to $1$ (due to finite precision), these small differences get amplified by
the proposed updates and make the algorithm diverge.}
-% non-IID known to be a problem for fully decentralized FL. cf Jelasity paper
-% D2 and other recent papers on modifying updates: Quasi-Global Momentum,
-% Cross-Gradient Aggregation
-% papers using multiple averaging steps
-% also our personalized papers
-% D2 \cite{tang18a}: numerically unstable when $W_{ij}$ rows and columns do not exactly
-% sum to $1$, as the small differences are amplified in a positive feedback loop. More work is therefore required on the algorithm to make it usable with a wider variety of topologies. In comparison, D-cliques do not modify the SGD algorithm and instead simply removes some neighbor contributions that would otherwise bias the direction of the gradient. D-Cliques with D-PSGD are therefore as tolerant to ill-conditioned $W_{ij}$ matrices as regular D-PSGD in an IID setting.
In contrast, D-Cliques focuses on the design of a sparse topology which is
able to compensate for the effect of non-IID data and scales to large
networks. We do not modify the simple
......@@ -935,13 +849,6 @@ algorithm \cite{lian2017d-psgd} beyond removing some neighbor
contributions
that otherwise bias the gradient direction.
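For concreteness, here is a sketch of what such an unbiased gradient step might look like, under the assumption that node $i$ averages gradient contributions only over its clique $C_i$ (the exact Clique Averaging rule is given in a section elided from this diff):
% Assumed form of the debiased gradient: contributions are averaged
% within the clique C_i only, so inter-clique neighbors do not skew it.
\begin{equation*}
  g_i^{(t)} = \frac{1}{|C_i|} \sum\nolimits_{j \in C_i}
  \nabla F\big(x_j^{(t)}; s_j^{(t)}\big)
\end{equation*}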
-% An originality of our approach is to focus on the effect of topology
-% level without significantly changing the original simple and efficient D-SGD
-% algorithm \cite{lian2017d-psgd}. Other work to mitigate the effect of non-IID
-% data on decentralized algorithms are based on performing modified updates (eg
-% with variance reduction) or multiple averaging steps.
\paragraph{Impact of topology in fully decentralized FL.} It is well
known
that the choice of network topology can affect the
@@ -1000,26 +907,6 @@ covariate shift or feature distribution skew \cite{kairouz2019advances}, for
which local density estimates could be used as a basis to construct cliques that
approximately recover the global distribution.
-%\section{Future Work}
-%\begin{itemize}
-% \item Non-uniform Class Representation
-% \item End-to-End Wall-Clock Training Time, including Clique Formation
-% \item Comparison to Shuffling Data in a Data Center
-% \item Behaviour in the Presence of Churn
-% \item Relaxing Clique Connectivity: Randomly choose a subset of clique neighbours to compute average gradient.
-%\end{itemize}
-\section{Contribution Statement}
-\label{section:contribution-statement}
-The following authors, listed in alphabetical order, have made the following contributions to this paper:
-\begin{itemize}
-\item \textbf{Aur\'elien Bellet}: Problem Identification; Conceptualization; Literature Review; Experiment Design; Result Analysis; Writing - Original Draft (Intro, Related Work), Organization, Revisions
-\item \textbf{Anne-Marie Kermarrec}: Supervision; Conceptualization (Smallworld Suggestion); Project Administration; Funding Acquisition; Writing Organization and Revisions
-\item \textbf{Erick Lavoie}: Conceptualization (D-Clique Insight); Investigation; Experiment Design; Software Implementation (Simulator, Logging Infrastructure, Reporting); Result Analysis; Writing - Original Draft (Rest) and Revisions
-\end{itemize}
\section{Acknowledgments}
\label{section:acknowledgement}