Commit 5513a7bc authored by aurelien.bellet

start reorganize sec3-4

parent 487fd44c
% !TEX root = main.tex
\section{D-Cliques}
\label{section:d-cliques}
In this section, we introduce D-Cliques, a topology
designed to compensate for data heterogeneity. We also present
modifications of D-SGD that leverage properties of the proposed
topology and make it possible to implement a successful momentum scheme.
\subsection{Intuition}
To give the intuition behind
our approach, let us consider the neighborhood of a single node in a grid
topology similar to that of Figure~\ref{fig:grid-IID-vs-non-IID}, as
represented in Figure~\ref{fig:grid-iid-vs-non-iid-neighbourhood}.
Nodes are distributed randomly in the grid, and the colors of a node represent
the proportions of the classes in its local dataset. In the homogeneous
setting, the label distribution is the same across
nodes: in the example shown in Figure~\ref{fig:grid-iid-neighbourhood}, all classes
are represented in equal proportions. This is not the case in the
heterogeneous setting: in the
extreme case of label distribution skew shown in
Figure~\ref{fig:grid-non-iid-neighbourhood}, each
node holds examples of a single class.
From the point of view of the center node, a single training step of D-SGD is
equivalent to sampling a mini-batch five times larger from the union of the
local distributions of all illustrated nodes.
In the homogeneous case, since gradients are computed from examples of all
classes, the resulting averaged gradient points in a direction that tends to
reduce the loss across all classes. In contrast, in the heterogeneous case,
only a subset of the classes is represented in the immediate neighborhood of
the node, so the gradients are biased towards these classes.
Importantly, as the distributed averaging algorithm takes several steps to
converge, this variance persists across iterations, as the locally computed
gradients remain far from the global average.\footnote{One could perform a
sufficiently large number of averaging steps between each gradient step, but
this is too costly in practice.} This can significantly slow down
convergence, to the point of making decentralized optimization
impractical.
\begin{figure}[htbp]
\centering
\begin{subfigure}[b]{0.18\textwidth}
\centering
\includegraphics[width=\textwidth]{../figures/grid-iid-neighbourhood}
\caption{\label{fig:grid-iid-neighbourhood} Homogeneous data}
\end{subfigure}
\hspace*{.5cm}
\begin{subfigure}[b]{0.18\textwidth}
\centering
\includegraphics[width=\textwidth]{../figures/grid-non-iid-neighbourhood}
\caption{\label{fig:grid-non-iid-neighbourhood} Heterogeneous data}
\end{subfigure}
\caption{Neighborhood in a grid.}
\label{fig:grid-iid-vs-non-iid-neighbourhood}
\end{figure}
In D-Cliques, we address label distribution skew by
carefully designing a
network topology composed of \textit{locally representative cliques} and
\textit{sparse inter-clique connections}.
\subsection{Constructing Locally Representative Cliques}
D-Cliques constructs a topology in which each node is part of a \emph{clique},
i.e., a subset of nodes whose induced subgraph is fully connected,
such that the label distribution in each clique is
close to the global label distribution. Formally, for a label $y$ and a
clique composed of nodes $C\subseteq N$, we denote by $p_C(y)=
\frac{1}{|C|}\sum_{i\in C} p_i(y)$ the distribution of $y$ in $C$
and by $p(y)=\frac{1}{n}\sum_{i\in N} p_i(y)$ its global distribution.
We measure the \textit{skew} of $C$ as the sum
of the absolute differences between $p_C(y)$ and $p(y)$:
\begin{equation}
\label{eq:skew}
\textit{skew}(C) =
\sum_{l=1}^L | p_C(y = l) - p(y = l) |.
\end{equation}
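For instance, consider $L=2$ classes that are equally represented globally,
i.e., $p(y=1)=p(y=2)=0.5$. A clique of two nodes holding only class $1$ and
only class $2$ respectively has $p_C(y=1)=p_C(y=2)=0.5$ and thus
$\textit{skew}(C)=0$, whereas a clique of two nodes that both hold only class
$1$ has $\textit{skew}(C)=|1-0.5|+|0-0.5|=1$.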
To efficiently construct a set of cliques with small skew, we propose
Greedy-Swap (Algorithm~\ref{Algorithm:greedy-swap}).
We start by initializing cliques at random, using at most $M$
nodes per clique to limit the intra-clique communication costs, and then
repeatedly swap nodes between pairs of cliques chosen at random, such that a
swap decreases the skew of the pair while keeping the clique sizes unchanged.
Greedy-Swap only requires
knowledge of the nodes' local label distributions, which can easily be obtained by
decentralized averaging in a pre-processing step.
\begin{algorithm}[h]
\caption{D-Cliques Construction via Greedy Swap}
\label{Algorithm:greedy-swap}
\begin{algorithmic}[1]
\STATE \textbf{Require:} Clique size $M$, Max steps $K$,
\STATE Set of all nodes $N = \{ 1, 2, \dots, n \}$,
\STATE $\textit{skew}(S)$: skew of subset $S \subseteq N$ compared to the global distribution (Eq.~\ref{eq:skew}),
\STATE $\textit{intra}(DC)$: edges within cliques $C \in DC$,
\STATE $\textit{inter}(DC)$: edges between $C_1,C_2 \in DC$ (Sec.~\ref{section:interclique-topologies}),
\STATE $\textit{weights}(E)$: set weights to edges in $E$ (Eq.~\ref{eq:metro}).
\STATE ~~
\STATE $DC \leftarrow []$ \COMMENT{Empty list}
\WHILE {$N \neq \emptyset$}
\STATE $C \leftarrow$ sample $M$ nodes from $N$ at random
\STATE $N \leftarrow N \setminus C$; $DC.append(C)$
\ENDWHILE
\FOR{$k \in \{1, \dots, K\}$}
\STATE $C_1,C_2 \leftarrow$ sample 2 cliques from $DC$ at random
\STATE $\textit{swaps} \leftarrow []$
\FOR{$n_1 \in C_1, n_2 \in C_2$}
\STATE $s \leftarrow \textit{skew}(C_1) + \textit{skew}(C_2)$
\STATE $s' \leftarrow \textit{skew}(C_1 - n_1 + n_2) + \textit{skew}(C_2 - n_2 + n_1)$
\IF {$s' < s$}
\STATE $\textit{swaps}$.append($(n_1, n_2)$)
\ENDIF
\ENDFOR
\IF {\#\textit{swaps} $> 0$}
\STATE $(n_1,n_2) \leftarrow$ sample 1 from $\textit{swaps}$ at random
\STATE $C_1 \leftarrow C_1 - n_1 + n_2$; $C_2 \leftarrow C_2 - n_2 + n_1$
\ENDIF
\ENDFOR
\RETURN $(\textit{weights}(\textit{intra}(DC) \cup \textit{inter}(DC)), DC)$
\end{algorithmic}
\end{algorithm}
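For concreteness, the following Python sketch illustrates Greedy-Swap. It is
an illustration under our own assumptions (in particular, the representation
of label distributions as arrays indexed by node in \texttt{label\_dist}), not
the implementation used in our experiments:
\begin{verbatim}
import random
import numpy as np

def skew(clique, label_dist, global_dist):
    """Skew of a clique: sum of absolute differences between the
    clique's average label distribution and the global one."""
    p_C = np.mean([label_dist[i] for i in clique], axis=0)
    return float(np.abs(p_C - global_dist).sum())

def greedy_swap(nodes, label_dist, M, K, seed=0):
    """Randomly partition `nodes` into cliques of size at most M,
    then repeatedly pick two cliques at random and apply a node
    swap that decreases their total skew (assumes >= 2 cliques)."""
    rng = random.Random(seed)
    global_dist = np.mean([label_dist[i] for i in nodes], axis=0)
    pool = list(nodes)
    rng.shuffle(pool)
    cliques = [pool[i:i + M] for i in range(0, len(pool), M)]
    for _ in range(K):
        a, b = rng.sample(range(len(cliques)), 2)
        s = skew(cliques[a], label_dist, global_dist) \
            + skew(cliques[b], label_dist, global_dist)
        swaps = []
        for n1 in cliques[a]:
            for n2 in cliques[b]:
                c1 = [x for x in cliques[a] if x != n1] + [n2]
                c2 = [x for x in cliques[b] if x != n2] + [n1]
                if skew(c1, label_dist, global_dist) \
                        + skew(c2, label_dist, global_dist) < s:
                    swaps.append((n1, n2))
        if swaps:  # apply one skew-decreasing swap, if any exists
            n1, n2 = rng.choice(swaps)
            cliques[a] = [x for x in cliques[a] if x != n1] + [n2]
            cliques[b] = [x for x in cliques[b] if x != n2] + [n1]
    return cliques
\end{verbatim}
Each candidate swap is evaluated by recomputing the skew of the two modified
cliques, which only involves the label distributions of the nodes concerned.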
The key idea of D-Cliques is that, because the label distribution within each
clique is representative of the global distribution,
the local models of nodes across cliques remain rather close. Therefore,
cliques can be interconnected with a small
number of edges without slowing down the convergence. Furthermore, the degree
of each node in the network remains low and even, making the D-Cliques
topology very well-suited to decentralized federated learning.
\subsection{Adding Sparse Inter-Clique Connections}
\label{section:interclique-topologies}
To ensure global consensus and convergence,
\textit{inter-clique connections}
are introduced by connecting a small number of pairs of nodes that belong
to different cliques. In the simplest scheme, we introduce at most one
inter-clique connection per node, such that each clique has exactly one
edge with every other clique; see Figure~\ref{fig:d-cliques-figure} for the
corresponding D-Cliques network in the case of $n=100$ nodes and $L=10$
classes. We explore sparser inter-clique topologies below.
This fully-connected inter-clique topology has the advantage of bounding the
\textit{path length}\footnote{The \textit{path length} between two nodes is the number of edges on a shortest path between them.} to $3$ between any pair of nodes, but it requires $
\frac{n}{c}(\frac{n}{c} - 1)$ inter-clique edges, which scales quadratically
in the number of nodes $n$ for a given clique size $c$.\footnote{We consider \textit{directed} edges in the analysis: the number of undirected edges is half and does not affect the asymptotic behavior.} This cost can become significant at larger scales, when $n$ is
large compared to $c$.
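For instance, with $n=1000$ nodes and cliques of size $c=10$, there are
$\frac{n}{c}=100$ cliques and the fully-connected inter-clique topology
requires $100 \times 99 = 9900$ directed inter-clique edges, whereas the ring
topology introduced below uses only $\frac{2n}{c}=200$.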
We first consider inter-clique topologies whose number of edges scales linearly with the number of nodes. Among those, the \textit{ring} has almost the fewest edges possible: it
uses $\frac{2n}{c}$ inter-clique edges, but its average path length between nodes
also scales linearly in $n$.
We also consider another topology, which we call \textit{fractal}, that provides a
logarithmic
bound on the average path length. In this hierarchical scheme,
cliques are assembled into larger groups of $c$ cliques that are connected internally with one edge per
pair of cliques, but with only one edge between pairs of larger groups. The
topology is built recursively such that $c$ groups themselves form a
larger group at the next level up. This results in at most $c$ edges per node
if edges are evenly distributed: each group within the same level adds
at most $c-1$ edges to other groups, leaving one node per group with $c-1$
edges that can receive an additional edge to connect with other groups at the next level.
Since nodes have at most $c$ edges, the $n$ nodes have at most $nc$ edges in total, so
the number of edges in this fractal scheme indeed scales linearly in the number of nodes.
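For instance, with cliques of size $c=10$, each clique in a level-one group
of $10$ cliques places one edge on $9$ of its nodes to connect to the other
cliques of its group, leaving one node free to hold the single edge that
connects the group to another group at the next level.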
Finally, we consider a scheme
in which the number of edges scales in a near, but not quite, linear fashion.
We propose to connect cliques according to a
small-world-like topology~\cite{watts2000small} applied on top of a
ring~\cite{stoica2003chord}. In this scheme, cliques are first arranged in a
ring. Each clique then adds symmetric edges, both clockwise and
counter-clockwise on the ring, with the $m$ closest cliques in sets of
cliques that grow exponentially bigger the further away they are on the ring (see
Algorithm~\ref{Algorithm:Smallworld} in the appendix for
details on the construction). This ensures good connectivity with
cliques that are close on the ring, while keeping the average
path length small. This scheme uses $2m\frac{n}{c}\log(\frac{n}{c})$ inter-clique edges and
therefore grows in the order of $O(n\log(n))$ with the number of nodes.
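As an illustration, here is a minimal Python sketch of such a construction,
under our own assumptions about how the exponentially growing distance ranges
are formed (the authoritative construction is
Algorithm~\ref{Algorithm:Smallworld} in the appendix):
\begin{verbatim}
def smallworld_edges(num_cliques, m):
    """Directed inter-clique edges of a small-world-like ring:
    each clique connects to the m closest cliques in each
    exponentially growing distance range [d, 2d), both clockwise
    and counter-clockwise on the ring."""
    edges = set()
    for i in range(num_cliques):
        d = 1
        while d < num_cliques:
            # the m closest cliques within the range [d, 2d)
            for k in range(d, min(d + m, 2 * d, num_cliques)):
                for j in ((i + k) % num_cliques,
                          (i - k) % num_cliques):
                    if j != i:
                        edges.add((i, j))
                        edges.add((j, i))  # symmetric edges
            d *= 2
    return edges
\end{verbatim}
Each clique gets at most $2m$ edges per distance range and there are about
$\log_2(\frac{n}{c})$ ranges, which matches the
$2m\frac{n}{c}\log(\frac{n}{c})$ edge count stated above.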
\subsection{Optimizing with Clique Averaging and Momentum}
\label{section:clique-averaging-momentum}
We now present Clique Averaging. This feature, when added to D-SGD,
removes the bias caused by the inter-clique edges of
D-Cliques. We also show how it can be used to successfully implement momentum
with heterogeneous data.
\subsubsection{Clique Averaging: Debiasing Gradients from Inter-Clique Edges}
\label{section:clique-averaging}
While limiting the number of inter-clique connections reduces the
amount of messages traveling on the network, it also introduces its own bias
in the averaged gradients, as nodes with inter-clique edges mix their updates
with neighbors from other cliques. Clique Averaging removes this bias by
relying only on the gradients of clique members, whose joint label
distribution is close to the global one, while model parameters are still
averaged with all neighbors in the averaging step as in the original version
of D-SGD.
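A minimal Python sketch of one synchronous step under these assumptions
(hypothetical data structures; not the exact implementation) is:
\begin{verbatim}
import numpy as np

def dsgd_step_clique_averaging(theta, grads, cliques, W, gamma):
    """One synchronous D-SGD step with Clique Averaging (sketch).

    theta:   dict node -> current model parameters (np.array)
    grads:   dict node -> mini-batch gradient at theta[node]
    cliques: dict node -> nodes in its clique (incl. itself)
    W:       dict node -> {neighbor: mixing weight}, over ALL
             neighbors (clique and inter-clique, incl. itself)
    """
    half = {}
    for i in theta:
        # Debiased gradient: average over clique members only,
        # whose joint label distribution is close to the global one.
        g = np.mean([grads[j] for j in cliques[i]], axis=0)
        half[i] = theta[i] - gamma * g
    # Parameter averaging is unchanged: it still mixes with all
    # neighbors, including inter-clique ones, as in plain D-SGD.
    return {i: sum(w * half[j] for j, w in W[i].items())
            for i in theta}
\end{verbatim}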
\subsubsection{Implementing Momentum with Clique Averaging}
\label{section:momentum}
Efficiently training high capacity models usually requires additional
optimization techniques such as momentum. Clique Averaging provides a
debiased gradient that can be accumulated in a momentum term $v_i^{(k)}$.
It then suffices to modify the original gradient step to use momentum:
\begin{equation}
\theta_i^{(k-\frac{1}{2})} \leftarrow \theta_i^{(k-1)} - \gamma v_i^{(k)}.
\end{equation}
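A minimal sketch of the corresponding local update, assuming a standard
heavy-ball momentum term computed on the clique-averaged gradient \texttt{g}
from the previous snippet (our illustrative formulation, not necessarily the
exact one used in the paper):
\begin{verbatim}
def momentum_step(theta_i, v_i, g, gamma, m=0.9):
    """Modified gradient step: the momentum variable accumulates
    the debiased, clique-averaged gradient g and replaces it in
    the update of the local model."""
    v_i = m * v_i + g
    return theta_i - gamma * v_i, v_i
\end{verbatim}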
Specifically, we make the following contributions:
(1) We propose D-Cliques, a sparse topology in which nodes are organized in
interconnected cliques, i.e., locally fully-connected sets of nodes, such that
the joint label distribution of each clique is close to that of the global
distribution; (2) We design a randomized greedy algorithm for
constructing such cliques efficiently;
\begin{algorithm}[h]
\caption{D-SGD, Node $i$}
\label{Algorithm:D-PSGD}
\begin{algorithmic}[1]
\STATE \textbf{Require:} initial model $\theta_i^{(0)}$,
learning rate $\gamma$, mixing weights $W$, mini-batch size $m$,
number of steps $K$
\FOR{$k = 1,\ldots, K$}
\STATE $S_i^{(k)} \leftarrow$ mini-batch of $m$ samples drawn from $D_i$
\STATE $\theta_i^{(k-\frac{1}{2})} \leftarrow \theta_i^{(k-1)} - \gamma \nabla F(\theta_i^{(k-1)}; S_i^{(k)})$
\STATE $\theta_i^{(k)} \leftarrow \sum_{j \in N} W_{ji} \theta_j^{(k-\frac{1}{2})}$
\ENDFOR
\end{algorithmic}
\end{algorithm}