start reorganize sec3-4

5513a7bc · aurelien.bellet · 487fd44c · 5513a7bc · 5513a7bc · 5513a7bc
Commit 5513a7bc authored 3 years ago by aurelien.bellet
--- a/mlsys2022style/d-cliques.tex
+++ b/mlsys2022style/d-cliques.tex
 % !TEX root = main.tex

-\section{D-Cliques: Creating Locally Representative Cliques}
+\section{D-Cliques}
 \label{section:d-cliques}

-In this section, we present the design of D-Cliques. To give an intuition of our approach, let us consider the neighborhood of a single node in a grid similar to that of Figure~\ref{fig:grid-IID-vs-non-IID}, represented on Figure~\ref{fig:grid-iid-vs-non-iid-neighbourhood}.
-The colors of a node represent the different classes present in its local
-dataset. In the IID setting (Figure~\ref{fig:grid-iid-neighbourhood}), each
-node has examples of all classes in equal proportions. In the non-IID setting 
-(Figure~\ref{fig:grid-non-iid-neighbourhood}), each node has examples of only
-a
-single class and nodes are distributed randomly in the grid.
+In this section, we introduce D-Cliques, a topology
+designed to compensate for data heterogeneity. We also present some
+modifications of D-SGD that leverage some properties of the proposed
+topology and allow to implement a successful momentum scheme.

-A single training step, from the point of view of the center node, is equivalent to sampling a mini-batch five times larger from the union of the local distributions of all illustrated nodes.
-In the IID case, since gradients are computed from examples of all classes,
+\subsection{Intuition}
+
+To give the intuition behind
+our approach, let us consider the neighborhood of a single node in a grid
+topology similar to that of Figure~\ref{fig:grid-IID-vs-non-IID}, represented
+on Figure~\ref{fig:grid-iid-vs-non-iid-neighbourhood}.
+Nodes are distributed randomly in the grid and the colors of a node represent
+the proportion of each class in its local dataset. In the homogeneous
+setting, the label distribution is the same across
+nodes: in the example shown in Figure~\ref{fig:grid-iid-neighbourhood}, all classes
+are represented in equal proportions. This is not the case in the
+heterogeneous setting: in the
+extreme case of label distribution skew shown in
+Figure~\ref{fig:grid-non-iid-neighbourhood}, each
+node actually holds examples of a single class.
+
+From the point of view of the center node, a single training step of D-SGD is
+equivalent to sampling a mini-batch five times larger from the union of the
+local distributions of all illustrated nodes.
+In the homogeneous case, since gradients are computed from examples of all
+classes,
 the resulting averaged gradient  points in a direction that tends to reduce
-the loss across all classes. In contrast, in the non-IID case, only a subset
+the loss across all classes. In contrast, in the heterogeneous case, only a
+subset
 of classes are
 represented in the immediate neighborhood of the node, thus the gradients will
 be biased towards these classes.
 Importantly, as the distributed averaging algorithm takes several steps to
 converge, this variance persists across iterations as the locally computed
-gradients are far from the global average.\footnote{It is possible, but
-very costly, to mitigate this by performing a sufficiently large number of
-averaging steps between each gradient step.} This can significantly slow down
+gradients are far from the global average.\footnote{One could perform a
+sufficiently large number of
+averaging steps between each gradient step, but this is too costly in
+practice.} This can significantly slow down
 convergence speed to the point of making decentralized optimization
 impractical.

@@ -31,47 +49,61 @@ impractical.
     \begin{subfigure}[b]{0.18\textwidth}
         \centering
         \includegraphics[width=\textwidth]{../figures/grid-iid-neighbourhood}
-\caption{\label{fig:grid-iid-neighbourhood} IID}
+\caption{\label{fig:grid-iid-neighbourhood} Homogeneous data}
     \end{subfigure}
+     \hspace*{.5cm}
     \begin{subfigure}[b]{0.18\textwidth}
         \centering
         \includegraphics[width=\textwidth]{../figures/grid-non-iid-neighbourhood}
-\caption{\label{fig:grid-non-iid-neighbourhood}  Non-IID}
+\caption{\label{fig:grid-non-iid-neighbourhood}  Heterogeneous data}
     \end{subfigure}
-        \caption{Neighborhood in an IID and non-IID grid.}
+        \caption{Neighborhood in a grid.}
        \label{fig:grid-iid-vs-non-iid-neighbourhood}
 \end{figure}

-In D-Cliques, we address the issues of non-iidness by carefully designing a
-network topology composed of \textit{cliques} and \textit{inter-clique
+In D-Cliques, we address label distribution skew by
+carefully designing a
+network topology composed of \textit{locally representative cliques} and 
+\textit{sparse inter-clique
 connections}.

-First, D-Cliques recover a balanced representation of classes, close to
- that of the IID case, by constructing a topology such that each node $i \in N$ is
- part of a \textit{clique} $C$ such that the clique distribution $D_C =
- \bigcup\limits_{\substack{i \in C}} D_i$ is close to that of the global
- distribution $D = \bigcup\limits_{\substack{i \in N}} D_i$. We measure the
- closeness of $D_C$ to $D$ using its \textit{skew}, i.e. the sum of the
- differences in the probabilities that a sample $(x,y)$ belongs to a given class in $D_C$ and $D$: 
+\subsection{Constructing Locally Representative Cliques}
+
+D-Cliques constructs a topology in which each node is part of a \emph{clique} 
+(i.e., a subset of nodes such that the induced subgraph is fully connected)
+such that the label distribution in each clique is
+close to the global label distribution. Formally, for a label $y$ and a
+clique composed of nodes $C\subseteq N$, we denote by $p_C(y)=
+\frac{1}{|C|}\sum_{i\in C} p_i(y)$ the distribution of $y$ in $C$
+and by $p(y)=\frac{1}{n}\sum_{i\in N} p_i(y)$ its global distribution.
+We measure the \textit{skew} of $C$ by the sum
+of the absolute differences of $p_C(y)$ and $p(y)$:
 \begin{equation}
 \label{eq:skew}
-\begin{split}
        \textit{skew}(C) =\
-    \sum\limits_{\substack{l=1}}^L | p(y = l~|(x,y) \sim D_C) - \\ p(y = l~|
-    (x,y) \sim D) |
-\end{split}
+    \sum_{l=1}^L | p_C(y = l) - p(y = l) |.
 \end{equation}

-Second, to ensure a global consensus and convergence, 
- \textit{inter-clique connections}
- are introduced by connecting a small number of node pairs that are
- part of  different cliques. In the following, we introduce up to one inter-clique 
- connection per node such that each clique has exactly one
-edge with all other cliques, see Figure~\ref{fig:d-cliques-figure} for the
-corresponding D-Cliques network in the case of $n=100$ nodes and $L=10$
-classes. We will explore sparser inter-clique topologies in Section~\ref{section:interclique-topologies}.

-We construct D-Cliques by initializing cliques at random, using at most $M$
+
+% D-Cliques constructs a topology in which each node $i \in N$ is
+% part of a \textit{clique} $C$ such that the clique distribution $D_C =
+% \bigcup\limits_{\substack{i \in C}} D_i$ is close to that of the global
+% distribution $D = \bigcup\limits_{\substack{i \in N}} D_i$. We measure the
+% closeness of $D_C$ to $D$ using its \textit{skew}, i.e. the sum of the
+% differences in the probabilities that a sample $(x,y)$ belongs to a given class in $D_C$ and $D$: 
+% \begin{equation}
+% \label{eq:skew}
+% \begin{split}
+%         \textit{skew}(C) =\
+%     \sum\limits_{\substack{l=1}}^L | p(y = l~|(x,y) \sim D_C) - \\ p(y = l~|
+%     (x,y) \sim D) |
+% \end{split}
+% \end{equation}
+
+To efficiently construct a set of cliques with small skew, we propose
+Greedy-Swap (Algorithm~\ref{Algorithm:D-Clique-Construction}).
+We start by initializing cliques at random, using at most $M$
 nodes to limit the intra-clique communication costs, then we 
 swap nodes between pairs of cliques chosen at random such that the swap
 decreases the skew of that pair but keeps
@@ -84,10 +116,10 @@ knowledge of these distributions, which can easily be obtained by
 decentralized averaging in a pre-processing step. 

 \begin{algorithm}[h]
-   \caption{D-Cliques Construction: Greedy Swap}
-   \label{Algorithm:D-Clique-Construction}
+   \caption{D-Cliques Construction via Greedy Swap}
+   \label{Algorithm:greedy-swap}
   \begin{algorithmic}[1]
-        \STATE \textbf{Require:} Max clique size $M$, Max steps $K$,
+        \STATE \textbf{Require:} Clique size $M$, Max steps $K$,
        \STATE Set of all nodes $N = \{ 1, 2, \dots, n \}$,
        \STATE $\textit{skew}(S)$: skew of subset $S \subseteq N$ compared to the global distribution (Eq.~\ref{eq:skew}), 
        \STATE $\textit{intra}(DC)$: edges within cliques $C \in DC$,
@@ -118,6 +150,41 @@ decentralized averaging in a pre-processing step.
   \end{algorithmic}
 \end{algorithm}

+% \begin{algorithm}[h]
+%    \caption{D-Cliques Construction: Greedy Swap}
+%    \label{Algorithm:D-Clique-Construction}
+%    \begin{algorithmic}[1]
+%         \STATE \textbf{Require:} Max clique size $M$, Max steps $K$,
+%         \STATE Set of all nodes $N = \{ 1, 2, \dots, n \}$,
+%         \STATE $\textit{skew}(S)$: skew of subset $S \subseteq N$ compared to the global distribution (Eq.~\ref{eq:skew}), 
+%         \STATE $\textit{intra}(DC)$: edges within cliques $C \in DC$,
+%         \STATE $\textit{inter}(DC)$: edges between $C_1,C_2 \in DC$ (Sec.~\ref{section:interclique-topologies}),
+%          \STATE $\textit{weights}(E)$: set weights to edges in $E$ (Eq.~\ref{eq:metro}).
+%          \STATE ~~
+%          \STATE $DC \leftarrow []$ \COMMENT{Empty list}
+%          \WHILE {$N \neq \emptyset$}
+%          \STATE $C \leftarrow$ sample $M$ nodes from $N$ at random
+%          \STATE $N \leftarrow N \setminus C$; $DC.append(C)$
+%          \ENDWHILE
+%          \FOR{$k \in \{1, \dots, K\}$}
+%         \STATE $C_1,C_2 \leftarrow$ sample 2 from $DC$ at random
+%         \STATE $\textit{swaps} \leftarrow []$
+%         \FOR{$n_1 \in C_1, n_2 \in C_2$}
+%           \STATE $s \leftarrow skew(C_1) + skew(C_2)$
+%           \STATE $s' \leftarrow \textit{skew}(C_1-n_1+n_2) + \textit{skew}(C_2 -n_2+n_1)$
+%           \IF {$s' < s$}
+%             \STATE \textit{swaps}.append($(n_1, n_2)$)
+%           \ENDIF
+%         \ENDFOR
+%         \IF {\#\textit{swaps} $> 0$}
+%           \STATE $(n_1,n_2) \leftarrow$ sample 1 from $\textit{swaps}$ at random
+%           \STATE $C_1 \leftarrow C_1 - n_1 + n_2; C_2 \leftarrow C_2 - n_2 + n1$
+%         \ENDIF
+%          \ENDFOR
+%         \RETURN $(weights(\textit{intra}(DC) \cup \textit{inter}(DC)), DC)$
+%    \end{algorithmic}
+% \end{algorithm}
+
 The key idea of D-Cliques is that because the clique-level distribution $D_C$
 is representative of the global distribution $D$,
 the local models of nodes across cliques remain rather close. Therefore, a
@@ -126,7 +193,57 @@ number of edges without slowing down the convergence. Furthermore, the degree
 of each node in the network remains low and even, making the D-Cliques
 topology very well-suited to decentralized federated learning. 

-\section{Optimizing with Clique Averaging and Momentum}
+\subsection{Adding Sparse Inter-Clique Connections}
+\label{section:interclique-topologies}
+
+Second, to ensure a global consensus and convergence, 
+ \textit{inter-clique connections}
+ are introduced by connecting a small number of node pairs that are
+ part of  different cliques. In the following, we introduce up to one inter-clique 
+ connection per node such that each clique has exactly one
+edge with all other cliques, see Figure~\ref{fig:d-cliques-figure} for the
+corresponding D-Cliques network in the case of $n=100$ nodes and $L=10$
+classes. We will explore sparser inter-clique topologies in
+Section~\ref{section:interclique-topologies}.
+
+So far, we have used a fully-connected inter-clique topology for D-Cliques,
+which has the advantage of bounding the
+\textit{path length}\footnote{The \textit{path length} is the number of edges on the path with the shortest number of edges between two nodes.} to $3$ between any pair of nodes. This choice requires $
+\frac{n}{c}(\frac{n}{c} - 1)$ inter-clique edges, which scales quadratically
+in the number of nodes $n$ for a given clique size $c$\footnote{We consider \textit{directed} edges in the analysis: the number of undirected edges is half and does not affect asymptotic behavior.}. This can become significant at larger scales when $n$ is
+large compared to $c$.
+
+We first measure the convergence speed of inter-cliques topologies whose number of edges scales linearly with the number of nodes. Among those, the \textit{ring} has the (almost) fewest possible number of edges: it
+uses $\frac{2n}{c}$ inter-clique edges but its average path length between nodes 
+also scales linearly.
+We also consider another topology, which we call \textit{fractal}, that provides a
+logarithmic
+bound on the average path length. In this hierarchical scheme, 
+cliques are assembled in larger groups of $c$ cliques that are connected internally with one edge per
+pair of cliques, but with only one edge between pairs of larger groups. The
+topology is built recursively such that $c$ groups will themselves form a
+larger group at the next level up. This results in at most $c$ edges per node 
+if edges are evenly distributed: i.e., each group within the same level adds 
+at most $c-1$ edges to other groups, leaving one node per group with $c-1$ 
+edges that can receive an additional edge to connect with other groups at the next level.
+Since nodes have at most $c$ edges, $n$ nodes have at most $nc$ edges, therefore
+the number of edges in this fractal scheme indeed scales linearly in the number of nodes.
+
+Second, we look at another scheme 
+in which the number of edges scales in a near, but not quite, linear fashion.
+We propose to connect cliques according to a
+small-world-like topology~\cite{watts2000small} applied on top of a
+ring~\cite{stoica2003chord}. In this scheme, cliques are first arranged in a
+ring. Then each clique adds symmetric edges, both clockwise and
+counter-clockwise on the ring, with the $m$ closest cliques in sets of
+cliques that are exponentially bigger the further they are on the ring (see
+Algorithm~\ref{Algorithm:Smallworld} in the appendix for
+details on the construction). This ensures a good connectivity with other
+cliques that are close on the ring, while still keeping the average
+path length small. This scheme uses $\frac{n}{c}*2(m)\log(\frac{n}{c})$ inter-clique edges and
+therefore grows in the order of $O(n\log(n))$ with the number of nodes.
+
+\subsection{Optimizing with Clique Averaging and Momentum}
 \label{section:clique-averaging-momentum}

 In this section, we present Clique Averaging. This feature, when added to D-SGD,
@@ -134,7 +251,7 @@ removes the bias caused by the inter-cliques edges of
 D-Cliques. We also show how it can be used to successfully implement momentum
 for non-IID data.

-\subsection{Clique Averaging: Debiasing Gradients from Inter-Clique Edges}
+\subsubsection{Clique Averaging: Debiasing Gradients from Inter-Clique Edges}
 \label{section:clique-averaging}

 While limiting the number of inter-clique connections reduces the
@@ -193,7 +310,7 @@ averaging step as in the original version.
 \end{algorithm}


-\subsection{Implementing Momentum with Clique Averaging}
+\subsubsection{Implementing Momentum with Clique Averaging}
 \label{section:momentum}

 Efficiently training high capacity models usually requires additional
@@ -219,42 +336,6 @@ It then suffices to modify the original gradient step to use momentum:
 \theta_i^{(k-\frac{1}{2})} \leftarrow \theta_i^{(k-1)} - \gamma v_i^{(k)}.
 \end{equation}

-\section{Scaling the Interclique Topology}
-\label{section:interclique-topologies}
-
-So far, we have used a fully-connected inter-clique topology for D-Cliques,
-which has the advantage of bounding the
-\textit{path length}\footnote{The \textit{path length} is the number of edges on the path with the shortest number of edges between two nodes.} to $3$ between any pair of nodes. This choice requires $
-\frac{n}{c}(\frac{n}{c} - 1)$ inter-clique edges, which scales quadratically
-in the number of nodes $n$ for a given clique size $c$\footnote{We consider \textit{directed} edges in the analysis: the number of undirected edges is half and does not affect asymptotic behavior.}. This can become significant at larger scales when $n$ is
-large compared to $c$.
-
-We first measure the convergence speed of inter-cliques topologies whose number of edges scales linearly with the number of nodes. Among those, the \textit{ring} has the (almost) fewest possible number of edges: it
-uses $\frac{2n}{c}$ inter-clique edges but its average path length between nodes 
-also scales linearly.
-We also consider another topology, which we call \textit{fractal}, that provides a
-logarithmic
-bound on the average path length. In this hierarchical scheme, 
-cliques are assembled in larger groups of $c$ cliques that are connected internally with one edge per
-pair of cliques, but with only one edge between pairs of larger groups. The
-topology is built recursively such that $c$ groups will themselves form a
-larger group at the next level up. This results in at most $c$ edges per node 
-if edges are evenly distributed: i.e., each group within the same level adds 
-at most $c-1$ edges to other groups, leaving one node per group with $c-1$ 
-edges that can receive an additional edge to connect with other groups at the next level.
-Since nodes have at most $c$ edges, $n$ nodes have at most $nc$ edges, therefore
-the number of edges in this fractal scheme indeed scales linearly in the number of nodes.
+% \section{Scaling the Interclique Topology}
+% \label{section:interclique-topologies}

-Second, we look at another scheme 
-in which the number of edges scales in a near, but not quite, linear fashion.
-We propose to connect cliques according to a
-small-world-like topology~\cite{watts2000small} applied on top of a
-ring~\cite{stoica2003chord}. In this scheme, cliques are first arranged in a
-ring. Then each clique adds symmetric edges, both clockwise and
-counter-clockwise on the ring, with the $m$ closest cliques in sets of
-cliques that are exponentially bigger the further they are on the ring (see
-Algorithm~\ref{Algorithm:Smallworld} in the appendix for
-details on the construction). This ensures a good connectivity with other
-cliques that are close on the ring, while still keeping the average
-path length small. This scheme uses $\frac{n}{c}*2(m)\log(\frac{n}{c})$ inter-clique edges and
-therefore grows in the order of $O(n\log(n))$ with the number of nodes.
--- a/mlsys2022style/intro.tex
+++ b/mlsys2022style/intro.tex
@@ -117,7 +117,7 @@ Specifically, we make the following contributions:
 (1) We propose D-Cliques, a sparse topology in which nodes are organized in
 interconnected cliques, i.e. locally fully-connected sets of nodes, such that
 the joint label distribution of each clique is close to that of the global 
-distribution; (2) We design a greedy algorithm for
+distribution; (2) We design a randomized greedy algorithm for
 constructing such cliques efficiently;
 % in the presence of heterogeneity previously studied
 % in the context of Federated Learning~\cite{mcmahan2016communication};

--- a/mlsys2022style/setting.tex
+++ b/mlsys2022style/setting.tex
@@ -86,7 +86,7 @@ can remove this equation}
   \caption{D-SGD, Node $i$}
   \label{Algorithm:D-PSGD}
   \begin{algorithmic}[1]
-        \STATE \textbf{Require:} initial model parameters $\theta_i^{(0)}$,
+        \STATE \textbf{Require:} initial model $\theta_i^{(0)}$,
        learning rate $\gamma$, mixing weights $W$, mini-batch size $m$,
        number of steps $K$
        \FOR{$k = 1,\ldots, K$}