In this section, we introduce D-Cliques, a topology
designed to compensate for data heterogeneity. We also present
modifications of D-SGD that leverage properties of the proposed
topology and make it possible to implement a successful momentum scheme.
\subsection{Intuition}
To give the intuition behind
our approach, let us consider the neighborhood of a single node in a grid
topology similar to that of Figure~\ref{fig:grid-IID-vs-non-IID}, shown
in Figure~\ref{fig:grid-iid-vs-non-iid-neighbourhood}.
Nodes are distributed randomly in the grid and the colors of a node represent
the proportion of each class in its local dataset. In the homogeneous
setting, the label distribution is the same across
nodes: in the example shown in Figure~\ref{fig:grid-iid-neighbourhood}, all classes
are represented in equal proportions. This is not the case in the
heterogeneous setting: in the
extreme case of label distribution skew shown in
Figure~\ref{fig:grid-non-iid-neighbourhood}, each
node actually holds examples of a single class.
From the point of view of the center node, a single training step of D-SGD is
equivalent to sampling a mini-batch five times larger from the union of the
local distributions of all illustrated nodes.
In the homogeneous case, since gradients are computed from examples of all
classes,
the resulting averaged gradient points in a direction that tends to reduce
the loss across all classes. In contrast, in the heterogeneous case, only a
subset of classes is
represented in the immediate neighborhood of the node, so the gradients will
be biased towards these classes.
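
As an illustrative sketch, under simplifying assumptions that are not part of the description above (uniform averaging weights, identical parameters $\theta$ at all five nodes, and a common mini-batch size $b$), let $\mathcal{N}$ denote the center node together with its four grid neighbors, let $B_j$ be the mini-batch sampled at node $j$, and let $F(\theta; x)$ be the loss on example $x$; this notation is introduced here for illustration only. The averaged gradient can then be written as
\[
\frac{1}{5}\sum_{j \in \mathcal{N}} \frac{1}{b}\sum_{x \in B_j} \nabla F(\theta; x)
= \frac{1}{5b}\sum_{j \in \mathcal{N}}\sum_{x \in B_j} \nabla F(\theta; x),
\]
which is the gradient of a single mini-batch of size $5b$ drawn from the five local distributions. In the extreme label skew case, each of the five nodes contributes a single class, so this mini-batch covers at most $5$ classes, whatever the total number of classes.
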
Importantly, as the distributed averaging algorithm takes several steps to
converge, this variance persists across iterations as the locally computed
gradients are far from the global average.\footnote{One could perform a
sufficiently large number of
averaging steps between consecutive gradient steps, but this is too costly in
practice.} This can significantly slow down
convergence to the point of making decentralized optimization
Second, to ensure global consensus and convergence,
\textit{inter-clique connections}
are introduced by connecting a small number of node pairs that belong to
different cliques. In the following, we introduce up to one inter-clique
connection per node such that each clique has exactly one
edge with every other clique; see Figure~\ref{fig:d-cliques-figure} for the
corresponding D-Cliques network in the case of $n=100$ nodes and $L=10$
classes. We will explore sparser inter-clique topologies in
Section~\ref{section:interclique-topologies}.
So far, we have used a fully-connected inter-clique topology for D-Cliques,
which has the advantage of bounding the
\textit{path length}\footnote{The \textit{path length} is the number of edges on the shortest path between two nodes.} to $3$ between any pair of nodes. This choice requires
$\frac{n}{c}\left(\frac{n}{c}-1\right)$ inter-clique edges, which scales quadratically
in the number of nodes $n$ for a given clique size $c$.\footnote{We consider \textit{directed} edges in the analysis: the number of undirected edges is half as large, which does not affect the asymptotic behavior.} This can become significant at larger scales when $n$ is
large compared to $c$.
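
As a worked example, assume a clique size of $c=10$ (an illustrative value, chosen here only for the arithmetic). The number of directed inter-clique edges is then
\[
n=100:\ \frac{n}{c}\left(\frac{n}{c}-1\right) = 10 \times 9 = 90,
\qquad
n=1000:\ \frac{n}{c}\left(\frac{n}{c}-1\right) = 100 \times 99 = 9900,
\]
so, under this assumption, multiplying the number of nodes by $10$ multiplies the number of inter-clique edges by $110$.
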
We first measure the convergence speed of inter-clique topologies whose number of edges scales linearly with the number of nodes. Among those, the \textit{ring} has (almost) the fewest possible edges: it
uses $\frac{2n}{c}$ inter-clique edges, but its average path length between nodes
also scales linearly.
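
For illustration, assume $n=1000$ and $c=10$ (values chosen here for the arithmetic, not taken from the experiments), which gives $N = n/c = 100$ cliques. Counting only hops between cliques, and using the standard average distance between two distinct vertices of a cycle with an even number $N$ of vertices, we obtain
\[
\frac{2n}{c} = \frac{2 \times 1000}{10} = 200,
\qquad
\frac{N^2}{4(N-1)} = \frac{100^2}{4 \times 99} \approx 25.3 .
\]
The ring thus needs only $200$ directed inter-clique edges, but two cliques are on average about $N/4$ hops apart, a distance that doubles whenever $n$ doubles for a fixed $c$.
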
We also consider another topology, which we call \textit{fractal}, that provides a
logarithmic
bound on the average path length. In this hierarchical scheme,
cliques are assembled in larger groups of $c$ cliques that are connected internally with one edge per
pair of cliques, but with only one edge between pairs of larger groups. The
topology is built recursively such that $c$ groups will themselves form a
larger group at the next level up. This results in at most $c$ edges per node
if edges are evenly distributed: i.e., each group within the same level adds
at most $c-1$ edges to other groups, leaving one node per group with $c-1$
edges that can receive an additional edge to connect with other groups at the next level.
Since each node has at most $c$ edges, $n$ nodes have at most $nc$ edges. Therefore,
the number of edges in this fractal scheme indeed scales linearly in the number of nodes.
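
The inter-clique edges of this scheme can also be counted directly, as a sketch under two assumptions that are made here for simplicity: $n/c = c^K$ for some integer $K$ (the number of levels), and exactly one undirected edge per pair of sub-groups inside every group. Level $k \in \{1,\dots,K\}$ then contains $n/c^{k+1}$ groups, each contributing $\frac{c(c-1)}{2}$ undirected edges, so that
\[
\sum_{k=1}^{K} \frac{n}{c^{k+1}} \cdot \frac{c(c-1)}{2}
= \frac{n(c-1)}{2} \sum_{k=1}^{K} c^{-k}
= \frac{n}{2}\left(1 - c^{-K}\right)
= \frac{n-c}{2} .
\]
For instance, with the illustrative values $n=1000$ and $c=10$ (so $K=2$), this gives $450 + 45 = 495$ undirected inter-clique edges, i.e. $990$ directed ones, which is consistent with at most one inter-clique edge per node.
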
Second, we look at another scheme
in which the number of edges scales in a near, but not quite, linear fashion.
We propose to connect cliques according to a
small-world-like topology~\cite{watts2000small} applied on top of a
ring~\cite{stoica2003chord}. In this scheme, cliques are first arranged in a
ring. Then each clique adds symmetric edges, both clockwise and
counter-clockwise on the ring, with the $m$ closest cliques in sets of
cliques that are exponentially bigger the further they are on the ring (see
Algorithm~\ref{Algorithm:Smallworld} in the appendix for
details on the construction). This ensures a good connectivity with other
cliques that are close on the ring, while still keeping the average
path length small. This scheme uses $2m\,\frac{n}{c}\log\left(\frac{n}{c}\right)$ inter-clique edges and
therefore grows as $O(n\log(n))$ with the number of nodes.
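
As a rough numerical illustration, assume $n=1000$, $c=10$, $m=2$, and a base-$2$ logarithm; all four choices are assumptions made here, since the base of the logarithm is not specified in the text. The edge count then evaluates to
\[
2m\,\frac{n}{c}\log_2\left(\frac{n}{c}\right)
= 2 \times 2 \times 100 \times \log_2(100)
\approx 400 \times 6.64
\approx 2.7 \times 10^{3} .
\]
For fixed $c$ and $m$, the expression equals $\frac{2m}{c}\, n \log\left(\frac{n}{c}\right)$, which is where the $O(n\log(n))$ growth comes from.
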
\subsection{Optimizing with Clique Averaging and Momentum}
\label{section:clique-averaging-momentum}
In this section, we present Clique Averaging. This feature, when added to D-SGD,
...
...
removes the bias caused by the inter-clique edges of
D-Cliques. We also show how it can be used to successfully implement momentum
for non-IID data.
\subsubsection{Clique Averaging: Debiasing Gradients from Inter-Clique Edges}
\label{section:clique-averaging}
While limiting the number of inter-clique connections reduces the
...
...
averaging step as in the original version.
\end{algorithm}
\subsubsection{Implementing Momentum with Clique Averaging}
\label{section:momentum}
Efficiently training high capacity models usually requires additional
...
...
It then suffices to modify the original gradient step to use momentum: