Commit 014a9f4d authored by aurelien.bellet

sec 3

parent 4882470f
@@ -262,16 +262,22 @@ In this work, we use the popular Decentralized Stochastic
Gradient Descent algorithm, aka D-SGD~\cite{lian2017d-psgd}. As
shown in Algorithm~\ref{Algorithm:D-PSGD},
%AMK: can we say why: most popular, most efficient ?
a single iteration of D-SGD at node $i$ consists of sampling a mini-batch
from its local distribution $D_i$, updating its local model $x_i$ by taking
a stochastic gradient descent (SGD) step according to this sample, and
performing a weighted average of its local model with those of its
neighbors. This weighted average is defined by a mixing matrix $W$, in
which $W_{ij}$ corresponds to the weight of the outgoing connection from
node $i$ to $j$, and $W_{ij} = 0$ for $\{i,j\} \notin E$. To ensure that
the local models converge on average to a (local) optimum of Problem
\eqref{eq:dist-optimization-problem}, $W$ must be doubly stochastic
($\sum_{j \in N} W_{ij} = 1$ and $\sum_{j \in N} W_{ji} = 1$) and
symmetric, i.e., $W_{ij} = W_{ji}$; see~\cite{lian2017d-psgd}.
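For concreteness, the following Python sketch simulates these two phases on
a toy problem (the ring overlay, least-squares objective, and synthetic
data are illustrative placeholders, not the setting evaluated in this
work):

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
n, d, batch, lr = 16, 10, 32, 0.1   # nodes, features, mini-batch, step size

# Toy local datasets (X_i, y_i) and local models x_i (placeholders).
data = [(rng.normal(size=(200, d)), rng.normal(size=200)) for _ in range(n)]
models = [np.zeros(d) for _ in range(n)]

# Doubly stochastic, symmetric mixing matrix for a simple ring topology.
W = np.zeros((n, n))
for i in range(n):
    W[i, [i, (i - 1) % n, (i + 1) % n]] = 1 / 3

for step in range(100):
    # (1) each node takes a local SGD step on a mini-batch drawn from D_i
    half = []
    for i in range(n):
        X, y = data[i]
        idx = rng.choice(len(y), size=batch, replace=False)
        grad = X[idx].T @ (X[idx] @ models[i] - y[idx]) / batch
        half.append(models[i] - lr * grad)
    # (2) each node averages its model with its neighbors' models using W
    models = [sum(W[i, j] * half[j] for j in range(n) if W[i, j] > 0)
              for i in range(n)]
\end{verbatim}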
\begin{algorithm}[t]
   \caption{D-SGD, Node $i$}
@@ -324,8 +330,9 @@ Our results can be extended to support additional compounding factors in future
\label{section:experimental-settings}
%AMK: I would have put this in the eval section, as I would not have mixed design and eval.
Our main goal is to provide a fair comparison of the convergence speed
across different topologies and algorithmic variations, in order to show
that our approach can remove much of the effect of local class bias.
We experiment with two datasets: MNIST~\cite{mnistWebsite} and
@@ -391,13 +398,30 @@ mini-batch size, both approaches are equivalent. %ensure a single
\section{D-Cliques: Creating Locally Representative Cliques}
\label{section:d-cliques}
In this section, we present the design of D-Cliques. To give an intuition
of our approach, let us consider the neighborhood of a single node in a
grid similar to that of Figure~\ref{fig:grid-IID-vs-non-IID}, represented
in Figure~\ref{fig:grid-iid-vs-non-iid-neighbourhood}.
% where each color represents a class of data.
The colors of a node represent the different classes it holds locally. In
the IID setting (Figure~\ref{fig:grid-iid-neighbourhood}), each node has
samples of all classes in equal proportions. In the non-IID setting
(Figure~\ref{fig:grid-non-iid-neighbourhood}), each node has samples of
only a single class and nodes are distributed randomly in the grid. A
single training step, from the point of view of the center node, is
equivalent to sampling a mini-batch five times larger from the union of
the local distributions of all illustrated nodes.
In the IID case, since gradients are computed from examples of all classes,
the resulting average gradient points in a direction that reduces the
loss across all classes. In contrast, in the non-IID case, only a subset
of classes is represented in the immediate neighborhood of the node, and
the gradients will be biased towards these classes. % more than in the IID case.
Importantly, as the distributed averaging algorithm takes several steps to
converge, this variance persists across iterations as the locally computed
gradients are far from the global average.\footnote{It is possible, but
very costly, to mitigate this by performing a sufficiently large number of
averaging steps between each gradient step.} This can significantly slow
down convergence, to the point of making decentralized optimization
impractical.
%For an intuition on the effect of local class bias, examine the neighbourhood of a single node in a grid similar to that of Figure~\ref{fig:grid-IID-vs-non-IID}. As illustrated in Figure~\ref{fig:grid-iid-vs-non-iid-neighbourhood}, the color of a node, represented as a circle, corresponds to a different class. In the IID setting (Figure~\ref{fig:grid-iid-neighbourhood}), each node has examples of all classes in equal proportions. In the non-IID setting (Figure~\ref{fig:grid-non-iid-neighbourhood}), each node has examples of only a single class and nodes are distributed randomly in the grid. A single training step, from the point of view of the middle node, is equivalent to sampling a mini-batch five times larger from the union of the local distributions of all illustrated nodes.
\begin{figure}[t]
     \centering
     \begin{subfigure}[b]{0.25\textwidth}
         \centering
@@ -409,57 +433,91 @@ The colors of a node, represented as a circle, correspond to the different class
\includegraphics[width=\textwidth]{figures/grid-non-iid-neighbourhood}
         \caption{\label{fig:grid-non-iid-neighbourhood} Non-IID}
     \end{subfigure}
    \caption{Neighborhood in an IID and non-IID Grid.}
    \label{fig:grid-iid-vs-non-iid-neighbourhood}
\end{figure}
In D-Cliques, we address the issue of non-IIDness by carefully designing a
network topology composed of \textit{cliques} and \textit{inter-clique
connections}:
\begin{itemize}
\item D-Cliques recovers a balanced representation of classes, similar to
that of the IID case, by constructing a topology such that each node is
part of a \textit{clique} with neighbors representing all classes.
\item To ensure a global consensus and convergence, \textit{inter-clique
connections} are introduced by connecting a small number of node pairs
that are part of different cliques.
\end{itemize}
In the following, we introduce one inter-clique connection per node, such
that each clique has exactly one edge with each of the other cliques; see
Figure~\ref{fig:d-cliques-figure} for the corresponding D-Cliques network
in the case of $n=100$ nodes and $c=10$ classes. We will explore sparser
inter-clique topologies in Section~\ref{section:interclique-topologies}.
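As an illustration, the following Python sketch builds this
fully-connected-cliques topology; it assumes the simplistic case where
node $i$ holds exactly the class $i \bmod c$, so that consecutive groups
of $c$ nodes form cliques covering all classes (our toy assumption, not a
requirement of Algorithm~\ref{Algorithm:D-Clique-Construction}):

\begin{verbatim}
import itertools

def d_cliques_edges(n, c):
    # Toy assumption: node i holds class i % c, so consecutive groups
    # of c nodes form cliques covering all c classes.
    cliques = [list(range(k, k + c)) for k in range(0, n, c)]
    edges = set()
    # Intra-clique edges: fully connect each clique.
    for clique in cliques:
        edges |= {frozenset(p) for p in itertools.combinations(clique, 2)}
    # Inter-clique edges: one edge between every pair of cliques, spread
    # over distinct nodes so each node gains at most one such edge.
    for (a, ca), (b, cb) in itertools.combinations(enumerate(cliques), 2):
        edges.add(frozenset((ca[b % c], cb[a % c])))
    return edges

print(len(d_cliques_edges(100, 10)))  # 450 intra + 45 inter = 495 edges
\end{verbatim}

With $n=100$ and $c=10$, this yields 495 edges instead of the 4950 of a
fully-connected topology, consistent with the $\approx 90\%$ reduction in
the number of edges reported below.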
The mixing matrix $W$ required by D-SGD is obtained from the above
topology using standard Metropolis-Hastings weights~\cite{xiao2004fast}:
\begin{equation}
  W_{ij} = \begin{cases}
    \frac{1}{\max(\text{degree}(i), \text{degree}(j)) + 1} & \text{if}~i
    \neq j \text{ and } \{i,j\} \in E,\\
    1 - \sum_{j \neq i} W_{ij} & \text{if}~i = j, \\
    0 & \text{otherwise}.
  \end{cases}
\end{equation}
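A direct transcription of these weights in Python (a sketch reusing the
hypothetical \texttt{d\_cliques\_edges} helper from the previous snippet):

\begin{verbatim}
import numpy as np

def metropolis_hastings_weights(n, edges):
    # Build W from an undirected edge set; the result is symmetric and
    # doubly stochastic by construction.
    deg = np.zeros(n, dtype=int)
    for i, j in map(tuple, edges):
        deg[i] += 1
        deg[j] += 1
    W = np.zeros((n, n))
    for i, j in map(tuple, edges):
        W[i, j] = W[j, i] = 1.0 / (max(deg[i], deg[j]) + 1)
    W[np.arange(n), np.arange(n)] = 1.0 - W.sum(axis=1)  # self-weights
    return W

W = metropolis_hastings_weights(100, d_cliques_edges(100, 10))
assert np.allclose(W.sum(axis=0), 1.0) and np.allclose(W.sum(axis=1), 1.0)
\end{verbatim}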
We refer to Algorithm~\ref{Algorithm:D-Clique-Construction} in the
appendix for a formal account of D-Cliques construction. We note that it
only requires the knowledge of the local class distribution at each node.
For the sake of simplicity, we assume that D-Cliques is constructed from
the global knowledge of these distributions, which can easily be obtained
by decentralized averaging in a pre-processing step.
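To illustrate this pre-processing step, the sketch below averages local
class histograms by gossip over a uniform ring overlay (a stand-in
assumption; any connected overlay with a doubly stochastic matrix works,
and sparser expander-like overlays mix much faster than a ring):

\begin{verbatim}
import numpy as np

n, c = 100, 10
# Toy local class counts: node i only holds 200 samples of class i % c.
hist = np.zeros((n, c))
hist[np.arange(n), np.arange(n) % c] = 200.0

# Uniform ring overlay: doubly stochastic and symmetric.
R = np.zeros((n, n))
for i in range(n):
    R[i, [i, (i - 1) % n, (i + 1) % n]] = 1 / 3

for _ in range(10000):  # rings mix slowly; expanders need far fewer rounds
    hist = R @ hist

# Every row now approximates the global class distribution (20 per class).
assert np.allclose(hist, 20.0, atol=1e-2)
\end{verbatim}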
The key idea of D-Cliques is that because the clique-level distribution
$D_{\textit{clique}} = \sum_{i \in \textit{clique}} D_i$ is representative
of the global distribution, the local models of nodes across cliques
remain rather close. Therefore, a sparse inter-clique topology can be
used, significantly reducing the total number of edges without slowing
down the convergence. Furthermore, the degree of each node in the network
remains low and even, making the D-Cliques topology very well-suited to
decentralized federated learning.
%We centrally generate the topology, which is then tested in a custom simulator. We expect our approach should be straightforward to adapt for a decentralized execution: the presence and relative frequency of global classes could be computed using PushSum~\cite{kempe2003gossip}, and neighbours could be selected with PeerSampling~\cite{jelasity2007gossip}.
\begin{figure}[t]
     \centering
     \begin{subfigure}[b]{0.45\textwidth}
         \centering
         \includegraphics[width=\textwidth]{figures/fully-connected-cliques}
         \caption{\label{fig:d-cliques-figure} D-Cliques (fully-connected
         cliques)}
     \end{subfigure}
     \hfill
% To regenerate figure, from results/mnist
% python ../../../learn-topology/tools/plot_convergence.py fully-connected/all/2021-03-10-09:25:19-CET no-init-no-clique-avg/fully-connected-cliques/all/2021-03-12-11:12:49-CET --add-min-max --yaxis test-accuracy --ymin 80 --ymax 92.5 --labels '100 nodes non-IID fully-connected' '100 nodes non-IID d-cliques' --save-figure ../../figures/d-cliques-mnist-vs-fully-connected.png --legend 'lower right' --font-size 16
     \begin{subfigure}[b]{0.54\textwidth}
         \centering
         \includegraphics[width=\textwidth]{figures/d-cliques-mnist-vs-fully-connected.png}
         \caption{\label{fig:d-cliques-example-convergence-speed} Convergence
         Speed on MNIST}
     \end{subfigure}
\caption{\label{fig:d-cliques-example} D-Cliques topology and convergence
speed on MNIST.}
\end{figure}
Figure~\ref{fig:d-cliques-example-convergence-speed} illustrates the
performance of D-Cliques on MNIST with $n=100$ nodes. The convergence
speed is very close to that of a fully-connected topology, and
significantly better than with a ring or a grid (see
Figure~\ref{fig:iid-vs-non-iid-problem}). With 100 nodes, it offers a
reduction of $\approx 90\%$ in the number of edges compared to a
fully-connected topology. Nonetheless, there is still significant
variance in the accuracy across nodes, which we address in the next
section by removing the bias introduced by inter-clique edges.
%The degree of \textit{skew} of local distributions $D_i$, i.e. how much the local distribution deviates from the global distribution on each node, influences the minimal size of cliques.