Commit 0c409353 authored by Erick Lavoie

Started refactoring of experiments section

parent 5323f5e1
% !TEX root = main.tex
\appendix
\section{Detailed Algorithms}
We present a more detailed and precise explanation of the algorithm to establish a small-world
inter-clique topology (Algorithm~\ref{Algorithm:Smallworld}).
\subsection{Small-world Inter-clique Topology}
Algorithm~\ref{Algorithm:Smallworld} instantiates the function \textit{interconnect} with a
small-world inter-clique topology as described in Section~\ref{section:interclique-topologies}. It adds a
linear number of inter-clique edges by first arranging cliques on a ring. It then adds a logarithmic number of ``finger'' edges to other cliques on the ring, chosen such that a constant number of edges is added per set, over sets that grow exponentially bigger the further away they are on the ring. ``Finger'' edges are added symmetrically on both sides of the ring, to the cliques in each set that are closest to a given clique.
% \centering
% \includegraphics[width=0.48\textwidth]{figures/d-cliques-mnist-scaling-fully-connected-cst-bsz}
% \caption{FCC: Constant Batch-Size}
% \end{figure}
\section{Additional Experiments with Extreme Node Skew}
\label{app:extreme-local-skew}
In this section, we present additional results for experiments similar to those of Section~\ref{section:evaluation}, but in the presence of
\textit{extreme local class bias}: we consider that each node only has examples from a single class. This extreme partitioning case provides an upper bound on the effect of label distribution skew, suggesting that D-Cliques should perform similarly or better in less extreme cases, as long as a small enough average skew can be obtained in all cliques. In turn, this helps provide insights on why D-Cliques works well, and quantifies the loss in convergence speed
that may result from construction algorithms that generate cliques with higher skew.
\subsection{Non-IID assumptions.}
\label{section:non-iid-assumptions}
To isolate the effect of local class bias from other potentially compounding
factors, we make the following simplifying assumptions: (1) All classes are
equally represented in the global dataset; (2) All classes are represented on
the same number of nodes; (3) All nodes have the same number of examples.
While less realistic than the assumptions used in Section~\ref{section:evaluation},
these assumptions are still reasonable because: (1) Global class imbalance equally
affects the optimization process on a single node and is therefore not
specific to the decentralized setting; (2) Our results do not exploit specific
positions in the topology; (3) Imbalanced dataset sizes across nodes can be
addressed for instance by appropriately weighting the individual loss
functions.
These assumptions actually make the construction of cliques easier, since they
make it possible to build cliques with zero skew, as shown in
Section~\ref{section:ideal-cliques}.
\subsection{Constructing Ideal Cliques}
\label{section:ideal-cliques}
Algorithm~\ref{Algorithm:D-Clique-Construction} shows the overall approach
for constructing a D-Cliques topology under the assumptions of Section~\ref{section:non-iid-assumptions}.\footnote{An IID
version of D-Cliques, in which each node has an equal number of examples of
all classes, can be implemented by picking $\#L$ nodes per clique at random.}
It expects the following inputs: $L$, the set of all classes present in the global distribution $D = \bigcup_{i \in N} D_i$; $N$, the set of all nodes; a function $classes(S)$, which, given a subset $S$ of nodes in $N$, returns the set of classes in their joint local distributions ($D_S = \bigcup_{i \in S} D_i$); a function $intraconnect(DC)$, which, given $DC$, a set of cliques (a set of sets of nodes), creates a set of edges ($\{\{i,j\}, \dots \}$) connecting all nodes within each clique to one another; a function $interconnect(DC)$, which, given a set of cliques, creates a set of edges ($\{\{i,j\}, \dots \}$) connecting nodes belonging to different cliques; and a function $weights(E)$, which, given a set of edges, returns the weight matrix $W_{ij}$. Algorithm~\ref{Algorithm:D-Clique-Construction} returns both $W_{ij}$, for use in D-SGD (Algorithm~\ref{Algorithm:D-PSGD} and~\ref{Algorithm:Clique-Unbiased-D-PSGD}), and $DC$, for use with Clique Averaging (Algorithm~\ref{Algorithm:Clique-Unbiased-D-PSGD}).
\begin{algorithm}[h]
\caption{D-Cliques Construction}
\label{Algorithm:D-Clique-Construction}
\begin{algorithmic}[1]
\STATE \textbf{Require:} set of classes globally present $L$,
\STATE~~ set of all nodes $N = \{ 1, 2, \dots, n \}$,
\STATE~~ fn $\textit{classes}(S)$ that returns the classes present in a subset of nodes $S$,
\STATE~~ fn $\textit{intraconnect}(DC)$ that returns edges intraconnecting cliques of $DC$,
\STATE~~ fn $\textit{interconnect}(DC)$ that returns edges interconnecting cliques of $DC$ (Sec.~\ref{section:interclique-topologies})
\STATE~~ fn $\textit{weights}(E)$ that assigns weights to edges in $E$
\STATE $R \leftarrow \{ n~\text{for}~n \in N \}$ \COMMENT{Remaining nodes}
\STATE $DC \leftarrow \emptyset$ \COMMENT{D-Cliques}
\STATE $\textit{C} \leftarrow \emptyset$ \COMMENT{Current Clique}
\WHILE{$R \neq \emptyset$}
\STATE $n \leftarrow \text{pick}~1~\text{from}~\{ m \in R ~|~ \textit{classes}(\{m\}) \not\subseteq \textit{classes}(\textit{C}) \}$
\STATE $R \leftarrow R \setminus \{ n \}$
\STATE $C \leftarrow C \cup \{ n \}$
\IF{$\textit{classes}(C) = L$}
\STATE $DC \leftarrow DC \cup \{ C \}$
\STATE $C \leftarrow \emptyset$
\ENDIF
\ENDWHILE
\RETURN $(weights(\textit{intraconnect}(DC) \cup \textit{interconnect}(DC)), DC)$
\end{algorithmic}
\end{algorithm}
The implementation builds one clique at a time by adding nodes of different
classes until all classes of the global distribution are represented. Cliques
are built sequentially until all nodes are part of a clique.
Because all classes are represented on an equal number of nodes, all cliques
will have nodes of all classes. Furthermore, since each node has examples
of a single class, a valid assignment can always be found greedily.
After cliques are created, edges are added and weights are assigned,
using the corresponding input functions.
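The greedy construction above can be sketched as follows. This is a minimal illustration under the stated assumptions (one class per node, all classes on the same number of nodes); the function and variable names are ours, not taken from the paper's implementation:

```python
import random
from collections import defaultdict

def build_cliques(node_classes, num_classes):
    """Greedily group nodes into cliques that each cover all classes.

    node_classes: dict mapping node id -> its single class label.
    Assumes every class is present on the same number of nodes.
    """
    # Bucket remaining nodes by class so a missing class can be filled directly.
    by_class = defaultdict(list)
    for node, cls in node_classes.items():
        by_class[cls].append(node)

    cliques, clique, covered = [], [], set()
    remaining = len(node_classes)
    while remaining > 0:
        # Pick a node whose class is not yet represented in the current clique.
        cls = random.choice([c for c in by_class if c not in covered and by_class[c]])
        clique.append(by_class[cls].pop())
        covered.add(cls)
        remaining -= 1
        if len(covered) == num_classes:  # clique now covers every class
            cliques.append(clique)
            clique, covered = [], set()
    return cliques
```

Because each class appears on the same number of nodes and each closed clique consumes exactly one node per class, the class buckets deplete uniformly and a candidate node always exists, which is why the greedy choice never gets stuck.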
\subsection{Evaluation}
\label{section:ideal-cliques-evaluation}
It then suffices to modify the original gradient step to use momentum:
\begin{equation}
\theta_i^{(k-\frac{1}{2})} \leftarrow \theta_i^{(k-1)} - \gamma v_i^{(k)}.
\end{equation}
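The momentum update can be sketched as a single local step, where $v$ is the momentum buffer $v_i^{(k)} \leftarrow m v_i^{(k-1)} + g_i^{(k)}$ followed by the gradient step above. This is a minimal illustration; the names and hyperparameter values are placeholders, not those of the paper's experiments:

```python
import numpy as np

def local_step_with_momentum(theta, v, grad, gamma=0.1, m=0.9):
    """One local update: v <- m*v + grad, then theta <- theta - gamma*v."""
    v = m * v + grad           # update the momentum buffer v_i^(k)
    theta = theta - gamma * v  # gradient step using the momentum buffer
    return theta, v
```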
\section{Scaling the Interclique Topology}
\label{section:interclique-topologies}
So far, we have used a fully-connected inter-clique topology for D-Cliques,
which has the advantage of bounding the
\textit{path length}\footnote{The \textit{path length} is the number of edges on the shortest path between two nodes.} to $3$ between any pair of nodes. This choice requires $\frac{n}{c}(\frac{n}{c} - 1)$ inter-clique edges, which scales quadratically
in the number of nodes $n$ for a given clique size $c$.\footnote{We consider \textit{directed} edges in the analysis: the number of undirected edges is half and does not affect the asymptotic behavior.} This can become significant at larger scales when $n$ is
large compared to $c$.
We first measure the convergence speed of inter-clique topologies whose number of edges scales linearly with the number of nodes. Among those, the \textit{ring} has close to the fewest possible edges: it
uses $\frac{2n}{c}$ inter-clique edges, but its average path length between nodes
also scales linearly.
We also consider another topology, which we call \textit{fractal}, that provides a
logarithmic
bound on the average path length. In this hierarchical scheme,
cliques are assembled in larger groups of $c$ cliques that are connected internally with one edge per
pair of cliques, but with only one edge between pairs of larger groups. The
topology is built recursively such that $c$ groups will themselves form a
larger group at the next level up. This results in at most $c$ edges per node
if edges are evenly distributed: i.e., each group within the same level adds
at most $c-1$ edges to other groups, leaving one node per group with $c-1$
edges that can receive an additional edge to connect with other groups at the next level.
Since nodes have at most $c$ edges, $n$ nodes have at most $nc$ edges, therefore
the number of edges in this fractal scheme indeed scales linearly in the number of nodes.
Second, we look at another scheme
in which the number of edges scales in a near, but not quite, linear fashion.
We propose to connect cliques according to a
small-world-like topology~\cite{watts2000small} applied on top of a
ring~\cite{stoica2003chord}. In this scheme, cliques are first arranged in a
ring. Each clique then adds symmetric edges, both clockwise and
counter-clockwise on the ring, to the $m$ closest cliques in sets of
cliques that grow exponentially bigger the further they are on the ring (see
Algorithm~\ref{Algorithm:Smallworld} in the appendix for
details on the construction). This ensures good connectivity with
cliques that are close on the ring, while still keeping the average
path length small. This scheme uses $2m\frac{n}{c}\log(\frac{n}{c})$ inter-clique edges and
therefore grows in the order of $O(n\log(n))$ with the number of nodes.
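As a rough illustration of this scheme (a sketch following the description above, not the exact Algorithm~\ref{Algorithm:Smallworld}; the function name is ours), the following arranges clique indices on a ring and, for each exponentially growing offset, adds $m$ finger edges symmetrically in both directions:

```python
def smallworld_interconnect(num_cliques, m=1):
    """Return a set of undirected inter-clique edges {i, j} on clique indices."""
    edges = set()
    # Ring edges between consecutive cliques.
    for i in range(num_cliques):
        edges.add(frozenset((i, (i + 1) % num_cliques)))
    # "Finger" edges: for exponentially growing offsets, connect to the m
    # closest cliques at that distance, clockwise and counter-clockwise.
    for i in range(num_cliques):
        offset = 2
        while offset < num_cliques // 2:
            for k in range(m):
                edges.add(frozenset((i, (i + offset + k) % num_cliques)))
                edges.add(frozenset((i, (i - offset - k) % num_cliques)))
            offset *= 2
    return edges
```

Each clique gets $O(\log \frac{n}{c})$ fingers, so the total number of inter-clique edges grows as $O(\frac{n}{c}\log\frac{n}{c})$, consistent with the $O(n\log n)$ bound above.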
% !TEX root = main.tex
\section{Evaluation}
\label{section:evaluation}
%In this section, we first compare D-Cliques to alternative topologies to
%confirm the relevance of our main design choices. Then,
%we evaluate some extensions of D-Cliques to further reduce the number of
%inter-clique connections so as to gracefully scale with the number of
%nodes.
\todo{EL: Revise intro to section}
\subsection{Experimental setup.}
\label{section:experimental-settings}
Our main goal is to provide a fair comparison of the convergence speed across
balanced: we use 45k/50k images of the original training set for training,
5k/50k for validation, and all 10k examples of the test set for measuring
prediction accuracy.
We use the non-IID partitioning scheme proposed for MNIST in the seminal Federated Learning paper of~\cite{mcmahan2016communication}, on both MNIST and CIFAR10:
we sort all training examples by class, split the sorted list into shards of
equal size, and distribute the shards to nodes randomly such that each node receives two shards.
When the number of examples of one class does not divide evenly into shards, as is the case for MNIST, some
shards may contain examples of more than one class, so nodes may have examples
of up to 4 classes. However, most nodes will have examples of 2 classes. The varying number
of classes, as well as the varying distribution of examples within a single node, makes the task
of creating cliques with low skew non-trivial.
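The shard-based partitioning can be sketched as follows; this is a minimal illustration of the scheme described above (function and variable names are ours), dropping any remainder examples that do not fill a shard:

```python
import random

def shard_partition(labels, num_nodes, shards_per_node=2, seed=0):
    """Sort example indices by class, split into equal shards, and deal
    shards_per_node shards to each node at random."""
    order = sorted(range(len(labels)), key=lambda i: labels[i])
    num_shards = num_nodes * shards_per_node
    shard_size = len(labels) // num_shards
    shards = [order[s * shard_size:(s + 1) * shard_size]
              for s in range(num_shards)]
    random.Random(seed).shuffle(shards)
    # Each node receives shards_per_node consecutive shards of the shuffled list.
    return [sum(shards[n * shards_per_node:(n + 1) * shards_per_node], [])
            for n in range(num_nodes)]
```

Because the list is sorted by class, each shard spans at most a few adjacent classes, which reproduces the "mostly 2 classes per node" behavior described above.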
We use a logistic regression classifier for MNIST, which
provides up to 92.5\% accuracy in the centralized setting.
node. In both cases, the topology has no effect on
the optimization. For a certain choice of number of nodes and
mini-batch size, both approaches are equivalent.
\subsection{D-Cliques Match the Convergence Speed of Fully-Connected with a Fraction of the Edges}
\begin{figure}[t]
\centering
compared to a fully-connected topology. Nonetheless, there is still
significant variance in the accuracy across nodes, which is due to the bias
introduced by inter-clique edges. We address this issue in the next section.
\subsection{D-Cliques Converge Faster than Random Graphs}
We demonstrate the advantages of D-Cliques over alternative sparse topologies
that have a similar number of edges. First, we consider topologies in which
Overall, these results show that achieving fast convergence on non-IID
data with sparse topologies requires a very careful design, as we have
proposed with D-Cliques.
\subsection{Cliques built with Greedy Swap Converge Significantly Faster than Random Cliques}
\subsection{Clique Averaging and Momentum are Necessary}
% To regenerate figure, from results/mnist:
% python ../../../../Software/non-iid-topology-simulator/tools/plot_convergence.py fully-connected/all/2021-03-10-09:25:19-CET no-init-no-clique-avg/fully-connected-cliques/all/2021-03-12-11:12:49-CET no-init/fully-connected-cliques/all/2021-03-12-11:12:01-CET --add-min-max --yaxis test-accuracy --labels '100 nodes non-IID fully-connected' '100 nodes non-IID d-cliques w/o clique avg.' '100 nodes d-cliques non-IID w/ clique avg.' --legend 'lower right' --ymin 89 --ymax 92.5 --font-size 13 --save-figure ../../figures/d-clique-mnist-clique-avg.png --linestyles 'solid' 'dashed' 'dotted'
\begin{figure}[t]
\centering
\includegraphics[width=0.35\textwidth]{../figures/d-clique-mnist-clique-avg}
\caption{\label{fig:d-clique-mnist-clique-avg} Effect of Clique Averaging on MNIST. Y-axis starts at 89.}
\end{figure}
As illustrated in Figure~\ref{fig:d-clique-mnist-clique-avg}, Clique Averaging
significantly reduces the variance of models across nodes and accelerates
convergence to reach the same level as the one obtained with a
fully-connected topology. Note that Clique Averaging induces a small
additional cost, as gradients
and models need to be sent in two separate rounds of messages. Nonetheless, compared to fully connecting all nodes, the total number of messages is reduced by $\approx 80\%$.
\begin{figure}[t]
\centering
% To regenerate figure, from results/cifar10
% python ../../../../Software/non-iid-topology-simulator/tools/plot_convergence.py 1-node-iid/all/2021-03-10-13:52:58-CET no-init-no-clique-avg/fully-connected-cliques/all/2021-03-13-18:34:35-CET no-init-no-clique-avg-no-momentum/fully-connected-cliques/all/2021-03-26-13:47:35-CET/ --legend 'upper right' --add-min-max --labels '1-node IID w/ momentum' '100 nodes non-IID d-cliques w/ momentum' '100 nodes non-IID d-cliques w/o momentum' --font-size 14 --yaxis test-accuracy --save-figure ../../figures/d-cliques-cifar10-momentum-non-iid-effect.png --ymax 100 --linestyles 'solid' 'dashed' 'dotted'
\begin{subfigure}[b]{0.35\textwidth}
\centering
\includegraphics[width=\textwidth]{../figures/d-cliques-cifar10-momentum-non-iid-effect}
\caption{\label{fig:d-cliques-cifar10-momentum-non-iid-effect} Without Clique Averaging }
\end{subfigure}
\hfill
% To regenerate figure, from results/cifar10
% python ../../../../Software/non-iid-topology-simulator/tools/plot_convergence.py 1-node-iid/all/2021-03-10-13:52:58-CET no-init/fully-connected-cliques/all/2021-03-13-18:32:55-CET --legend 'upper right' --add-min-max --labels '1-node IID w/ momentum' '100 nodes non-IID d-clique w/ momentum' --font-size 14 --yaxis test-accuracy --save-figure ../../figures/d-cliques-cifar10-momentum-non-iid-clique-avg-effect.png --ymax 100 --linestyles 'solid' 'dashed' 'dotted'
\begin{subfigure}[b]{0.35\textwidth}
\centering
\includegraphics[width=\textwidth]{../figures/d-cliques-cifar10-momentum-non-iid-clique-avg-effect}
\caption{\label{fig:d-cliques-cifar10-momentum-non-iid-clique-avg-effect} With Clique Averaging}
\end{subfigure}
\caption{\label{fig:cifar10-momentum} Non-IID Effect of Momentum on CIFAR10 with LeNet}
\end{figure}
As shown in
Figure~\ref{fig:d-cliques-cifar10-momentum-non-iid-clique-avg-effect},
the use of Clique Averaging restores the benefits of momentum and closes the gap
with the centralized setting.
\subsection{Full Intraclique Connectivity is Necessary}
\begin{figure*}[t]
\centering
\caption{\label{fig:d-cliques-intra-connectivity} Importance of Intra-Clique Full-Connectivity}
\end{figure*}
\subsection{D-Cliques Appear to Scale with Sparser Inter-Clique Topologies}
In this last series of experiments, we evaluate the effect of choosing sparser
inter-clique topologies on the convergence speed for a larger network of 1000
nodes. We compare the scalability and convergence speed of the several
D-Cliques variants introduced in Section~\ref{section:interclique-topologies}.
Figure~\ref{fig:d-cliques-cifar10-convolutional} shows the convergence
speed of all sparse inter-clique topologies on MNIST and CIFAR10, compared to the ideal
baseline of a single IID node performing the same number of updates per epoch (representing
show that D-Cliques can nicely scale with the number of nodes.
\caption{\label{fig:d-cliques-cifar10-convolutional} D-Cliques Convergence Speed with 1000 nodes, non-IID, Constant Updates per Epoch, with Different Inter-Clique Topologies.}
\end{figure*}
\subsection{Good Cliques Can Be Constructed Efficiently}
\label{section:cost-cliques}
\subsubsection{Quality of Construction (Skew)}
\subsubsection{Cost of Construction}
\dots \todo{EL: Add plots showing convergence speed in terms of skew vs iteration number, as well as absolute computation time}
\begin{document}
\twocolumn[
\mlsystitle{D-Cliques: Compensating Data Heterogeneity with Topology in Decentralized
Federated Learning}
% It is OKAY to include author information, even for blind