%\title{D-Cliques: Topology can compensate NonIIDness in Decentralized Federated Learning}
\title{D-Cliques: Compensating Non-IIDness in Decentralized Federated Learning with Topology}
%
\titlerunning{D-Cliques}
% If the paper title is too long for the running head, you can set
...
...
Machine learning is currently shifting from a \emph{centralized}
paradigm, in which models are trained on data located on a single machine or
in a data center, to \emph{decentralized} ones.
Indeed, such a paradigm matches the natural distribution of data, which is collected by several independent parties (hospitals, companies, personal devices, etc.), and models are trained directly on the participants' devices.
Federated Learning (FL) allows a set
of data owners to collaboratively train machine learning models
on their joint
data while keeping it where it has been produced. Not only does this avoid the costs of moving data, but it also mitigates privacy and confidentiality concerns~\cite{kairouz2019advances}.
Yet, such a data distribution challenges learning systems, as local datasets reflect the usage and production patterns specific to each participant: they are
\emph{not} independent and identically distributed
(non-IID). More specifically, the relative frequency of different classes of examples may significantly vary
across local datasets \cite{quagmire}.
Federated learning algorithms can be classified into two categories depending
on the underlying network topology they run on. In server-based FL, the network is organized according to a star topology: a central server orchestrates the training process by
iteratively aggregating model updates received from the participants
(\emph{clients}) and sending them back the aggregated model \cite{mcmahan2016communication}. In contrast,
fully decentralized FL algorithms operate over an arbitrary topology where
participants communicate only with their direct neighbors
in the underlying communication graph. A classic example of such algorithms is Decentralized
SGD (D-SGD) \cite{lian2017d-psgd}, in which participants alternate between
local SGD updates and model averaging with neighboring nodes.
In this paper, we focus on fully decentralized algorithms as they can
generally scale better to the large number of participants seen in ``cross-device''
applications \cite{kairouz2019advances}. Indeed, while a central
server may quickly become a bottleneck as the number of participants increases, the topology used in fully decentralized algorithms can remain sparse
enough such that all participants need only communicate with a small number of other participants, i.e. nodes have a small (constant or logarithmic) degree
...
...
% We also note that full decentralization can also provide benefits in terms of
% privacy protection \cite{amp_dec}.
In contrast to the IID case, however, our experiments demonstrate that \emph{the impact of
topology is extremely significant for non-IID data}. This phenomenon is illustrated
in Figure~\ref{fig:iid-vs-non-iid-problem}: we observe that a ring or
a grid topology clearly jeopardizes the convergence speed when the relative class frequencies of the local distributions differ from those of the global distribution, i.e. when nodes have a \textit{local class bias}. We also observe that, unlike in centralized FL
\cite{kairouz2019advances,scaffold,quagmire}, this
happens even when nodes perform a single local update before averaging the
model with their neighbors. In this paper, we address the following question:
% \textit{Are there regular topologies, i.e. where all nodes have similar or the same number of neighbours, with less connections than a fully-connected graph that retain a similar convergence speed and non-IID behaviour?}
\textit{Can we design sparse topologies with convergence
speed similar to that of a fully connected communication network, for a large number of participants with local class bias?}
%AMK: do we talk about local class bias or noniidness?
%Indeed, as we show with the following contributions:
In this paper, we make the following contributions:
(1) We propose D-Cliques, a sparse topology in which nodes are organized in cliques, i.e. locally fully-connected sets of nodes, such that the joint data distribution of each clique is representative of the global (IID) distribution; (2) We propose Clique Averaging, a modified version of the standard D-SGD algorithm which decouples gradient averaging, used for optimizing local models, from distributed averaging, used to ensure all models converge, thereby reducing the bias introduced by inter-clique connections; (3) We show how Clique Averaging can be used to implement unbiased momentum, which would otherwise be detrimental in a non-IID setting; (4) We demonstrate through an extensive experimental study that our approach removes the effect of local class bias on both the MNIST~\cite{mnistWebsite} and CIFAR10~\cite{krizhevsky2009learning} datasets, with a linear model and a deep convolutional network respectively; (5) Finally, we demonstrate the scalability of our approach by considering networks of up to 1000 nodes, in contrast to most previous work on fully decentralized algorithms that considers only a few tens of participants~\cite{tang18a,more_refs}.
%we show that these results hold up to 1000 participants, in contrast to most previous work on fully decentralized algorithms that considers only a few tens of participants \cite{tang18a,more_refs}.
For instance, our results show that using D-Cliques in a 1000-node network requires 98\% fewer edges ($18.9$ vs $999$ edges per participant on average), thereby yielding a 96\% reduction in the total number of required messages (37.8 messages per round per node on average instead of 999), to obtain a similar convergence speed as a fully-connected topology. Furthermore, an additional 22\% improvement (14.5 edges per node on average instead of 18.9) is possible when using a small-world inter-clique topology, with further potential gains at larger scales because of its linear-logarithmic scaling.
The rest of this paper is organized as follows. We first present the problem statement and our methodology (Section~\ref{section:problem}). The D-Cliques design is presented in Section~\ref{section:d-cliques} along with its benefits. In Section~\ref{section:clique-averaging}, we show how to further reduce bias with Clique Averaging and how to use it to implement momentum. We present the results of our extensive experimental study in Section~\ref{section:non-clustered}. Related work is surveyed in Section~\ref{section:related-work} before we conclude in Section~\ref{section:conclusion}.
%When then explain how to construct D-Cliques and show their benefits (Section~\ref{section:d-cliques}). We show how to further reduce bias with Clique Averaging (Section~\ref{section:clique-averaging}). We then show how to use Clique Averaging to implement momentum (Section~\ref{section:momentum}). Having shown the effectiveness of D-Cliques, we evaluate the importance of clustering (Section~\ref{section:non-clustered}), and full intra-clique connections (Section~\ref{section:intra-clique-connectivity}). Having established the design, we then study how best to scale it (Section~\ref{section:interclique-topologies}). We conclude with a survey of related work (Section~\ref{section:related-work}) and a brief summary of the paper (Section~\ref{section:conclusion}).
\begin{figure}
\centering
...
...
\label{section:problem}
We consider a set of $n$ nodes $N = \{1, \dots, n\}$ where each node can communicate with its neighbours, as defined by the mixing matrix $W$ in which $W_{ij}$ is the \textit{weight} of the outgoing connection from node $i$ to node $j$: $W_{ij}=0$ means that there is no connection from node $i$ to node $j$, while $W_{ij} > 0$ means there is one.
%AMK:explain the weight
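As a toy illustration of this notation, consider three nodes arranged in a line, where nodes $1$ and $3$ are not directly connected. One possible mixing matrix, chosen here to be symmetric and doubly stochastic as required by the learning algorithm presented below, is
\[
W = \begin{pmatrix} \frac{2}{3} & \frac{1}{3} & 0 \\ \frac{1}{3} & \frac{1}{3} & \frac{1}{3} \\ 0 & \frac{1}{3} & \frac{2}{3} \end{pmatrix},
\]
where the zero entries encode the absence of an edge between nodes $1$ and $3$.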
We assume that each node has access to an arbitrary partition of the training samples that follows its own local distribution $D_i$; the global distribution from which these samples are drawn, denoted by $D$, is unknown to the nodes. Nodes cooperate to converge on a global model $M$ that performs well on $D$ by minimizing the average training loss over local models:
...
on node $i$, and $\mathds{E}_{s_i \sim D_i} F_i(x_i;s_i)$ denotes the
expected value of $F_i$ on a random sample $s_i$ drawn from $D_i$.
\subsection{Learning Algorithm}
%AMK: if we need space this could be a paragraph
We use Decentralized-Parallel Stochastic Gradient Descent, also known as D-PSGD~\cite{lian2017d-psgd}, illustrated in Algorithm~\ref{Algorithm:D-PSGD}.
%AMK: can we say why: most popular, most efficient ?
A single step consists of sampling a mini-batch from the local distribution $D_i$, computing and applying a stochastic gradient descent (SGD) step on that mini-batch, and averaging the model with those of its neighbours. Both the outgoing and incoming weights of $W$ must sum to 1, i.e. $W$ is doubly stochastic ($\sum_{j \in N} W_{ij}=1$ and $\sum_{j \in N} W_{ji}=1$), and communication is symmetric, i.e. $W_{ij}= W_{ji}$.
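To make the procedure concrete, the following self-contained Python sketch simulates D-PSGD on a toy least-squares problem over a 4-node ring. It is a hypothetical illustration with made-up data, not the simulator used in our experiments.
\begin{verbatim}
# Toy simulation of D-PSGD (sketch): 4 nodes in a ring, least-squares loss.
import numpy as np

rng = np.random.default_rng(0)
n, d, steps, lr = 4, 5, 200, 0.1

# Ring topology with uniform (symmetric, doubly stochastic) weights.
W = np.zeros((n, n))
for i in range(n):
    W[i, (i - 1) % n] = W[i, (i + 1) % n] = 1.0 / 3.0
    W[i, i] = 1.0 / 3.0  # implicit self-weight

# Local data (A_i, b_i) and local model x_i for each node.
A = [rng.normal(size=(20, d)) for _ in range(n)]
b = [A[i] @ rng.normal(size=d) for i in range(n)]
x = np.zeros((n, d))

for _ in range(steps):
    # (1) one local SGD step on a mini-batch drawn from the local data
    for i in range(n):
        idx = rng.choice(20, size=5, replace=False)
        grad = A[i][idx].T @ (A[i][idx] @ x[i] - b[i][idx]) / len(idx)
        x[i] = x[i] - lr * grad
    # (2) average the local model with the neighbours' models
    x = W @ x

print("max disagreement between nodes:", np.abs(x - x.mean(axis=0)).max())
\end{verbatim}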
\begin{algorithm}[h]
\caption{D-PSGD, Node $i$}
...
...
\subsubsection{Non-IID Assumptions}
\label{section:non-iid-assumptions}
As demonstrated in Figure~\ref{fig:iid-vs-non-iid-problem}, removing the assumption of IID data significantly challenges the learning algorithm. In this paper, we focus on an \textit{extreme case of local class bias} and consider that each node only has samples
of a single class. Our results should generalize to less extreme, and more frequent, cases.
%AMK: a bit weak can't we say our results generalize....
%: e.g., if some classes are globally less represented, the position of the nodes with the rarest classes will be significant; and if two local datasets have different number of examples, the examples in the smaller dataset may be visited more often than those in a larger dataset, skewing the optimization process.
To isolate the effect of local class bias from other potentially compounding factors, we make the following assumptions: (1) All classes are equally represented in the global dataset, by randomly removing examples from the larger classes if necessary; (2) All classes are represented on the same number of nodes; (3) All nodes have the same number of examples.
These assumptions are reasonable because: (1) Global class imbalance equally affects the optimization process on a single node and is therefore not specific to a decentralized setting; (2) Our results do not exploit specific positions in the topology; (3) Nodes with fewer examples could simply skip some rounds until the nodes with more examples catch up. Our results can therefore be extended to support additional compounding factors in future work.
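For illustration, the following Python sketch (a hypothetical helper, not the code used in our experiments) builds a partition satisfying these assumptions from an array of class labels.
\begin{verbatim}
# Sketch: extreme non-IID partition in which every node receives examples
# of a single class, classes are globally balanced, and all nodes receive
# the same number of examples.
import numpy as np

def single_class_partition(labels, n_nodes, seed=0):
    rng = np.random.default_rng(seed)
    classes = np.unique(labels)
    assert n_nodes % len(classes) == 0      # same number of nodes per class
    nodes_per_class = n_nodes // len(classes)
    # equalize class sizes by dropping examples from the larger classes
    per_class = min(np.sum(labels == c) for c in classes)
    per_node = per_class // nodes_per_class
    shards = []
    for c in classes:
        idx = rng.permutation(np.flatnonzero(labels == c))
        shards.extend(np.split(idx[:per_node * nodes_per_class], nodes_per_class))
    return shards                            # shards[i]: indices given to node i

labels = np.repeat(np.arange(10), 600)       # e.g. 10 balanced classes
shards = single_class_partition(labels, n_nodes=100)
assert all(len(s) == len(shards[0]) for s in shards)
\end{verbatim}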
\subsubsection{Experimental Settings}
\label{section:experimental-settings}
%AMK: I would have put this in the evaluation section, as I would not have mixed design and evaluation.
We focus on fairly comparing the convergence speed of different topologies and algorithm variations, to show that our approach can remove much of the effect of local class bias.
...
...
For CIFAR10, we additionally use a momentum of $0.9$, but we do not use momentum on MNIST as it has limited impact on a linear model.
We evaluate 100- and 1000-node networks by creating multiple models in memory and simulating the exchange of messages between nodes. We compare our results against either a fully-connected network topology with the same number of nodes or a single IID node. In both cases, a single model is effectively optimized, which removes the effect of the topology.
While both approaches compute an equivalent gradient with the same expectation, we favored using a single IID node for CIFAR10 for the sake of training speed.
In this section, we present the design of D-Cliques. To give an intuition of our approach, let us consider the neighbourhood of a single node in a grid similar to that of Figure~\ref{fig:grid-IID-vs-non-IID}, represented in Figure~\ref{fig:grid-iid-vs-non-iid-neighbourhood}, where each color represents a class of data.
The colors of a node, represented as a circle, correspond to the different classes it hosts locally. In the IID setting (Figure~\ref{fig:grid-iid-neighbourhood}), each node has examples of all classes in equal proportions. In the non-IID setting (Figure~\ref{fig:grid-non-iid-neighbourhood}), each node has examples of only a single class and nodes are distributed randomly in the grid. A single training step, from the point of view of the middle node, is then equivalent to sampling a mini-batch five times larger from the union of the local distributions of all illustrated nodes.
%For an intuition on the effect of local class bias, examine the neighbourhood of a single node in a grid similar to that of Figure~\ref{fig:grid-IID-vs-non-IID}. As illustrated in Figure~\ref{fig:grid-iid-vs-non-iid-neighbourhood}, the color of a node, represented as a circle, corresponds to a different class. In the IID setting (Figure~\ref{fig:grid-iid-neighbourhood}), each node has examples of all classes in equal proportions. In the non-IID setting (Figure~\ref{fig:grid-non-iid-neighbourhood}), each node has examples of only a single class and nodes are distributed randomly in the grid. A single training step, from the point of view of the middle node, is equivalent to sampling a mini-batch five times larger from the union of the local distributions of all illustrated nodes.
\begin{figure}
...
...
\label{fig:grid-iid-vs-non-iid-neighbourhood}
\end{figure}
In the IID case, since gradients are computed from examples of all classes, the resulting average gradient points in a direction that lowers the loss for all classes. However, in the non-IID case, not all classes are represented in the immediate neighbourhood, and nodes therefore diverge from one another according to the classes they host.
Moreover, as the distributed averaging algorithm takes several steps to converge, this variance persists between steps because the computed gradients are far from the global average.\footnote{It is possible, but impractical, to compensate with enough additional averaging steps.} This can significantly slow down convergence to the point of making parallel optimization impractical.
In D-Cliques, we address the issue of non-IIDness by carefully designing the underlying network topology, which is composed of \textit{cliques} and \textit{inter-clique connections}:
\begin{itemize}
\item D-Cliques recovers a balanced representation of classes, similar to that of the IID case, by modifying the topology such that each node is part of a \textit{clique} with neighbours representing all classes.
\item To ensure all cliques converge, \textit{inter-clique connections} are introduced, established directly between nodes that are part of cliques.
\end{itemize}
Because the joint local distribution $D_{\textit{clique}}=\sum_{i \in\textit{clique}} D_i$ of each clique is representative of the global distribution, a sparse topology can be used between cliques, significantly reducing the total number of edges required to obtain quick convergence. Moreover, because the number of connections required per node is low and even, this approach is well suited to decentralized federated learning.\footnote{See Algorithm~\ref{Algorithm:D-Clique-Construction} in the Appendix for a set-based formulation of the D-Cliques construction.}
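As an illustration, the following Python sketch builds the cliques under the assumptions of Section~\ref{section:non-iid-assumptions}, where each node holds a single class and every class is held by the same number of nodes; the node and class identifiers are made up, and the full set-based formulation is Algorithm~\ref{Algorithm:D-Clique-Construction} in the Appendix.
\begin{verbatim}
# Sketch of clique formation: group nodes so that every clique contains
# exactly one node of each class.  Intra-clique (full) and inter-clique
# edges are added afterwards (see the inter-clique topologies below).
import random
from collections import defaultdict

def build_cliques(node_classes, seed=0):
    """node_classes[i] = class held by node i; returns a list of cliques."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for node, c in node_classes.items():
        by_class[c].append(node)
    for nodes in by_class.values():
        rng.shuffle(nodes)
    n_cliques = len(next(iter(by_class.values())))
    return [[by_class[c][k] for c in sorted(by_class)] for k in range(n_cliques)]

# Example: 100 nodes, 10 classes, 10 nodes per class -> 10 cliques of size 10.
node_classes = {i: i % 10 for i in range(100)}
cliques = build_cliques(node_classes)
assert all(len({node_classes[i] for i in clique}) == 10 for clique in cliques)
\end{verbatim}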
Finally, weights are assigned to edges to ensure quick convergence. For this study, we use Metropolis-Hastings weights~\cite{xiao2004fast}, which, while not necessarily optimal, are quick to compute and still provide good convergence speed:
\begin{equation}
W_{ij} = \begin{cases}
\frac{1}{\max(\text{degree}(i), \text{degree}(j)) + 1}&\text{if}~i \neq j, \text{and $\exists$ edge between $i$ and $j$}\\
...
...
\end{cases}
\end{equation}
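For reference, the following Python sketch computes these weights for an arbitrary undirected topology given as an adjacency dictionary; it is a hypothetical helper shown for illustration, and the resulting matrix is symmetric and doubly stochastic, as required by D-PSGD.
\begin{verbatim}
# Sketch: Metropolis-Hastings weights for an undirected topology.
import numpy as np

def metropolis_hastings_weights(adjacency):
    """adjacency[i] = set of the neighbours of node i."""
    n = len(adjacency)
    W = np.zeros((n, n))
    for i in range(n):
        for j in adjacency[i]:
            W[i, j] = 1.0 / (max(len(adjacency[i]), len(adjacency[j])) + 1)
        W[i, i] = 1.0 - W[i].sum()     # remaining mass goes to the self-weight
    return W

# Example: a single clique of 10 fully-connected nodes.
clique = {i: {j for j in range(10) if j != i} for i in range(10)}
W = metropolis_hastings_weights(clique)
assert np.allclose(W.sum(axis=0), 1) and np.allclose(W.sum(axis=1), 1)
\end{verbatim}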
Note that, for the sake of simplicity, we assume in this paper that the topology is generated with global knowledge of the class distribution. Relaxing this assumption is part of future work.
%We centrally generate the topology, which is then tested in a custom simulator. We expect our approach should be straightforward to adapt for a decentralized execution: the presence and relative frequency of global classes could be computed using PushSum~\cite{kempe2003gossip}, and neighbours could be selected with PeerSampling~\cite{jelasity2007gossip}.
\begin{figure}[htbp]
\centering
...
...
% In the rest of this paper, we assume these services are available and show that the approach provides a useful convergence speed after the cliques have been formed.
\section{Optimizing with Clique Averaging}
\label{section:clique-averaging}
In this section, we present Clique Averaging, a feature that further reduces the bias introduced by non-IID data.
%AMK: check
\subsection{Removing Gradient Bias from Inter-Clique Edges}
While limiting the number of inter-clique connections also limits the amount of data traveling on the network, it may introduce some bias, as observed in our experiments, either because of the Metropolis-Hastings edge weights or because some classes are more represented in a neighbourhood. Figure~\ref{fig:connected-cliques-bias} illustrates the problem in the simplest case of two cliques connected by one inter-clique edge, which connects the green node of the left clique with the purple node of the right clique.
Using Metropolis-Hastings weights, node A's implicit self-edge has a weight of $\frac{12}{110}$, while all of A's neighbours have a weight of $\frac{11}{110}$, except the green node connected to B, which has a weight of $\frac{10}{110}$. This weight assignment therefore biases the gradient towards A's purple class and away from the green class. The same analysis holds for all other nodes without inter-clique edges and their respective classes. For node B, all edges, including B's self-edge, have a weight of $\frac{1}{11}$. However, the green class is represented twice, once as a clique neighbour and once at the other end of the inter-clique edge, while all other classes are represented only once. This biases the gradient toward the green class. The combined effect of these two sources of bias is to increase the variance between models after a D-PSGD training step.
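This arithmetic can be checked directly from the Metropolis-Hastings formula; the short Python sketch below reproduces the fractions above for two 10-node cliques joined by a single inter-clique edge (the node numbering is ours and purely illustrative).
\begin{verbatim}
# Check of the edge weights in the two-clique example: nodes 0-9 form the
# left clique, nodes 10-19 the right one; the inter-clique edge is (0, 10).
# Node 1 plays the role of node A, node 10 the role of node B.
from fractions import Fraction

degree = {i: 9 for i in range(20)}
degree[0] = degree[10] = 10        # endpoints of the inter-clique edge

def w(i, j):                       # Metropolis-Hastings weight of edge (i, j)
    return Fraction(1, max(degree[i], degree[j]) + 1)

neighbours_A = [0] + list(range(2, 10))                   # A's clique neighbours
assert w(1, 2) == Fraction(11, 110) and w(1, 0) == Fraction(10, 110)
assert 1 - sum(w(1, j) for j in neighbours_A) == Fraction(12, 110)   # self-edge

neighbours_B = list(range(11, 20)) + [0]                  # includes the green node
assert all(w(10, j) == Fraction(1, 11) for j in neighbours_B)
assert 1 - sum(w(10, j) for j in neighbours_B) == Fraction(1, 11)    # self-edge
\end{verbatim}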
\begin{figure}[htbp]
\centering
...
...
As illustrated in Figure~\ref{fig:d-clique-mnist-clique-avg}, Clique Averaging significantly reduces the variance between nodes and accelerates convergence: the convergence speed is now essentially identical to that obtained when fully connecting all nodes. The tradeoff is a higher messaging cost, double that of D-PSGD without Clique Averaging, and an increased latency for a single training step, which now requires two rounds of messages. Nonetheless, compared to fully connecting all nodes, the total number of messages is reduced by $\approx 80\%$.
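For illustration, the round structure with Clique Averaging can be sketched as follows in Python (an array-based reformulation with hypothetical helper names, not the exact listing of the modified algorithm): gradients are averaged within the clique only, while models are averaged over all neighbours.
\begin{verbatim}
# Sketch of one round of D-PSGD with Clique Averaging.
import numpy as np

def round_with_clique_averaging(x, grads, W, cliques, lr):
    """x: (n, d) models; grads: (n, d) local mini-batch gradients;
    cliques: lists of node ids partitioning the nodes; W: mixing matrix."""
    x_half = np.empty_like(x)
    for clique in cliques:
        # (1) gradient averaging restricted to the clique: all of its nodes
        #     apply the same, class-balanced average gradient
        g = np.mean([grads[j] for j in clique], axis=0)
        for i in clique:
            x_half[i] = x[i] - lr * g
    # (2) model averaging with all neighbours, including inter-clique edges
    return W @ x_half

# Shape-only usage example with placeholder values.
rng = np.random.default_rng(0)
x, grads = rng.normal(size=(20, 3)), rng.normal(size=(20, 3))
cliques = [list(range(10)), list(range(10, 20))]
W = np.full((20, 20), 1.0 / 20)              # placeholder doubly stochastic matrix
x_next = round_with_clique_averaging(x, grads, W, cliques, lr=0.1)
\end{verbatim}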
\subsection{Implementing Momentum with Clique Averaging}
\label{section:momentum}
Quickly training higher-capacity models, such as a deep convolutional network, on harder datasets, such as CIFAR10, usually requires additional optimization techniques. We show here how Clique Averaging (Section~\ref{section:clique-averaging}) easily enables the implementation of optimization techniques that otherwise require IID mini-batches, even in the presence of local class bias.
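A minimal sketch of the resulting update, assuming classical (heavy-ball) momentum applied to the clique-averaged gradient, is given below; the variable names are ours, and the exact update used in our experiments may differ in minor details.
\begin{verbatim}
# Sketch: momentum computed on the unbiased, clique-averaged gradient.
import numpy as np

def momentum_step(x_i, v_i, clique_grads, lr, beta=0.9):
    """clique_grads: mini-batch gradients of all the nodes in i's clique."""
    g = np.mean(clique_grads, axis=0)  # class-balanced gradient (Clique Averaging)
    v_i = beta * v_i + g               # momentum on the unbiased gradient
    return x_i - lr * v_i, v_i
\end{verbatim}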
Using momentum closes the gap, with a slightly lower convergence speed in the first 20 epochs, as illustrated in Figure~\ref{fig:d-cliques-cifar10-momentum-non-iid-clique-avg-effect}.
\section{Comparative Evaluation and Extensions}
\label{section:non-clustered}
%AMK: add what is in there
In this section, we first compare D-Cliques to alternative topologies; we then further evaluate the impact of using Clique Averaging and evaluate several D-Cliques-based extensions.
\subsection{Comparing D-Cliques to Alternatives}
%\label{section:non-clustered}
%We now show, in this section and the next, that the particular structure of D-Cliques is necessary. \label{section:non-clustered}
We now compare D-Cliques against alternative topologies to demonstrate its advantages.
First, we show that similar results may not necessarily be obtained with a similar number of edges chosen at random. We therefore compare D-Cliques, with and without Clique Averaging, to a random topology on 100 nodes chosen such that each node has exactly 10 edges, which is similar to, and even slightly higher than, the 9.9 edges on average of the previous D-Cliques example (Fig.~\ref{fig:d-cliques-figure}). To better understand the effect of clustering, we also compare to a similar random topology where edges are chosen such that each node has neighbours of all possible classes, but without them forming a clique. Finally, we also compare with an analogue of Clique Averaging, in which all nodes de-bias their gradient with that of their neighbours; in this case, since nodes do not form cliques, no two nodes actually compute the same average gradient.
Results for MNIST and CIFAR10 are shown in Figure~\ref{fig:d-cliques-comparison-to-non-clustered-topologies}. For MNIST, a random topology has higher variance and lower convergence speed than D-Cliques, with or without Clique Averaging. However, a random topology with enforced diversity performs as well as, and even slightly better than, D-Cliques without Clique Averaging. Surprisingly, a random topology with unbiased gradient performs worse than without, but only marginally, so this difference does not seem significant. Nonetheless, the D-Cliques topology with Clique Averaging performs better than any of the random topologies, so clustering does seem to have a small but significant effect in this case.
...
...
For CIFAR10, the result is more dramatic, as Clique Averaging is critical for convergence (with momentum). All random topologies fail to converge, except when combining both node diversity and unbiased gradients, but in any case D-Cliques with Clique Averaging converges significantly faster. This suggests that clustering helps reduce the variance between nodes and therefore improves convergence speed. We also tried LeNet on MNIST to see whether the difference between MNIST and CIFAR10 could be attributed to the capacity difference between the linear and convolutional networks, whose optimization may benefit from clustering (see Appendix). The difference is less dramatic than for CIFAR10, so the dataset itself must also have an impact; its exact nature remains an open question.
\subsection{Importance of Intra-Clique Full Connectivity}
\label{section:intra-clique-connectivity}
Intra-clique full connectivity is also necessary.
%AMK: check sentence above: justify
Figure~\ref{fig:d-cliques-intra-connectivity} shows the convergence speed of D-Cliques with respectively 1 and 5 edges randomly removed out of 45 (2 and 10 out of 90 if counting both directions separately), as well as with and without Clique Averaging (removing edges results in a biased average gradient within cliques). In all cases, both for MNIST and CIFAR10, removing edges has a significant effect on the convergence speed. In the case of CIFAR10, it also negates the benefits of D-Cliques.
\begin{figure}[htbp]
\centering
...
...
\end{subfigure}
\caption{\label{fig:d-cliques-intra-connectivity} Importance of Intra-Clique Full-Connectivity}
%AMK: how many nodes?
\end{figure}
%\section{Scaling with Different Inter-Clique Topologies}
\subsection{Scaling with D-Cliques Extensions}
\label{section:interclique-topologies}
We finally evaluate the effect of the inter-clique topology on convergence speed for a larger network of 1000 nodes. We compare the scalability and convergence speed of several D-Cliques variants, which all use $O(nc)$ edges to create cliques as a foundation, where $n$ is the number of nodes and $c$ is the size of a clique.
First, the scheme that uses the fewest (almost\footnote{A path uses one less edge, but converges significantly slower and is therefore never used in practice.}) extra edges is a \textit{ring}. A ring adds $\frac{n}{c}$ inter-clique edges and therefore scales linearly in $O(n)$.
Second, we introduce a scheme that scales linearly while providing a logarithmic bound on the average shortest number of hops between nodes, which we call \textit{fractal}. In this scheme, as nodes are added, cliques are assembled into larger groups of $c$ cliques that are connected internally with one edge per pair of cliques, but with only one edge between pairs of larger groups. The scheme is recursive such that $c$ groups themselves form a larger group at the next level up. This scheme results in at most $c$ edges per node if edges are evenly distributed, and therefore also scales linearly in the number of nodes.
Third, we propose to connect cliques according to a small-world-like~\cite{watts2000small} topology applied to a ring~\cite{stoica2003chord}. In this scheme, cliques are first arranged in a ring. Each clique then adds symmetric edges, both clockwise and counter-clockwise on the ring, to the $ns$ closest cliques in sets of cliques that are exponentially bigger the further away they are on the ring.\footnote{See Algorithm~\ref{Algorithm:Smallworld} in the Appendix for a detailed listing.} This ensures good clustering with other cliques that are close on the ring, while still keeping the average shortest path small. This scheme adds $2(ns)\log(\frac{n}{c})$ inter-clique edges and therefore grows in the order of $O(n + \log(n))$ with the number of nodes.
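One possible reading of this construction is sketched below in Python; it is only an illustrative interpretation of the text above (the exact grouping of offsets is our guess), and the exact construction is given by Algorithm~\ref{Algorithm:Smallworld} in the Appendix.
\begin{verbatim}
# Sketch: small-world inter-clique edges over a ring of cliques.  Each
# clique connects, clockwise and counter-clockwise, to the ns closest
# cliques of each exponentially growing offset range.
def smallworld_clique_edges(n_cliques, ns=2):
    edges = set()
    for c in range(n_cliques):
        base = 1
        while base <= n_cliques // 2:
            for k in range(ns):
                offset = base + k
                if offset <= n_cliques // 2:
                    edges.add(frozenset({c, (c + offset) % n_cliques}))
                    edges.add(frozenset({c, (c - offset) % n_cliques}))
            base *= 2
    return edges

# Example: 100 cliques (e.g. 1000 nodes with cliques of size 10).
print(len(smallworld_clique_edges(100)))
\end{verbatim}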
Finally, we also fully connect cliques together, which bounds the average shortest path to $2$ between any pair of nodes. This adds $\frac{n}{c}(\frac{n}{c}-1)$ edges, which scales quadratically in the number of nodes, in $O(\frac{n^2}{c^2})$, which can be significant at larger scales when $n$ is large compared to $c$.
...
...
\section{Conclusion}
\label{section:conclusion}
We have proposed D-Cliques, a sparse topology that recovers the convergence speed and non-IID compensating behaviour of a fully-connected topology in the presence of local class bias. D-Cliques are based on assembling cliques of diverse nodes such that their joint local distribution is representative of the global distribution, essentially recovering IID-ness locally. Cliques are joined in a sparse inter-clique topology such that they quickly converge to the same model. Within cliques, Clique Averaging can be used to remove the non-IID bias in gradient computation by averaging gradients only with the other nodes of the clique. Clique Averaging can in turn be used to implement unbiased momentum and recover the convergence speed usually only possible with IID mini-batches. We have shown that the clustering of D-Cliques and the full connectivity within cliques are critical in obtaining these results. Finally, we have evaluated different inter-clique topologies with 1000 nodes. While they all provide a significant reduction in the number of edges compared to fully connecting all nodes, a small-world approach that scales in $O(n + \log(n))$ in the number of nodes seems to be the most advantageous compromise between scalability and convergence speed. The D-Cliques approach therefore seems promising to reduce bandwidth usage on FL servers and to implement fully decentralized alternatives in a wider range of applications. As future work, we plan to make the topology construction itself decentralized: for instance, the presence and relative frequency of global classes could be computed using PushSum~\cite{kempe2003gossip}, and neighbours could be selected with PeerSampling~\cite{jelasity2007gossip}.