The convergence speed of machine learning models trained with Federated
Learning is significantly affected by non-independent and identically
distributed (non-IID) data partitions, even more so in a fully decentralized
setting without a central server. In this paper, we show that the impact of
\textit{local class bias} can be significantly reduced by carefully designing
the underlying communication topology. We present D-Cliques, a novel topology
that reduces gradient bias by grouping nodes in interconnected cliques such
...
...
network is organized according to a star topology: a central server orchestrates
the training process by iteratively aggregating model updates received from the
participants (\emph{clients}) and sending the aggregated model back to them
\cite{mcmahan2016communication}. In contrast,
fully decentralized FL algorithms operate over an arbitrary network topology
where participants communicate only with their direct neighbors
in the network. A classic example of such algorithms is Decentralized
SGD (D-SGD) \cite{lian2017d-psgd}, in which participants alternate between
local SGD updates and model averaging with neighboring nodes.
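To make this alternation concrete, the following minimal sketch shows one round of D-SGD from the point of view of node $i$. It is an illustration under simplifying assumptions rather than the implementation used in our experiments: \texttt{grad}, \texttt{local\_minibatch} and \texttt{exchange} are hypothetical stand-ins for the gradient oracle, local data access and the communication layer, and each row of the mixing weights is assumed to sum to one over the node's closed neighbourhood.
\begin{verbatim}
def d_sgd_round(i, theta, weights, lr, grad, local_minibatch, exchange):
    # theta: node i's current parameter vector (e.g. a NumPy array).
    # Step 1: local SGD update on a mini-batch from node i's local data.
    theta_half = theta - lr * grad(theta, local_minibatch(i))
    # Step 2: send theta_half to direct neighbours, receive their models,
    # and average with the mixing weights (weights[i][j] > 0 only for j a
    # neighbour of i or j == i, and row i sums to one).
    received = exchange(i, theta_half)  # dict: neighbour j -> its model
    theta_new = weights[i][i] * theta_half
    for j, theta_j in received.items():
        theta_new = theta_new + weights[i][j] * theta_j
    return theta_new
\end{verbatim}
All nodes execute this round in parallel; when the mixing matrix is doubly stochastic, the averaging step preserves the global average of the models.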
...
...
from a single class.
To isolate the effect of local class bias from other potentially compounding
factors, we make the following simplifying assumptions: (1) All classes are
equally represented in the global dataset; (2) All classes are represented on the same number of nodes; (3) All nodes have the same number of samples.
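For concreteness, one way to generate a partition satisfying these three assumptions, under the single-class-per-node setting used in our non-IID experiments, is sketched below; \texttt{samples\_by\_class} is a hypothetical input mapping each class to its (equally sized) list of samples.
\begin{verbatim}
def partition(num_nodes, num_classes, samples_by_class):
    # Assumes num_nodes is a multiple of num_classes and all classes
    # have the same number of samples (assumptions (1) and (2)).
    assert num_nodes % num_classes == 0
    nodes_per_class = num_nodes // num_classes
    local_datasets = []
    for c in range(num_classes):
        shard = len(samples_by_class[c]) // nodes_per_class
        for k in range(nodes_per_class):
            # Every node receives the same number of samples (assumption (3)).
            local_datasets.append(samples_by_class[c][k * shard:(k + 1) * shard])
    return local_datasets  # local_datasets[i] = local data of node i
\end{verbatim}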
We believe that these assumptions are reasonable in the context of our study
because: (1)
...
...
mini-batch size, both approaches are equivalent.
In this section we present the design of D-Cliques. To give an intuition of our approach, let us consider the neighbourhood of a single node in a grid similar to that of Figure~\ref{fig:grid-IID-vs-non-IID}, represented in Figure~\ref{fig:grid-iid-vs-non-iid-neighbourhood}, where each color represents a class of data.
The colors of a node, represented as a circle, correspond to the different classes it hosts locally. In the IID setting (Figure~\ref{fig:grid-iid-neighbourhood}), each node has samples of all classes in equal proportions. In the non-IID setting (Figure~\ref{fig:grid-non-iid-neighbourhood}), each node has samples of only a single class and nodes are distributed randomly in the grid. A single training step, from the point of view of the middle node, is equivalent to sampling a mini-batch five times larger from the union of the local distributions of all illustrated nodes.
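This equivalence can be made explicit. Assume, for illustration only, that all five nodes currently hold the same model $\theta$, that each node $j$ draws a disjoint mini-batch $B_j$ of size $m$ from its local data, and that averaging uses uniform weights over the closed neighbourhood $N(i) \cup \{i\}$ of the middle node $i$. Then
\[
\frac{1}{5} \sum_{j \in N(i) \cup \{i\}} \frac{1}{m} \sum_{x \in B_j} \nabla \ell(x; \theta)
= \frac{1}{5m} \sum_{x \in \bigcup_{j \in N(i) \cup \{i\}} B_j} \nabla \ell(x; \theta),
\]
where $\ell$ denotes the per-example loss: the averaged gradient is exactly the gradient of a single mini-batch of size $5m$ drawn from the union of the five local distributions.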
...
...
\label{fig:grid-iid-vs-non-iid-neighbourhood}
\end{figure}
In the IID case, since gradients are computed from samples of all classes, the resulting average gradient points in a direction that lowers the loss for all classes. However, in the non-IID case, not all classes are represented in the immediate neighbourhood. Nodes therefore diverge from one another according to the classes represented.
In addition, as the distributed averaging algorithm takes several steps to converge, this variance persists between steps because the computed gradients remain far from the global average.\footnote{It is possible, but impractical, to compensate with enough additional averaging steps.} This can significantly slow down convergence, to the point of making parallel optimization impractical.
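The number of averaging steps needed is governed by a standard result on gossip-based averaging, recalled here as background: for a symmetric, doubly stochastic mixing matrix $W$, $t$ averaging steps contract the distance to the global average as
\[
\| W^t x - \bar{x} \mathbf{1} \|_2 \leq \lambda^t \, \| x - \bar{x} \mathbf{1} \|_2,
\]
where $x$ is the vector of initial local values, $\bar{x}$ their average, and $\lambda < 1$ the second largest eigenvalue of $W$ in absolute value. On sparse topologies $\lambda$ is close to $1$, so many averaging steps are required between gradient updates, which is why compensating with additional averaging alone is impractical.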
In D-Cliques, we address the issue of non-IIDness by carefully designing the underlying network topology, composed of \textit{cliques} and \textit{inter-clique connections}:
\begin{itemize}
\item D-Cliques recovers a balanced representation of classes, similar to that of the IID case, by modifying the topology such that each node is part of a \textit{clique} with neighbours representing all classes.
\item To ensure that the models of different cliques also converge to one another, \textit{inter-clique connections} are introduced: edges established directly between nodes belonging to different cliques (see the construction sketch after this list).
\end{itemize}
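As an illustration of this design, the following minimal sketch builds such a topology under the simplifying assumptions stated earlier (here, one class per node and classes spread over equally many nodes). The inputs and the ring-shaped inter-clique pattern are illustrative choices, not necessarily the exact construction evaluated in this paper: \texttt{node\_classes} maps each node to its single class, and \texttt{inter\_clique\_edges} controls how many edges connect consecutive cliques.
\begin{verbatim}
import random
from collections import defaultdict

def build_d_cliques(node_classes, num_classes, inter_clique_edges=1, seed=0):
    # node_classes: dict mapping node id -> its single class in
    # {0, ..., num_classes - 1}; assumes each class is held by the
    # same number of nodes. Returns a set of undirected edges.
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for node, c in node_classes.items():
        by_class[c].append(node)
    for nodes in by_class.values():
        rng.shuffle(nodes)
    # Each clique takes one node per class, so every node's clique
    # neighbourhood covers all classes (balanced representation).
    cliques = list(zip(*(by_class[c] for c in range(num_classes))))

    edges = set()
    for clique in cliques:  # intra-clique: fully connect each clique
        edges.update((a, b) for a in clique for b in clique if a < b)
    # Inter-clique: connect consecutive cliques in a ring so that
    # models can mix across cliques (one simple choice among many).
    if len(cliques) > 1:
        for k, clique in enumerate(cliques):
            nxt = cliques[(k + 1) % len(cliques)]
            for _ in range(inter_clique_edges):
                edges.add(tuple(sorted((rng.choice(clique), rng.choice(nxt)))))
    return edges
\end{verbatim}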