diff --git a/main.bib b/main.bib
index 52b6cbd1217fbadadd30ceb4e822b83b15957fa2..aa91cd3d7774a88826382a14f0972eb1140702fa 100644
--- a/main.bib
+++ b/main.bib
@@ -688,11 +688,10 @@
 pages={211-252}
 }
 @misc{mnistWebsite,
-title={{THE MNIST DATABASE of handwritten digits}},
+title={{The MNIST database of handwritten digits}},
 author={LeCun, Yann and Cortes, Corinna and Burges, Christopher J.C.},
 year={2020},
-howpublished={\url{http://yann.lecun.com/exdb/mnist/}},
-note={[online, accessed 2020-06-03]}
+howpublished={\url{http://yann.lecun.com/exdb/mnist/}}
 }
 
 @misc{shallue2018measuring,
diff --git a/main.tex b/main.tex
index df7d5772cb8a754ca70657860f53cd76bc2503e2..d0260416b1b2ea3ed20a83867e79455194e41f7b 100644
--- a/main.tex
+++ b/main.tex
@@ -54,7 +54,7 @@ with Topology}
 The convergence speed of machine learning models trained with Federated
 Learning is significantly affected by non-independent and identically
 distributed (non-IID) data partitions, even more so in a fully decentralized
-setting without a central server. In this paper, we show that the impact
+setting without a central server. In this paper, we show that the impact
 of \textit{local class bias} can be significantly reduced by carefully
 designing the underlying communication topology. We present D-Cliques, a
 novel topology that reduces gradient bias by grouping nodes in interconnected cliques such
@@ -110,9 +110,9 @@ network is organized according to a star topology: a central
 server orchestrates iteratively aggregating model updates received from the
 participants (\emph{clients}) and sending them back the aggregated model
 \cite{mcmahan2016communication}. In contrast,
-fully decentralized FL algorithms operate over an arbitrary graph topology
+fully decentralized FL algorithms operate over an arbitrary network topology
 where participants communicate only with their direct neighbors
-in the graph. A classic example of such algorithms is Decentralized
+in the network. A classic example of such algorithms is Decentralized
 SGD (D-SGD) \cite{lian2017d-psgd}, in which participants alternate
 between local SGD updates and model averaging with neighboring nodes.
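To make the one-sentence description of D-SGD above concrete, here is a minimal sketch of a single round. The uniform averaging over each node's closed neighbourhood, the helper name `dsgd_round`, and the fixed learning rate are illustrative assumptions, not the exact update rule of \cite{lian2017d-psgd}, which uses a general doubly-stochastic mixing matrix.

```python
import numpy as np

def dsgd_round(params, neighbors, grads, lr=0.1):
    """One illustrative D-SGD round (sketch, not the paper's exact rule).

    params: dict node -> parameter vector
    neighbors: dict node -> list of neighbor nodes
    grads: dict node -> stochastic gradient at that node's current parameters
    """
    # 1) Local SGD step on each node's own mini-batch gradient.
    updated = {v: params[v] - lr * grads[v] for v in params}
    # 2) Uniform averaging with direct neighbors, including the node itself.
    return {
        v: np.mean([updated[u] for u in [v] + list(neighbors[v])], axis=0)
        for v in params
    }

# Example: 3 nodes on a path 0-1-2, one scalar parameter per node.
params = {0: np.array([0.0]), 1: np.array([1.0]), 2: np.array([2.0])}
neighbors = {0: [1], 1: [0, 2], 2: [1]}
grads = {v: np.array([1.0]) for v in params}
params = dsgd_round(params, neighbors, grads)
```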
@@ -305,7 +305,7 @@ from a single class.
 
 To isolate the effect of local class bias from other potentially compounding
 factors, we make the following simplifying assumptions: (1) All classes are
-equally represented in the global dataset; (2) All classes are represented on the same number of nodes; (3) All nodes have the same number of examples.
+equally represented in the global dataset; (2) All classes are represented on the same number of nodes; (3) All nodes have the same number of samples.
 
 We believe that these assumptions are reasonable in the context of our study
 because: (1)
@@ -391,8 +391,8 @@ mini-batch size, both approaches are equivalent.
 %ensure a single
 
 \section{D-Cliques: Creating Locally Representative Cliques}
 \label{section:d-cliques}
 
-In this section we present the design of D-Cliques. To give an intuition of our approach, let us consider the neighbourhood of a single node in a grid similar to that of Figure~\ref{fig:grid-IID-vs-non-IID}, represented on Figure~\ref{fig:grid-iid-vs-non-iid-neighbourhood} where each color represent a class of data.
-The colors of a node, represented as a circle, correspond to the different classes it hosts locally. In the IID setting (Figure~\ref{fig:grid-iid-neighbourhood}), each node has examples of all classes in equal proportions. In the non-IID setting (Figure~\ref{fig:grid-non-iid-neighbourhood}), each node has examples of only a single class and nodes are distributed randomly in the grid. A single training step, from the point of view of the middle node, is equivalent to sampling a mini-batch five times larger from the union of the local distributions of all illustrated nodes.
+In this section, we present the design of D-Cliques. To give an intuition of our approach, let us consider the neighbourhood of a single node in a grid similar to that of Figure~\ref{fig:grid-IID-vs-non-IID}, represented in Figure~\ref{fig:grid-iid-vs-non-iid-neighbourhood}, where each color represents a class of data.
+The colors of a node, represented as a circle, correspond to the different classes it hosts locally. In the IID setting (Figure~\ref{fig:grid-iid-neighbourhood}), each node has samples of all classes in equal proportions. In the non-IID setting (Figure~\ref{fig:grid-non-iid-neighbourhood}), each node has samples of only a single class and nodes are distributed randomly in the grid. A single training step, from the point of view of the middle node, is equivalent to sampling a mini-batch five times larger from the union of the local distributions of all illustrated nodes.
 %For an intuition on the effect of local class bias, examine the neighbourhood of a single node in a grid similar to that of Figure~\ref{fig:grid-IID-vs-non-IID}. As illustrated in Figure~\ref{fig:grid-iid-vs-non-iid-neighbourhood}, the color of a node, represented as a circle, corresponds to a different class. In the IID setting (Figure~\ref{fig:grid-iid-neighbourhood}), each node has examples of all classes in equal proportions. In the non-IID setting (Figure~\ref{fig:grid-non-iid-neighbourhood}), each node has examples of only a single class and nodes are distributed randomly in the grid. A single training step, from the point of view of the middle node, is equivalent to sampling a mini-batch five times larger from the union of the local distributions of all illustrated nodes.
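The mini-batch equivalence invoked in the added sentence can be spelled out explicitly. The identity below is a sketch assuming the five nodes share the same model $\theta$, draw disjoint local mini-batches $B_1,\dots,B_5$ of equal size $m$, and average their gradients uniformly (notation introduced here for illustration only):
\[
\frac{1}{5}\sum_{i=1}^{5}\frac{1}{m}\sum_{x \in B_i}\nabla_\theta \ell(x;\theta)
= \frac{1}{5m}\sum_{x \in B_1 \cup \dots \cup B_5}\nabla_\theta \ell(x;\theta),
\]
i.e., the uniform average of the five local gradients equals the gradient of a single mini-batch of size $5m$ drawn from the union of the local distributions.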
@@ -413,10 +413,10 @@ The colors of a node, represented as a circle, correspond to the different class
 \label{fig:grid-iid-vs-non-iid-neighbourhood}
 \end{figure}
 
-In the IID case, since gradients are computed from examples of all classes, the resulting average gradient points in a direction that lowers the loss for all. However, in the non-IID case, not all classes are in the immediate neighbourhood. Therefore nodes diverge from one another according to the classes represented,% more than in the IID case.
-Moreover, as the distributed averaging algorithm takes several steps to converge, this variance persists between steps as the computed gradients are far from the global average.\footnote{It is possible, but impractical, to compensate with enough additional averaging steps.} This can significantly slow down convergence speed to the point of making parallel optimization impractical.
+In the IID case, since gradients are computed from samples of all classes, the resulting average gradient points in a direction that lowers the loss for all classes. However, in the non-IID case, not all classes are present in the immediate neighbourhood. Nodes therefore diverge from one another according to the classes represented. % more than in the IID case.
+In addition, as the distributed averaging algorithm takes several steps to converge, this variance persists between steps because the computed gradients remain far from the global average.\footnote{It is possible, but impractical, to compensate with enough additional averaging steps.} This can significantly slow down convergence, to the point of making parallel optimization impractical.
 
-In D-Cliques, we address the issues of non-iidness by carefully design the underlying network topology composed of \textit{cliques} and \textit{inter-clique connections}.
+In D-Cliques, we address the issue of non-IID data by carefully designing the underlying network topology, composed of \textit{cliques} and \textit{inter-clique connections}.
 \begin{itemize}
 \item D-Cliques recovers a balanced representation of classes, similar to that of the IID case, by modifying the topology such that each node is part of a \textit{clique} with neighbours representing all classes.
 \item To ensure all cliques converge, \textit{inter-clique connections} are introduced, established directly between nodes that are part of cliques.
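Since the itemized description above names the two structural ingredients of the topology, a small sketch of how they compose may help. The construction below (the helper `build_d_cliques`, the one-node-per-class grouping, and the ring of inter-clique edges) is a hypothetical illustration of the idea under the paper's simplifying assumptions, not the paper's construction algorithm.

```python
import itertools

def build_d_cliques(nodes_by_class):
    """Illustrative D-Cliques-style topology (sketch only).

    nodes_by_class: list of lists; nodes_by_class[c] holds the ids of the
    nodes whose local data belongs to class c. Assumes every class appears
    on the same number of nodes (assumption (2) in the text).
    """
    edges = set()
    # Group one node of each class into a clique, fully connected internally,
    # so every clique sees a balanced representation of all classes.
    cliques = [list(group) for group in zip(*nodes_by_class)]
    for clique in cliques:
        edges.update(itertools.combinations(sorted(clique), 2))
    # Inter-clique connections: here, a simple ring over the cliques, linking
    # one representative node of each clique to one node of the next clique.
    if len(cliques) > 1:
        for c1, c2 in zip(cliques, cliques[1:] + cliques[:1]):
            edges.add(tuple(sorted((c1[0], c2[0]))))
    return cliques, edges

# Example: 4 classes, 8 nodes (two per class) -> two cliques of size 4,
# joined by a single inter-clique edge.
cliques, edges = build_d_cliques([[0, 4], [1, 5], [2, 6], [3, 7]])
```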