symmetric, i.e. $W_{ij} = W_{ji}$, see \cite{lian2017d-psgd}.
As demonstrated in Figure~\ref{fig:iid-vs-non-iid-problem}, lifting the
assumption of IID data significantly challenges the learning algorithm. In
this paper, we focus on an \textit{extreme case of local class bias}: we
consider that each node only has examples
from a single class.
% Our results should generalize to lesser, and more
...
...
To isolate the effect of local class bias from other potentially compounding
factors, we make the following simplifying assumptions: (1) All classes are
equally represented in the global dataset; (2) All classes are represented on
the same number of nodes; (3) All nodes have the same number of examples.
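For concreteness, the sketch below shows one way to generate such a partition: each node receives examples of exactly one class, every class is split across the same number of nodes, and all shards have the same size. It is meant purely as an illustration of the setting; the function name and arguments are not taken from our experimental code.
\begin{verbatim}
import numpy as np

def single_class_partition(labels, nodes_per_class, seed=0):
    """Assign each node the indices of examples from exactly one class.

    Classes are assumed to be balanced in the global dataset; each
    class is split evenly across `nodes_per_class` nodes, so every
    node ends up with the same number of examples.
    """
    rng = np.random.default_rng(seed)
    shards = []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        rng.shuffle(idx)
        size = len(idx) // nodes_per_class  # drop any remainder
        for n in range(nodes_per_class):
            shards.append(idx[n * size:(n + 1) * size])
    return shards  # shards[i] = indices of node i's local dataset
\end{verbatim}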
We believe that these assumptions are reasonable in the context of our study
because: (1)
...
...
can remove much of the effect of local class bias.
We experiment with two datasets: MNIST~\cite{mnistWebsite} and
CIFAR10~\cite{krizhevsky2009learning}, which both have $c=10$ classes.
For MNIST, we use 45k and 10k examples from the original 60k
training set for training and validation respectively. The remaining 5k
training examples were randomly removed so that all 10 classes are balanced
while ensuring the dataset is evenly divisible across 100 and 1000 nodes.
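As a quick sanity check of these numbers (a standalone sketch, not part of our experimental pipeline), 45k examples split into 4.5k per class and divide evenly over both node counts:
\begin{verbatim}
train_size, num_classes = 45_000, 10
for num_nodes in (100, 1000):
    per_node = train_size // num_nodes          # 450 or 45 examples
    nodes_per_class = num_nodes // num_classes  # 10 or 100 nodes
    assert per_node * num_nodes == train_size
    assert (train_size // num_classes) % nodes_per_class == 0
\end{verbatim}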
We use all 10k examples of
the test set to measure test accuracy. For CIFAR10, classes are evenly
...
...
In this section, we present the design of D-Cliques. To give an intuition of our
% where each color represents a class of data.
The colors of a node represent the different classes it holds
locally. In the IID setting (Figure~\ref{fig:grid-iid-neighbourhood}), each
node has examples of all classes in equal proportions. In the non-IID setting
(Figure~\ref{fig:grid-non-iid-neighbourhood}), each node has examples of only a
single class and nodes are distributed randomly in the grid. A single training step, from the point of view of the center node, is equivalent to sampling a mini-batch five times larger from the union of the local distributions of all illustrated nodes.
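To make this equivalence explicit, assume each of the five nodes draws a local mini-batch $s_j$ of the same size $m$ and that mini-batch gradients are averages over their examples, with $F$ the loss and $x$ the current model; averaging the five local gradients then coincides with the gradient of the pooled mini-batch:
\[
\frac{1}{5}\sum_{j=1}^{5}\frac{1}{m}\sum_{\xi\in s_j}\nabla F(x;\xi)
  = \frac{1}{5m}\sum_{\xi\in s_1\cup\dots\cup s_5}\nabla F(x;\xi),
\]
i.e., the same estimate as a mini-batch of size $5m$ drawn from the union of the five local distributions.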
In the IID case, since gradients are computed from examples of all classes,
the resulting average gradient points in a direction that reduces the
...
...
of the local models across nodes.
inter-clique connections (see main text).}
\end{figure}
We address this problem by adding \emph{Clique Averaging} to D-SGD
(Algorithm~\ref{Algorithm:Clique-Unbiased-D-PSGD}), which essentially
decouples gradient averaging from model averaging. Only the gradients of
neighbors within the same clique are used to compute the average gradient,
...
...
models, including those across inter-clique edges, participate in the model
averaging step as in the original version.
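For illustration, one synchronous step of this scheme can be sketched as follows; this is a schematic simulation over all nodes with illustrative names, not the actual decentralized implementation, in which each node only exchanges gradients within its clique and models with its neighbors.
\begin{verbatim}
import numpy as np

def clique_averaging_step(models, grads, W, cliques, lr):
    """One synchronous step of D-SGD with Clique Averaging.

    models[i]  : parameters of node i (NumPy array)
    grads[i]   : mini-batch gradient computed by node i
    W          : mixing matrix (symmetric, doubly stochastic)
    cliques[i] : indices of node i's clique, including i itself
    """
    n = len(models)
    # Gradients are averaged only within each node's clique, so that
    # all classes are equally represented in the average.
    g = [np.mean([grads[j] for j in cliques[i]], axis=0)
         for i in range(n)]
    # Each node takes a local step with its clique-averaged gradient.
    half = [models[i] - lr * g[i] for i in range(n)]
    # Models are averaged over all neighbors, including those across
    # inter-clique edges, as in the original D-SGD update.
    return [sum(W[i][j] * half[j] for j in range(n) if W[i][j] > 0)
            for i in range(n)]
\end{verbatim}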
\begin{algorithm}[t]
\caption{D-SGD with Clique Averaging, Node $i$}
\label{Algorithm:Clique-Unbiased-D-PSGD}
\begin{algorithmic}[1]
\State\textbf{Require} initial model parameters $x_i^{(0)}$, learning rate $\gamma$, mixing weights $W$, mini-batch size $m$, number of steps $K$
\For{$k =1,\ldots, K$}
\State$s_i^{(k)}\gets\text{mini-batch sample of size $m$ drawn from~} D_i$