Commit 88c149f1 authored by Erick Lavoie

Added explanation on bias sources

parent e8566357
File added: figures/connected-cliques-bias.png (74 KiB)

@@ -219,17 +219,39 @@ In between, there might be enough redundancy in the dataset to arrange cliques i
\subsection{Decoupling Gradient Averaging from Weight Averaging}
Inter-clique connections create two sources of bias in regular D-PSGD with Metropolis-Hastings weights (CITE):
\begin{itemize}
\item non-uniform weights in the neighbourhood of nodes that are not connected to other cliques;
\item non-uniform class representation in the neighbourhood of nodes that are connected to other cliques.
\end{itemize}
The distributed averaging algorithm used by D-PSGD relies on a good choice of mixing weights for quick convergence, for which Metropolis-Hastings weights (CITE) provide a reasonable and inexpensive solution by considering only the immediate neighbours of every node. However, because D-PSGD averages models after a gradient step, it effectively assigns a different weight to the gradient of each neighbour.
\begin{figure}[htbp]
\centering
\includegraphics[width=0.7\textwidth]{figures/connected-cliques-bias}
\caption{\label{fig:connected-cliques-bias} Sources of Bias in Connected Cliques: Non-uniform weights in neighbours of A (A has a higher weight); Non-uniform class representation in neighbours of B (extra green node).}
\end{figure}
Figure~\ref{fig:connected-cliques-bias} illustrates the problem in the simplest case: two cliques connected by a single inter-clique edge, which joins the green node of the left clique to the purple node of the right clique. With a simple Metropolis-Hastings weight assignment such as the following:
\begin{equation}
W_{ij} = \begin{cases}
\frac{1}{\max(\text{degree}(i), \text{degree}(j)) + 1} & \text{if}~i \neq j \\
1 - \sum_{j \neq i} W_{ij} & \text{otherwise}
\end{cases}
\end{equation}
node A will have a weight of $\frac{12}{110}$, while all of A's neighbours will have a weight of $\frac{11}{110}$, except the green node connected to B, which will have a weight of $\frac{10}{110}$. This weight assignment therefore biases the gradient towards A's class and away from the green class. The same analysis holds for all other nodes without inter-clique edges. For node B, B and all of its neighbours have a weight of $\frac{1}{11}$; however, the green class is represented twice among them while every other class is represented only once, which biases the gradient towards the green class. The combined effect of these two sources of bias is to increase the variance between models after each D-PSGD training step.
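To make the weight computation above concrete, the following Python sketch (illustrative only, not part of the paper's implementation; the clique size of 10 and the node indices are assumptions inferred from the weights above) builds the two-clique topology of Figure~\ref{fig:connected-cliques-bias} and prints the Metropolis-Hastings weight rows of A and B:
\begin{verbatim}
from fractions import Fraction

n = 10                                  # assumed clique size (10 classes)
left = list(range(n))                   # clique of A (node 0); node 9 is green
right = list(range(n, 2 * n))           # clique of B (node 10)

edges = set()
for clique in (left, right):
    edges |= {(i, j) for i in clique for j in clique if i < j}
edges.add((9, 10))                      # single inter-clique edge (green -- B)

def neighbours(i):
    return [b if a == i else a for a, b in edges if i in (a, b)]

def weight_row(i):
    # Metropolis-Hastings weights of node i towards its neighbours,
    # completed by the self-weight so that the row sums to 1.
    row = {j: Fraction(1, max(len(neighbours(i)), len(neighbours(j))) + 1)
           for j in neighbours(i)}
    row[i] = 1 - sum(row.values())
    return row

print(weight_row(0))   # A: self 6/55 (= 12/110), clique peers 1/10, green 1/11
print(weight_row(10))  # B: every entry, including the self-weight, is 1/11
\end{verbatim}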
We solve this problem by decoupling gradient averaging from weight averaging, sending each in a separate round of messages. Only the gradients of neighbours within the same clique are used to compute the average gradient, which gives an equal representation to all classes; the model weights of all neighbours, including those across inter-clique edges, are still used to compute the distributed average of models, which ensures that all models eventually converge to the same value. The clique-unbiased version of D-PSGD is listed in Algorithm~\ref{Algorithm:Clique-Unbiased-D-PSGD}.
\begin{algorithm}[h]
\caption{D-Clique (Clique-Unbiased D-PSGD), Node $i$}
\label{Algorithm:Clique-Unbiased-D-PSGD}
\begin{algorithmic}[1]
\State \textbf{Require} initial model parameters $x_i^{(0)}$, learning rate $\gamma$, mixing weights $W$, number of steps $K$, loss function $F$
\For{$k = 1,\ldots, K$}
\State $s_i^{(k)} \gets \textit{sample from~} D_i$
\State $g_i^{(k)} \gets \frac{1}{|\textit{Clique}(i)|}\sum_{j \in \textit{Clique}(i)} \nabla F(x_j^{(k-1)}; s_j^{(k)})$
\State $x_i^{(k-\frac{1}{2})} \gets x_i^{(k-1)} - \gamma g_i^{(k)}$
\State $x_i^{(k)} \gets \sum_{j \in N} W_{ji}^{(k)} x_j^{(k-\frac{1}{2})}$
\EndFor
\end{algorithmic}
\end{algorithm}
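As a companion to the pseudocode, the following Python sketch shows one synchronous step of Algorithm~\ref{Algorithm:Clique-Unbiased-D-PSGD} over all nodes; it is a minimal illustration under assumed data structures (dictionaries keyed by node, a user-supplied \texttt{grad} function), not the actual implementation. In a real deployment each node computes its own gradient locally and exchanges it with its clique in the first round of messages, then exchanges its half-step model with all neighbours in the second round.
\begin{verbatim}
def d_clique_step(x, grad, sample, cliques, neighbours, W, lr):
    """One synchronous step of clique-unbiased D-PSGD (illustrative sketch).

    x          : dict node -> model parameters (e.g. a NumPy array)
    grad       : grad(params, batch) -> stochastic gradient
    sample     : dict node -> mini-batch drawn from that node's local data
    cliques    : dict node -> members of its clique (including itself)
    neighbours : dict node -> all of its neighbours (including itself)
    W          : mixing weights, W[j][i] = weight of node j's model at node i
    lr         : learning rate
    """
    # Round 1: average gradients within the clique only, so that every
    # class of the clique is represented exactly once.
    g = {i: sum(grad(x[j], sample[j]) for j in cliques[i]) / len(cliques[i])
         for i in x}

    # Local SGD half-step with the clique-unbiased gradient.
    x_half = {i: x[i] - lr * g[i] for i in x}

    # Round 2: average models over all neighbours, including those across
    # inter-clique edges, so that models still converge to a common value.
    return {i: sum(W[j][i] * x_half[j] for j in neighbours[i]) for i in x}
\end{verbatim}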
\section{Applications}
@@ -246,6 +268,13 @@ TODO: Update figure to use decoupled gradient averaging (will probably reduce va
\subsection{CIFAR10 and Convolutional Model}
We use momentum, which reuses the clique-unbiased gradient $g_i^{(k)}$ of Algorithm~\ref{Algorithm:Clique-Unbiased-D-PSGD}:
\begin{equation}
v_i^{(k)} \leftarrow m v_i^{(k-1)} + g_i^{(k)}
\end{equation}
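A minimal sketch of how this update can be applied at each node is shown below; the momentum factor $m$ and the use of $v_i^{(k)}$ in place of $g_i^{(k)}$ in the local SGD step follow the standard heavy-ball formulation and are assumptions here rather than a transcription of the actual code.
\begin{verbatim}
def momentum_step(x_i, v_i, g_i, lr, m=0.9):
    # g_i is the clique-unbiased gradient g_i^(k) of Algorithm 1;
    # v_i is the velocity from the previous step, initialised to zero;
    # m = 0.9 is a common default, assumed here.
    v_i = m * v_i + g_i        # v_i^(k) = m * v_i^(k-1) + g_i^(k)
    x_i = x_i - lr * v_i       # the velocity replaces g_i in the SGD step
    return x_i, v_i
\end{verbatim}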
In addition, it is important that all nodes are initialized with the same model values at the beginning. Otherwise, the random initialization of models introduces another source of variance that persists over many steps. In combination with D-Clique (Algorithm~\ref{Algorithm:Clique-Unbiased-D-PSGD}), this provides the convergence results of Figure~\ref{fig:d-cliques-cifar10-convolutional}. To assess how far this is from an ``optimal'' solution, in which the delay introduced by multiple hops between nodes is completely removed, we also show the convergence speed of a single node that computes its average gradient from the samples obtained by all nodes in a single round. The results show that, apart from the variance introduced by having multiple nodes, the average convergence speed is close to optimal.
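One simple way to obtain identical initial models, shown here purely as an illustration (the seed value and the flat parameter vector are hypothetical), is to draw the initial parameters once from a shared seed and copy them to every node:
\begin{verbatim}
import numpy as np

def init_nodes(n_nodes, n_params, seed=1234):
    rng = np.random.default_rng(seed)           # shared, agreed-upon seed
    x0 = rng.normal(scale=0.01, size=n_params)  # one common initial model
    return {i: x0.copy() for i in range(n_nodes)}
\end{verbatim}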
\begin{figure}[htbp]
\centering
\begin{subfigure}[b]{0.48\textwidth}
......