Commit ce2a41ab authored by aurelien.bellet
denote model parameters by theta to avoid conflict with feature vector x in def of skew

parent f5f9fd1b
@@ -173,15 +173,15 @@ averaging step as in the original version.
\caption{D-SGD with Clique Averaging, Node $i$}
\label{Algorithm:Clique-Unbiased-D-PSGD}
\begin{algorithmic}[1]
-\STATE \textbf{Require} initial model parameters $x_i^{(0)}$, learning
+\STATE \textbf{Require} initial model parameters $\theta_i^{(0)}$, learning
rate $\gamma$, mixing weights $W$, mini-batch size $m$, number of
steps $K$
\FOR{$k = 1,\ldots, K$}
\STATE $s_i^{(k)} \gets \text{mini-batch sample of size $m$ drawn
from~} D_i$
-\STATE $g_i^{(k)} \gets \frac{1}{|\textit{Clique}(i)|}\sum_{j \in \textit{Clique(i)}} \nabla F(x_j^{(k-1)}; s_j^{(k)})$
-\STATE $x_i^{(k-\frac{1}{2})} \gets x_i^{(k-1)} - \gamma g_i^{(k)}$
-\STATE $x_i^{(k)} \gets \sum_{j \in N} W_{ji}^{(k)} x_j^{(k-\frac{1}{2})}$
+\STATE $g_i^{(k)} \gets \frac{1}{|\textit{Clique}(i)|}\sum_{j \in \textit{Clique}(i)} \nabla F(\theta_j^{(k-1)}; s_j^{(k)})$
+\STATE $\theta_i^{(k-\frac{1}{2})} \gets \theta_i^{(k-1)} - \gamma g_i^{(k)}$
+\STATE $\theta_i^{(k)} \gets \sum_{j \in N} W_{ji}^{(k)} \theta_j^{(k-\frac{1}{2})}$
\ENDFOR
\end{algorithmic}
\end{algorithm}
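For illustration, here is a minimal NumPy sketch of one synchronous iteration of D-SGD with Clique Averaging; the helpers \texttt{sample\_batch} and \texttt{grad}, the array \texttt{theta} holding one parameter vector per node, the mixing matrix \texttt{W} and the clique membership lists are assumed stand-ins for the quantities defined in Algorithm~\ref{Algorithm:Clique-Unbiased-D-PSGD}.
\begin{verbatim}
import numpy as np

def clique_dsgd_iteration(theta, W, cliques, sample_batch, grad, gamma):
    """One synchronous iteration of D-SGD with Clique Averaging (sketch).

    theta        : (n, d) array, row j holds node j's current parameters
    W            : (n, n) mixing matrix, W[j, i] weights node j's model at node i
    cliques      : cliques[i] is the list of nodes in Clique(i), including i
    sample_batch : sample_batch(j) draws a mini-batch of size m from D_j
    grad         : grad(theta_j, batch) returns a stochastic gradient of F
    gamma        : learning rate
    """
    n = theta.shape[0]
    # Each node j computes a stochastic gradient on its own mini-batch.
    local_g = np.stack([grad(theta[j], sample_batch(j)) for j in range(n)])
    # Clique Averaging: node i averages the gradients of its clique members.
    g = np.stack([local_g[list(cliques[i])].mean(axis=0) for i in range(n)])
    # Local SGD half-step followed by weighted averaging with the neighbors.
    theta_half = theta - gamma * g
    return W.T @ theta_half
\end{verbatim}
The returned product implements $\theta_i^{(k)} = \sum_{j \in N} W_{ji} \theta_j^{(k-\frac{1}{2})}$ for all nodes at once.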
@@ -206,9 +206,9 @@ Clique Averaging (Section~\ref{section:clique-averaging})
allows us to compute an unbiased momentum from the
unbiased average gradient $g_i^{(k)}$ of Algorithm~\ref{Algorithm:Clique-Unbiased-D-PSGD}:
\begin{equation}
-v_i^{(k)} \leftarrow m v_i^{(k-1)} + g_i^{(k)}
+v_i^{(k)} \leftarrow m v_i^{(k-1)} + g_i^{(k)}.
\end{equation}
It then suffices to modify the original gradient step to use momentum:
\begin{equation}
-x_i^{(k-\frac{1}{2})} \leftarrow x_i^{(k-1)} - \gamma v_i^{(k)}
+\theta_i^{(k-\frac{1}{2})} \leftarrow \theta_i^{(k-1)} - \gamma v_i^{(k)}.
\end{equation}
\ No newline at end of file
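As a rough sketch, the momentum variant only changes the local half-step; the momentum coefficient (written $m$ above, \texttt{mom} below to avoid clashing with the mini-batch size) and the helper names are assumptions.
\begin{verbatim}
def momentum_step(theta_i, v_i, g_i, gamma, mom=0.9):
    """Local update with unbiased momentum (sketch).

    theta_i : current parameters of node i
    v_i     : momentum term of node i from the previous step
    g_i     : clique-averaged (unbiased) gradient of the current step
    gamma   : learning rate; mom is the momentum coefficient (illustrative value)
    """
    v_i = mom * v_i + g_i               # v_i^(k) = m v_i^(k-1) + g_i^(k)
    theta_half = theta_i - gamma * v_i  # half-step uses v_i instead of g_i
    return theta_half, v_i
\end{verbatim}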
@@ -6,16 +6,19 @@
We consider a set $N = \{1, \dots, n \}$ of $n$ nodes seeking to
collaboratively solve a classification task with $L$ classes. Each node has access to a local dataset that
-follows its own local distribution $D_i$. The goal is to find a global model
-$x$ that performs well on the union of the local distributions by minimizing
+follows its own local distribution $D_i$. The goal is to find the parameters
+$\theta$ of a global model that performs well on the union of the local
+distributions by
+minimizing
the average training loss:
\begin{equation}
-\min_{x} \frac{1}{n}\sum_{i=1}^{n} \mathds{E}_
-{s_i \sim D_i} [F_i(x;s_i)],
+\min_{\theta} \frac{1}{n}\sum_{i=1}^{n} \mathds{E}_
+{s_i \sim D_i} [F_i(\theta;s_i)],
\label{eq:dist-optimization-problem}
\end{equation}
where $s_i$ is a data example drawn from $D_i$ and $F_i$ is the loss function
-on node $i$. Therefore, $\mathds{E}_{s_i \sim D_i} F_i(x;s_i)$ denotes the
+on node $i$. Therefore, $\mathds{E}_{s_i \sim D_i} F_i(\theta;s_i)$ denotes
+the
expected loss of model $\theta$ on a random example $s_i$ drawn from $D_i$.
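For concreteness, the objective in Eq.~\eqref{eq:dist-optimization-problem} can be estimated empirically by replacing each expectation with an average over examples held by the corresponding node; the helper \texttt{loss} and the structure of \texttt{local\_datasets} below are assumptions.
\begin{verbatim}
import numpy as np

def global_objective(theta, local_datasets, loss):
    """Empirical estimate of (1/n) sum_i E_{s_i ~ D_i}[F_i(theta; s_i)] (sketch).

    theta          : candidate model parameters
    local_datasets : local_datasets[i] is a list of examples drawn from D_i
    loss           : loss(theta, s) evaluates the loss F_i(theta; s)
    """
    per_node = [np.mean([loss(theta, s) for s in D_i]) for D_i in local_datasets]
    return np.mean(per_node)  # average over the n nodes
\end{verbatim}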
To collaboratively solve Problem \eqref{eq:dist-optimization-problem}, each
@@ -30,7 +33,8 @@ Gradient Descent algorithm, aka D-SGD~\cite{lian2017d-psgd}. As
shown in Algorithm~\ref{Algorithm:D-PSGD},
a single iteration of D-SGD at node $i$ consists of sampling a mini-batch
from its local distribution
-$D_i$, updating its local model $x_i$ by taking a stochastic gradient descent
+$D_i$, updating its local model $\theta_i$ by taking a stochastic gradient
+descent
(SGD) step according to the mini-batch, and performing a weighted average of
its local model with those of its
neighbors.
@@ -63,14 +67,14 @@ topology $G$, namely:\todo{AB: if we need space we can remove this equation}
\caption{D-SGD, Node $i$}
\label{Algorithm:D-PSGD}
\begin{algorithmic}[1]
-\STATE \textbf{Require:} initial model parameters $x_i^{(0)}$,
+\STATE \textbf{Require:} initial model parameters $\theta_i^{(0)}$,
learning rate $\gamma$, mixing weights $W$, mini-batch size $m$,
number of steps $K$
\FOR{$k = 1,\ldots, K$}
\STATE $s_i^{(k)} \gets \text{mini-batch sample of size $m$ drawn
from~} D_i$
-\STATE $x_i^{(k-\frac{1}{2})} \gets x_i^{(k-1)} - \gamma \nabla F(x_i^{(k-1)}; s_i^{(k)})$
-\STATE $x_i^{(k)} \gets \sum_{j \in N} W_{ji}^{(k)} x_j^{(k-\frac{1}{2})}$
+\STATE $\theta_i^{(k-\frac{1}{2})} \gets \theta_i^{(k-1)} - \gamma \nabla F(\theta_i^{(k-1)}; s_i^{(k)})$
+\STATE $\theta_i^{(k)} \gets \sum_{j \in N} W_{ji}^{(k)} \theta_j^{(k-\frac{1}{2})}$
\ENDFOR
\end{algorithmic}
\end{algorithm}
\ No newline at end of file
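A minimal NumPy sketch of one synchronous iteration of Algorithm~\ref{Algorithm:D-PSGD} is given below; as before, \texttt{sample\_batch}, \texttt{grad}, the per-node parameter array \texttt{theta} and the mixing matrix \texttt{W} are assumed stand-ins for the quantities defined above.
\begin{verbatim}
import numpy as np

def dsgd_iteration(theta, W, sample_batch, grad, gamma):
    """One synchronous iteration of D-SGD (sketch).

    theta        : (n, d) array, row i holds node i's current parameters
    W            : (n, n) mixing matrix, W[j, i] weights node j's model at node i
    sample_batch : sample_batch(i) draws a mini-batch of size m from D_i
    grad         : grad(theta_i, batch) returns a stochastic gradient of F
    gamma        : learning rate
    """
    n = theta.shape[0]
    # Local SGD half-step on each node's own mini-batch.
    theta_half = np.stack([
        theta[i] - gamma * grad(theta[i], sample_batch(i)) for i in range(n)
    ])
    # Weighted averaging with neighbors: theta_i = sum_j W[j, i] * theta_half[j].
    return W.T @ theta_half
\end{verbatim}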