Denote model parameters by theta to avoid conflict with the feature vector x in the definition of skew

Commit ce2a41ab authored 3 years ago by aurelien.bellet
--- a/mlsys2022style/d-cliques.tex
+++ b/mlsys2022style/d-cliques.tex
@@ -173,15 +173,15 @@ averaging step as in the original version.
   \caption{D-SGD with Clique Averaging, Node $i$}
   \label{Algorithm:Clique-Unbiased-D-PSGD}
   \begin{algorithmic}[1]
-        \STATE \textbf{Require} initial model parameters $x_i^{(0)}$, learning
+        \STATE \textbf{Require} initial model parameters $\theta_i^{(0)}$, learning
        rate $\gamma$, mixing weights $W$, mini-batch size $m$, number of
        steps $K$
        \FOR{$k = 1,\ldots, K$}
         \STATE $s_i^{(k)} \gets \text{mini-batch sample of size $m$ drawn
          from~} D_i$
-          \STATE $g_i^{(k)} \gets \frac{1}{|\textit{Clique}(i)|}\sum_{j \in \textit{Clique(i)}}  \nabla F(x_j^{(k-1)}; s_j^{(k)})$
-          \STATE $x_i^{(k-\frac{1}{2})} \gets x_i^{(k-1)} - \gamma g_i^{(k)}$ 
-          \STATE $x_i^{(k)} \gets \sum_{j \in N} W_{ji}^{(k)} x_j^{(k-\frac{1}{2})}$
+          \STATE $g_i^{(k)} \gets \frac{1}{|\textit{Clique}(i)|}\sum_{j \in \textit{Clique(i)}}  \nabla F(\theta_j^{(k-1)}; s_j^{(k)})$
+          \STATE $\theta_i^{(k-\frac{1}{2})} \gets \theta_i^{(k-1)} - \gamma g_i^{(k)}$ 
+          \STATE $\theta_i^{(k)} \gets \sum_{j \in N} W_{ji}^{(k)} \theta_j^{(k-\frac{1}{2})}$
        \ENDFOR
   \end{algorithmic}
 \end{algorithm}
@@ -206,9 +206,9 @@ Clique Averaging (Section~\ref{section:clique-averaging})
 allows us to compute an unbiased momentum from the
 unbiased average gradient $g_i^{(k)}$ of Algorithm~\ref{Algorithm:Clique-Unbiased-D-PSGD}:
 \begin{equation}
-v_i^{(k)} \leftarrow m v_i^{(k-1)} +  g_i^{(k)} 
+v_i^{(k)} \leftarrow m v_i^{(k-1)} +  g_i^{(k)}.
 \end{equation}
 It then suffices to modify the original gradient step to use momentum:
 \begin{equation}
-x_i^{(k-\frac{1}{2})} \leftarrow x_i^{(k-1)} - \gamma v_i^{(k)} 
+\theta_i^{(k-\frac{1}{2})} \leftarrow \theta_i^{(k-1)} - \gamma v_i^{(k)}.
 \end{equation}
\ No newline at end of file
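
To make the renamed update rules above concrete, here is a minimal Python sketch of one iteration of D-SGD with Clique Averaging and momentum, run for all nodes at once. It is an illustration under stated assumptions, not the authors' implementation: the function name `clique_momentum_step`, the `clique_of` list (assumed to contain node `i` itself), the gradient oracle `grad_fn`, and the plain-list representation of $\theta_i$ and $v_i$ are all hypothetical. The argument `momentum` plays the role of $m$ in the momentum equation.

```python
def clique_momentum_step(thetas, velocities, clique_of, W, grad_fn,
                         batches, gamma, momentum):
    """One iteration for all n nodes (hypothetical sketch).

    thetas[i]     : theta_i^(k-1), a list of d floats
    velocities[i] : v_i^(k-1), a list of d floats
    clique_of[i]  : node indices in Clique(i), assumed to include i
    W[j][i]       : mixing weight that node i applies to node j
    grad_fn       : assumed stochastic gradient oracle grad_fn(theta, batch)
    """
    n, d = len(thetas), len(thetas[0])
    # Each node evaluates its own stochastic gradient at theta_j^(k-1).
    grads = [grad_fn(thetas[j], batches[j]) for j in range(n)]
    new_v, half = [], []
    for i in range(n):
        clique = clique_of[i]
        # g_i^(k): average of the gradients computed within Clique(i).
        g = [sum(grads[j][c] for j in clique) / len(clique) for c in range(d)]
        # v_i^(k) <- m * v_i^(k-1) + g_i^(k)
        v = [momentum * velocities[i][c] + g[c] for c in range(d)]
        new_v.append(v)
        # theta_i^(k-1/2) <- theta_i^(k-1) - gamma * v_i^(k)
        half.append([thetas[i][c] - gamma * v[c] for c in range(d)])
    # theta_i^(k) <- sum_j W_ji * theta_j^(k-1/2)
    new_thetas = [
        [sum(W[j][i] * half[j][c] for j in range(n)) for c in range(d)]
        for i in range(n)
    ]
    return new_thetas, new_v
```

Setting `momentum` to zero recovers the plain Clique Averaging update, since $v_i^{(k)}$ then equals $g_i^{(k)}$.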
--- a/mlsys2022style/setting.tex
+++ b/mlsys2022style/setting.tex
@@ -6,16 +6,19 @@

 We consider a set $N = \{1, \dots, n \}$ of $n$ nodes seeking to
 collaboratively solve a classification task with $L$ classes. Each node has access to a local dataset that
- follows its own local distribution $D_i$. The goal is to find a global model
- $x$ that performs well on the union of the local distributions by minimizing
+ follows its own local distribution $D_i$. The goal is to find the parameters
+ $\theta$ of a global model that performs well on the union of the local
+ distributions by
+ minimizing
 the average training loss:
 \begin{equation}
-\min_{x} \frac{1}{n}\sum_{i=1}^{n} \mathds{E}_
-{s_i \sim D_i} [F_i(x;s_i)],
+\min_{\theta} \frac{1}{n}\sum_{i=1}^{n} \mathds{E}_
+{s_i \sim D_i} [F_i(\theta;s_i)],
 \label{eq:dist-optimization-problem}
 \end{equation}
 where $s_i$ is a data example drawn from $D_i$ and $F_i$ is the loss function
-on node $i$. Therefore, $\mathds{E}_{s_i \sim D_i} F_i(x;s_i)$ denotes  the
+on node $i$. Therefore, $\mathds{E}_{s_i \sim D_i} F_i(\theta;s_i)$ denotes 
+the
 expected loss of model $x$ on a random example $s_i$ drawn from $D_i$.

 To collaboratively solve Problem \eqref{eq:dist-optimization-problem}, each
@@ -30,7 +33,8 @@ Gradient Descent algorithm, aka D-SGD~\cite{lian2017d-psgd}. As
 shown in Algorithm~\ref{Algorithm:D-PSGD},
 a single iteration of D-SGD at node $i$ consists of sampling a mini-batch
 from its local distribution
-$D_i$, updating its local model $x_i$ by taking a stochastic gradient descent 
+$D_i$, updating its local model $\theta_i$ by taking a stochastic gradient
+descent
 (SGD) step according to the mini-batch, and performing a weighted average of
 its local model with those of its
 neighbors.
@@ -63,14 +67,14 @@ topology $G$, namely:\todo{AB: if we need space we can remove this equation}
   \caption{D-SGD, Node $i$}
   \label{Algorithm:D-PSGD}
   \begin{algorithmic}[1]
-        \STATE \textbf{Require:} initial model parameters $x_i^{(0)}$,
+        \STATE \textbf{Require:} initial model parameters $\theta_i^{(0)}$,
        learning rate $\gamma$, mixing weights $W$, mini-batch size $m$,
        number of steps $K$
        \FOR{$k = 1,\ldots, K$}
          \STATE $s_i^{(k)} \gets \text{mini-batch sample of size $m$ drawn
          from~} D_i$
-          \STATE $x_i^{(k-\frac{1}{2})} \gets x_i^{(k-1)} - \gamma \nabla F(x_i^{(k-1)}; s_i^{(k)})$ 
-          \STATE $x_i^{(k)} \gets \sum_{j \in N} W_{ji}^{(k)} x_j^{(k-\frac{1}{2})}$
+          \STATE $\theta_i^{(k-\frac{1}{2})} \gets \theta_i^{(k-1)} - \gamma \nabla F(\theta_i^{(k-1)}; s_i^{(k)})$ 
+          \STATE $\theta_i^{(k)} \gets \sum_{j \in N} W_{ji}^{(k)} \theta_j^{(k-\frac{1}{2})}$
        \ENDFOR
   \end{algorithmic}
 \end{algorithm}
\ No newline at end of file
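
As a reading aid for the renamed D-SGD algorithm above, the following Python sketch simulates one iteration for all nodes at once. It is a sketch under assumptions rather than the paper's code: the names `dsgd_step` and `quad_grad`, the toy quadratic loss, and the plain-list representation of each $\theta_i$ are hypothetical choices made only for illustration.

```python
def dsgd_step(thetas, W, grad_fn, batches, gamma):
    """One D-SGD iteration for all n nodes (hypothetical sketch).

    thetas[i]  : theta_i^(k-1), a list of d floats
    W[j][i]    : mixing weight that node i applies to node j
    grad_fn    : assumed stochastic gradient oracle grad_fn(theta, batch)
    batches[i] : mini-batch s_i^(k) drawn from D_i
    """
    n, d = len(thetas), len(thetas[0])
    half = []
    for i in range(n):
        g = grad_fn(thetas[i], batches[i])
        # theta_i^(k-1/2) <- theta_i^(k-1) - gamma * grad F(theta_i^(k-1); s_i^(k))
        half.append([thetas[i][c] - gamma * g[c] for c in range(d)])
    # theta_i^(k) <- sum_j W_ji * theta_j^(k-1/2)
    return [
        [sum(W[j][i] * half[j][c] for j in range(n)) for c in range(d)]
        for i in range(n)
    ]


def quad_grad(theta, batch):
    """Gradient of the toy loss 0.5 * ||theta - mean(batch)||^2."""
    d = len(theta)
    mean = [sum(s[c] for s in batch) / len(batch) for c in range(d)]
    return [theta[c] - mean[c] for c in range(d)]


# Three nodes on a fully connected topology with uniform mixing weights.
W = [[1.0 / 3.0] * 3 for _ in range(3)]
batches = [[[1.0]], [[2.0]], [[3.0]]]
thetas = dsgd_step([[0.0], [0.0], [0.0]], W, quad_grad, batches, gamma=1.0)
```

With uniform weights on a fully connected graph, every node ends the iteration with the same parameters, which is the averaging behaviour that the mixing step is meant to provide.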