Commit 6daebf4a authored by aurelien.bellet

define points as x,y instead of s

parent ce2a41ab
@@ -177,9 +177,10 @@ averaging step as in the original version.
 rate $\gamma$, mixing weights $W$, mini-batch size $m$, number of
 steps $K$
 \FOR{$k = 1,\ldots, K$}
-\STATE $s_i^{(k)} \gets \text{mini-batch sample of size $m$ drawn
+\STATE $S_i^{(k)} \gets \text{mini-batch of $m$ samples drawn
 from~} D_i$
-\STATE $g_i^{(k)} \gets \frac{1}{|\textit{Clique}(i)|}\sum_{j \in \textit{Clique(i)}} \nabla F(\theta_j^{(k-1)}; s_j^{(k)})$
+\STATE $g_i^{(k)} \gets \frac{1}{|\textit{Clique}(i)|}\sum_{j \in
+\textit{Clique(i)}} \nabla F(\theta_j^{(k-1)}; S_j^{(k)})$
 \STATE $\theta_i^{(k-\frac{1}{2})} \gets \theta_i^{(k-1)} - \gamma g_i^{(k)}$
 \STATE $\theta_i^{(k)} \gets \sum_{j \in N} W_{ji}^{(k)} \theta_j^{(k-\frac{1}{2})}$
 \ENDFOR
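For reference, the step modified in the hunk above has each node $i$ average the mini-batch gradients computed by all nodes in $\textit{Clique}(i)$ at their own parameters, before the usual local update and mixing step. A minimal NumPy sketch of one such round, assuming flat parameter vectors; grad_F and sample_minibatch are hypothetical helper names, not from the paper:

    import numpy as np

    def clique_dsgd_round(thetas, W, cliques, datasets, grad_F,
                          sample_minibatch, gamma, m):
        # thetas: (n, d) array, row i holds node i's current model theta_i^(k-1)
        # W: (n, n) mixing matrix, W[j, i] weights node j's model at node i
        # cliques[i]: node indices in Clique(i), assumed to contain i itself
        n = thetas.shape[0]
        grads = np.empty_like(thetas)
        for j in range(n):
            X, y = sample_minibatch(datasets[j], m)       # S_j^(k) ~ D_j
            grads[j] = grad_F(thetas[j], X, y)            # grad F(theta_j^(k-1); S_j^(k))
        half = np.empty_like(thetas)
        for i in range(n):
            g_i = grads[list(cliques[i])].mean(axis=0)    # average over Clique(i)
            half[i] = thetas[i] - gamma * g_i             # theta_i^(k-1/2)
        return W.T @ half                                 # theta_i^(k) = sum_j W[j,i] theta_j^(k-1/2)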
@@ -5,7 +5,12 @@
 \label{section:problem}
 We consider a set $N = \{1, \dots, n \}$ of $n$ nodes seeking to
-collaboratively solve a classification task with $L$ classes. Each node has access to a local dataset that
+collaboratively solve a classification task with $c$ classes. We denote a
+labeled data point by a tuple $(x,y)$ where $x$ represents the data point
+(e.g., a feature vector) and $y\in\{1,\dots,c\}$ its label.
+Each
+node has
+access to a local dataset that
 follows its own local distribution $D_i$. The goal is to find the parameters
 $\theta$ of a global model that performs well on the union of the local
 distributions by
@@ -13,13 +18,16 @@ collaboratively solve a classification task with $L$ classes. Each node has acce
 the average training loss:
 \begin{equation}
 \min_{\theta} \frac{1}{n}\sum_{i=1}^{n} \mathds{E}_
-{s_i \sim D_i} [F_i(\theta;s_i)],
+{(x_i,y_i) \sim D_i} [F_i(\theta;x_i,y_i)],
 \label{eq:dist-optimization-problem}
 \end{equation}
-where $s_i$ is a data example drawn from $D_i$ and $F_i$ is the loss function
-on node $i$. Therefore, $\mathds{E}_{s_i \sim D_i} F_i(\theta;s_i)$ denotes
+where $(x_i,y_i)$ is a data point drawn from $D_i$ and $F_i$ is the loss
+function
+on node $i$. Therefore, $\mathds{E}_{(x_i,y_i) \sim D_i} F_i(\theta;x_i,y_i)$
+denotes
 the
-expected loss of model $x$ on a random example $s_i$ drawn from $D_i$.
+expected loss of model $\theta$ over the local data distribution
+$D_i$.
 
 To collaboratively solve Problem \eqref{eq:dist-optimization-problem}, each
 node can exchange messages with its neighbors in an undirected network graph
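In practice each node only holds a finite sample from $D_i$, so the expectations in the objective above are approximated by empirical averages over the local datasets. A sketch of this standard empirical counterpart (not stated in the hunk; $D_i$ is reused, by slight abuse of notation, for node $i$'s finite dataset):

    \[
      \min_{\theta}\ \frac{1}{n}\sum_{i=1}^{n} \frac{1}{|D_i|}
      \sum_{(x,y) \in D_i} F_i(\theta; x, y)
    \]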
@@ -31,7 +39,7 @@ between nodes $i$ and $j$.
 In this work, we use the popular Decentralized Stochastic
 Gradient Descent algorithm, aka D-SGD~\cite{lian2017d-psgd}. As
 shown in Algorithm~\ref{Algorithm:D-PSGD},
-a single iteration of D-SGD at node $i$ consists of sampling a mini-batch
+a single iteration of D-SGD at node $i$ consists in sampling a mini-batch
 from its local distribution
 $D_i$, updating its local model $\theta_i$ by taking a stochastic gradient
 descent
@@ -71,9 +79,10 @@ topology $G$, namely:\todo{AB: if we need space we can remove this equation}
 learning rate $\gamma$, mixing weights $W$, mini-batch size $m$,
 number of steps $K$
 \FOR{$k = 1,\ldots, K$}
-\STATE $s_i^{(k)} \gets \text{mini-batch sample of size $m$ drawn
+\STATE $S_i^{(k)} \gets \text{mini-batch of $m$ samples drawn
 from~} D_i$
-\STATE $\theta_i^{(k-\frac{1}{2})} \gets \theta_i^{(k-1)} - \gamma \nabla F(\theta_i^{(k-1)}; s_i^{(k)})$
+\STATE $\theta_i^{(k-\frac{1}{2})} \gets \theta_i^{(k-1)} - \gamma
+\nabla F(\theta_i^{(k-1)}; S_i^{(k)})$
 \STATE $\theta_i^{(k)} \gets \sum_{j \in N} W_{ji}^{(k)} \theta_j^{(k-\frac{1}{2})}$
 \ENDFOR
 \end{algorithmic}
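The D-SGD iteration shown in this last hunk (local mini-batch gradient step followed by neighbourhood averaging with the mixing weights $W$) can be sketched in NumPy as follows; as above, grad_F and sample_minibatch are hypothetical helpers and parameters are assumed to be flat vectors:

    import numpy as np

    def dsgd_round(thetas, W, datasets, grad_F, sample_minibatch, gamma, m):
        # One synchronous round of D-SGD over all n nodes.
        # thetas: (n, d) array of current models theta_i^(k-1)
        # W: (n, n) mixing matrix, W[j, i] weights node j's model at node i
        n = thetas.shape[0]
        half = np.empty_like(thetas)
        for i in range(n):
            X, y = sample_minibatch(datasets[i], m)                  # S_i^(k) ~ D_i
            half[i] = thetas[i] - gamma * grad_F(thetas[i], X, y)   # local SGD step
        return W.T @ half                                            # mixing step

In D-SGD the mixing matrix $W$ is typically doubly stochastic and supported on the edges of the graph $G$, so each node only averages with its direct neighbours.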