\item[\textbf{B.1}]\textit{Compute and output the global average rating ($\bar r_{\bullet,\bullet}$), the average rating for user 1 ($\bar r_{1,\bullet}$), the average rating for item 1 ($\bar r_{\bullet,1}$), the average deviation for item 1 ($\bar{\hat r}_{\bullet,1}$), and the predicted rating of user 1 for item 1 ($p_{1,1}$, Eq.~\ref{eq:baseline}) using \texttt{data/ml-100k/u2.base} for training. When computing the item average for items that do not have ratings in the training set, use the global average ($\bar r_{\bullet, \bullet}$). When making predictions for items that are not in the training set, use the user average if present, otherwise the global average.}
\item [\textbf{B.2}] \textit{Compute the prediction accuracy (average MAE on \texttt{ml-100k/u2.test}) of the previous methods ($\bar r_{\bullet, \bullet}$, $\bar r_{u,\bullet}$, $\bar r_{\bullet,i}$) and that of the proposed baseline ($p_{u,i}$, Eq.~\ref{eq:baseline}). }
\item [\textbf{B.3}] \textit{Measure the time required for computing the MAE for all ratings in the test set (\texttt{ml-100k/u2.test}) with all four methods by recording the current time before and after (ex: with \texttt{System.nanoTime()} in Scala). The duration is the difference between the two. }
\textit{
Include the time for computing all values required to obtain the answer from the input dataset provided in the template: recompute from scratch all necessary values, even if they are already available from previous results (ex: the global average $\bar r_{\bullet, \bullet}$). Also store the results in an auxiliary data structure (ex: \texttt{Seq[(mae, timing)]}) while performing the measurements, so that the compiler does not optimize away computations whose results would otherwise be unused.} A sketch of such a measurement loop is shown after this list.
\textit{
For all four methods, perform three measurements and compute the average and standard deviation.}
\textit{In your report, show in a figure the relationship between prediction precision (MAE) on the x axis and computation time on the y axis, including the standard deviation. Also report the technical specifications (model, CPU speed, RAM, OS, Scala language version, and JVM version) of the machine on which you ran the tests. Which of the four prediction methods is the most expensive to compute? Is the relationship between MAE and computation time linear? What do you conclude about the computing needs of more accurate prediction methods?}
\end{itemize}
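The measurement loop for \textbf{B.3} can be organized along the following lines. This is only a minimal sketch: it assumes ratings have been parsed into \texttt{(user, item, rating)} triples, and the names (\texttt{Rating}, \texttt{computeMae}, \texttt{timedMae}) are illustrative rather than part of the provided template.
\begin{verbatim}
case class Rating(user: Int, item: Int, rating: Double)

def mean(xs: Iterable[Double]): Double = xs.sum / xs.size

// MAE of an arbitrary predictor over the test set.
def computeMae(test: Seq[Rating], predict: (Int, Int) => Double): Double =
  mean(test.map(r => math.abs(predict(r.user, r.item) - r.rating)))

// Run the full computation several times, keeping (MAE, duration) pairs so
// that the compiler/JIT cannot discard the work as unused.
def timedMae(run: () => Double, repetitions: Int = 3): Seq[(Double, Double)] =
  (1 to repetitions).map { _ =>
    val start = System.nanoTime()
    val mae   = run()                     // recompute everything from scratch
    val stop  = System.nanoTime()
    (mae, (stop - start) / 1e6)           // duration in milliseconds
  }

// Example with the global-average predictor:
// val measurements = timedMae(() => {
//   val globalAvg = mean(train.map(_.rating))
//   computeMae(test, (_, _) => globalAvg)
// })
// val avg = mean(measurements.map(_._2))
// val std = math.sqrt(mean(measurements.map(m => math.pow(m._2 - avg, 2))))
\end{verbatim}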
\section{Spark Distribution Overhead}
\subsection{Questions}
\label{section:q5}
Implement $p_{u,i}$ using Spark RDDs. Your distributed implementation should give the same results as your previous implementation using Scala's standard library. Once your implementation works well with \texttt{data/ml-100k/u2.base} and \texttt{data/ml-100k/u2.test}, stress test its performance with the bigger \newline\texttt{data/ml-25m/r2.train} and \texttt{data/ml-25m/r2.test}.
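As a starting point, the pre-computation step maps naturally onto RDD transformations. The sketch below assumes an existing \texttt{SparkContext} named \texttt{sc} and tab-separated \texttt{(user, item, rating, timestamp)} lines as in ml-100k; adapt the parsing to the actual file format, and note that the function names are illustrative, not part of the template.
\begin{verbatim}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Parse a ratings file into an RDD of (user, item, rating) triples.
def loadRatings(sc: SparkContext, path: String): RDD[(Int, Int, Double)] =
  sc.textFile(path).map { line =>
    val cols = line.split("\t")
    (cols(0).toInt, cols(1).toInt, cols(2).toDouble)
  }

// val train     = loadRatings(sc, "data/ml-100k/u2.base")
// val globalAvg = train.map(_._3).mean()
// val userAvg   = train.map(r => (r._1, (r._3, 1)))
//   .reduceByKey { case ((s1, c1), (s2, c2)) => (s1 + s2, c1 + c2) }
//   .mapValues { case (s, c) => s / c }
\end{verbatim}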
\begin{itemize}
\item [\textbf{D.1}] \textit{Ensure the results of your distributed implementation are consistent with \textbf{B.1} and \textbf{B.2} on \texttt{data/ml-100k/u2.base} and \texttt{data/ml-100k/u2.test}. Compute and output the global average rating ($\bar r_{\bullet,\bullet}$), the average rating for user 1 ($\bar r_{1,\bullet}$), the average rating for item 1 ($\bar r_{\bullet,1}$), the average deviation for item 1 ($\bar{\hat r}_{\bullet,1}$), and the predicted rating of user 1 for item 1 ($p_{1,1}$, Eq.~\ref{eq:baseline}). Compute the prediction accuracy (average MAE on \texttt{ml-100k/u2.test}) of the proposed baseline ($p_{u,i}$, Eq.~\ref{eq:baseline}).}
\item [\textbf{D.2}] \textit{Measure the combined time to (1) pre-compute the required baseline values for predictions and (2) predict all values of the test set on the 25M dataset, \texttt{data/ml-25m/r2.train} and \texttt{data/ml-25m/r2.test}. Compare the time required by your implementation using Scala's standard library (\textbf{B.1} and \textbf{B.2}) on your machine with the time required by your new distributed implementation using Spark on \texttt{iccluster028}. Use 1 and 4 executors for Spark and repeat all three experiments (predict.Baseline, distributed.Baseline with 1 worker, distributed.Baseline with 4 workers) 3 times. Write in your report the average and standard deviation for all three experiments, as well as the specifications of the machine on which you ran the tests (similar to \textbf{B.3}).}
\textit{As a comparison, our reference implementation runs in 44s on the cluster with 4 workers. Ensure you obtain results roughly in the same ballpark or faster. Don't worry if your code is slower during some executions because the cluster is busy.}
\textit{Try optimizing your local Scala implementation by avoiding temporary objects, preferring instead mutable collections and data structures. Can you make it faster, running locally on your machine without Spark, than on the cluster with 4 workers? Explain in your report the changes you made to make your code faster.} A small illustration of this kind of change is shown after this list.
\end{itemize}
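As a rough illustration of the kind of optimization suggested in \textbf{D.2}, the sketch below accumulates per-user sums and counts in mutable maps instead of building intermediate grouped collections. The names and data layout are assumptions, not the template's code.
\begin{verbatim}
import scala.collection.mutable

// Per-user averages without groupBy: a single pass over the ratings,
// updating mutable accumulators in place.
def userAverages(ratings: Array[(Int, Int, Double)]): Map[Int, Double] = {
  val sums   = mutable.HashMap.empty[Int, Double]
  val counts = mutable.HashMap.empty[Int, Int]
  var i = 0
  while (i < ratings.length) {
    val (user, _, rating) = ratings(i)
    sums(user)   = sums.getOrElse(user, 0.0) + rating
    counts(user) = counts.getOrElse(user, 0) + 1
    i += 1
  }
  sums.iterator.map { case (user, s) => user -> s / counts(user) }.toMap
}
\end{verbatim}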
\section{\textit{Personalized} Predictions}
\subsection{Questions}
\label{section:q1}
We will use \texttt{ml-100k/u3.base} and \texttt{ml-100k/u3.test} for grading, so ensure your code runs without errors on them.
\begin{itemize}
\item [\textbf{P.1}] \textit{Using uniform similarities of 1 between all users, compute the predicted rating of user 1 for item 1 ($p_{1,1}$) and the prediction accuracy (MAE on \texttt{ml-100k/u2.test}) of the personalized baseline predictor.}
\item [\textbf{P.2}] \textit{Using the adjusted cosine similarity (Eq.~\ref{eq:similarity}), compute the similarity between user $1$ and user $2$ ($s_{1,2}$), the predicted rating of user 1 for item 1 ($p_{1,1}$, Eq.~\ref{eq:personalized-prediction}), and the prediction accuracy (MAE on \texttt{ml-100k/u2.test}) of the personalized baseline predictor.}
\item [\textbf{P.3}] \textit{Implement the Jaccard Coefficient\footnote{\url{https://en.wikipedia.org/wiki/Jaccard_index}}. Provide the mathematical formulation of your similarity metric in your report. Using the Jaccard similarity, compute the similarity between user $1$ and user $2$ ($s_{1,2}$), the predicted rating of user 1 for item 1 ($p_{1,1}$, Eq.~\ref{eq:personalized-prediction}), and the prediction accuracy (MAE on \texttt{ml-100k/u2.test}) of the personalized baseline predictor. Is the Jaccard Coefficient better or worse than the Adjusted Cosine similarity?} A sketch of this coefficient is shown after this list.
\end{itemize}
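For \textbf{P.3}, the Jaccard coefficient between users $u$ and $v$ can be defined on the sets of items each user has rated, $s_{u,v} = \frac{|I_u \cap I_v|}{|I_u \cup I_v|}$, where $I_u$ is the set of items rated by $u$. A minimal sketch (the map \texttt{itemsRatedBy} is a hypothetical structure built from the training set):
\begin{verbatim}
// Jaccard similarity between the sets of items rated by users u and v.
def jaccard(u: Int, v: Int, itemsRatedBy: Map[Int, Set[Int]]): Double = {
  val iu = itemsRatedBy.getOrElse(u, Set.empty[Int])
  val iv = itemsRatedBy.getOrElse(v, Set.empty[Int])
  val unionSize = (iu union iv).size
  if (unionSize == 0) 0.0
  else (iu intersect iv).size.toDouble / unionSize
}
\end{verbatim}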
\section{Neighbourhood-Based Predictions}
\subsection{Questions}
\label{section:q2}
\begin{itemize}
\item [\textbf{N.1}] \textit{Implement the k-NN predictor. Do not include self-similarity in the k-nearest neighbours. Using $k=10$ and \texttt{data/ml-100k/u2.base} for training, output the similarities between: (1) user $1$ and itself; (2) user $1$ and user $864$; (3) user $1$ and user $886$. Still using $k=10$, output the prediction for user 1 and item 1 ($p_{1,1}$), and make sure that you obtain an MAE of $0.8287\pm0.0001$ on \texttt{data/ml-100k/u2.test}.} A sketch of the neighbour selection is shown after this list.
\item [\textbf{N.2}] \textit{Report the MAE on \texttt{data/ml-100k/u2.test} for $k \in \{10, 30, 50, 100, 200, 300, 400, 800, 942\}$. What is the lowest $k$ such that the MAE is lower than for the baseline (non-personalized) method?}
\item [\textbf{N.3}] \label{q-total-time}\textit{Measure the time required for computing predictions (without using Spark) on \texttt{data/ml-100k/u2.test}. Include the time to train the predictor on \newline\texttt{data/ml-100k/u2.base}, including the computation of the similarities $s_{u,v}$, using $k=300$. Try reducing the computation time with alternative implementation techniques (making sure you keep obtaining the same results). Mention in your report which alternatives you tried, which ones were fastest, and by how much. The teams with the correct answer and shortest times on a secret test set will obtain more points on this question.}
\end{itemize}
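The neighbour selection in \textbf{N.1} amounts to keeping, for each user, the $k$ most similar other users. A minimal sketch with illustrative names (\texttt{similarity} stands for whichever $s_{u,v}$ you computed):
\begin{verbatim}
// Keep the k most similar users to u, excluding u itself.
def topKNeighbours(u: Int, users: Seq[Int], k: Int,
                   similarity: (Int, Int) => Double): Seq[(Int, Double)] =
  users
    .filter(_ != u)                        // no self-similarity
    .map(v => (v, similarity(u, v)))
    .sortBy { case (_, s) => -s }          // most similar first
    .take(k)
\end{verbatim}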
\section{Recommendation}
\subsection{Questions}
\label{section:q4}
\begin{itemize}
\item [\textbf{R.1}] \textit{Train a k-NN predictor with training data from \texttt{data/ml-100k/u.data}, augmented with additional ratings from user "$944$" provided in \texttt{personal.csv}, using adjusted cosine similarity and $k=300$. Report the prediction for user 1 and item 1 ($p_{1,1}$).}
\item [\textbf{R.2}] \textit{Report the top 3 recommendations for user "$944$" using the same k-NN predictor as for \textbf{R.1}. Include the movie identifier, the movie title, and the prediction score in the output. If additional recommendations have the same predicted value as the top 3 recommendations, prioritize the movies with the smallest identifiers in your top 3 (ex: if the top 8 recommendations all have predicted scores of \texttt{5.0}, choose the 3 with the smallest ids) so your results do not depend on the initial permutation of the recommendations.}