**/*.swp
data/.DS_Store
data/ml-100k
data/ml-25m
project/project
project/target
src/main/scala/project/
src/main/scala/target/
target/
logs/
\documentclass{article}
\usepackage{hyperref}
\usepackage{algorithm}
\usepackage{algpseudocode}
\usepackage{dsfont}
\usepackage{amsmath}
\usepackage{filemod}
\usepackage{ulem}
\usepackage{graphicx}
\usepackage{todonotes}
\input{Milestone-2-questions.sty}
% If you use BibTeX in apalike style, activate the following line:
\bibliographystyle{acm}
\title{CS-449 Project Milestone 2: Optimizing, Scaling, and Economics}
\author{
\textbf{Name}: xxx\\
\textbf{Sciper}: xxx\\
\textbf{Email:} xxx\\
\textbf{Name}: xxx\\
\textbf{Sciper}: xxx\\
\textbf{Email:} xxx\\
}
\begin{document}
\maketitle
\section{Optimizing with Breeze, a Linear Algebra Library}
\label{section:optimization}
\begin{itemize}
\item[\textbf{BR.1}] \BROne
\item [\textbf{BR.2}] \BRTwoOne
\BRTwoTwo
\BRTwoThree
\end{itemize}
\section{Parallel k-NN Computations with Replicated Ratings}
\begin{enumerate}
\item [\textbf{EK.1}] \EKOne
\item [\textbf{EK.2}] \EKTwo
\end{enumerate}
\section{Distributed Approximate k-NN}
\begin{enumerate}
\item [\textbf{AK.1}] \AKOne
\item [\textbf{AK.2}] \AKTwo
\item [\textbf{AK.3}] \AKThree
\end{enumerate}
\section{Economics}
\textit{Implement the computations for the different answers in the Economics.scala file. You don't need to provide unit tests for this question, nor written answers for these questions in your report.}
\end{document}
\ProvidesPackage{m2questions}[2022/03/11 v1.0]
% Breeze
\newcommand{\BROne}{
Reimplement the kNN predictor of Milestone 1 using the Breeze library and without using Spark. Using $k=10$ and \texttt{data/ml-100k/u2.base} for training, output the similarities between: (1) user $1$ and itself; (2) user $1$ and user $864$; (3) user $1$ and user $886$. Still using $k=10$, output the prediction for user 1 and item 1 ($p_{1,1}$), the prediction for user 327 and item 2 ($p_{327,2}$), and make sure that you obtain an MAE of $0.8287 \pm 0.0001$ on \texttt{data/ml-100k/u2.test}.
}
\newcommand{\BRTwoOne}{
Try making your implementation as fast as possible, both for computing all k-nearest neighbours and for computing the predictions and MAE on a test set. Your implementation should be based around \texttt{CSCMatrix}, but may involve conversions for individual operations. We will test your implementation on a secret test set. The teams with both a correct answer and the shortest time will receive more points.
}
\newcommand{\BRTwoTwo}{
Using $k=300$, compare the time for predicting all values and computing the MAE of \texttt{ml-100k/u2.test} to the one you obtained in Milestone 1. What is the speedup of your new implementation (as a ratio of $\frac{\textit{average time}_{old}}{\textit{average time}_{new}}$)? Use the same machine to measure the time for both versions and provide the answer in your report.
}
\newcommand{\BRTwoThree}{
Also ensure your implementation works with \texttt{data/ml-1m/rb.train} and \texttt{data/ml-1m/rb.test} since you will reuse it in the next questions.
}
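As a refresher of the predictor being reimplemented, here is a minimal, Breeze-free sketch of a user-based k-NN predictor in plain Scala. The object name and toy data are illustrative, and the plain mean-deviation below stands in for Milestone 1's exact scaled deviation; this is not the template's API:

```scala
object KnnSketch {
  // ratings: userId -> (itemId -> rating)
  type Ratings = Map[Int, Map[Int, Double]]

  def userAvg(rs: Map[Int, Double]): Double = rs.values.sum / rs.size

  // Deviation of each rating from the user's average (Milestone 1 uses a
  // scaled deviation; the plain deviation keeps this sketch short).
  def normalized(ratings: Ratings): Ratings =
    ratings.map { case (u, rs) =>
      val avg = userAvg(rs)
      u -> rs.map { case (i, r) => i -> (r - avg) }
    }

  // Cosine similarity between two users' normalized rating vectors.
  def cosine(a: Map[Int, Double], b: Map[Int, Double]): Double = {
    val dot = a.keySet.intersect(b.keySet).toSeq.map(i => a(i) * b(i)).sum
    val na = math.sqrt(a.values.map(x => x * x).sum)
    val nb = math.sqrt(b.values.map(x => x * x).sum)
    if (na == 0.0 || nb == 0.0) 0.0 else dot / (na * nb)
  }

  // Predict the rating of user u for item i from the k most similar
  // users that rated i.
  def predict(ratings: Ratings, u: Int, i: Int, k: Int): Double = {
    val norm = normalized(ratings)
    val neighbours = ratings.keys
      .filter(v => v != u && ratings(v).contains(i))
      .map(v => (cosine(norm(u), norm(v)), norm(v)(i)))
      .toSeq.sortBy(-_._1).take(k)
    val num = neighbours.map { case (s, d) => s * d }.sum
    val den = neighbours.map { case (s, _) => math.abs(s) }.sum
    userAvg(ratings(u)) + (if (den == 0.0) 0.0 else num / den)
  }
}
```

In the actual BR.1 implementation, the same structure is expressed with \texttt{CSCMatrix} operations instead of per-user maps.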
% Parallel Exact Knn
\newcommand{\EKOne}{
Test your parallel implementation of k-NN for correctness with two workers. Using $k=10$ and \texttt{data/ml-100k/u2.base} for training, output the similarities between: (1) user $1$ and itself; (2) user $1$ and user $864$; (3) user $1$ and user $886$. Still using $k=10$, output the prediction for user 1 and item 1 ($p_{1,1}$), the prediction for user 327 and item 2 ($p_{327,2}$), and make sure that you obtain an MAE of $0.8287 \pm 0.0001$ on \texttt{data/ml-100k/u2.test}.
}
\newcommand{\EKTwo}{
Measure and report the combined \textit{k-NN} and \textit{prediction} time when using 1, 2, 4 workers, $k=300$, and \texttt{ml-1m/rb.train} for training and \texttt{ml-1m/rb.test} for test, on the cluster (or a machine with at least 4 physical cores). Perform 3 measurements for each experiment and report the average and standard-deviation total time, including training, making predictions, and computing the MAE. Do you observe a speedup? Does this speedup grow linearly with the number of executors, i.e. is the running time $X$ times faster when using $X$ executors compared to using a single executor? Answer both questions in your report.
}
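The averages and standard deviations requested above can be computed with a small helper; a sketch in plain Scala (the names are illustrative, and how you obtain the raw timings is up to your implementation):

```scala
object TimingStats {
  // Mean and (population) standard deviation of a series of timings.
  def meanStd(xs: Seq[Double]): (Double, Double) = {
    val mean = xs.sum / xs.size
    val variance = xs.map(x => (x - mean) * (x - mean)).sum / xs.size
    (mean, math.sqrt(variance))
  }

  // Speedup as the ratio used throughout the milestone:
  // average old time over average new time.
  def speedup(oldTimes: Seq[Double], newTimes: Seq[Double]): Double =
    meanStd(oldTimes)._1 / meanStd(newTimes)._1
}
```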
% Approximate Knn
\newcommand{\AKOne}{
Implement the approximate k-NN using your previous breeze implementation and Spark's RDDs. Using the partitioner of the template with 10 partitions and 2 replications, $k=10$, and \texttt{data/ml-100k/u2.base} for training, output the similarities of the approximate k-NN between user $1$ and the following users: $1,864,344,16,334,2$.
}
\newcommand{\AKTwo}{
Vary the number of partitions in which a given user appears. For the \texttt{data/ml-100k/u2.base} training set, partitioned equally between 10 workers, report the relationship between the level of replication (1,2,3,4,6,8) and the MAE you obtain on the \texttt{data/ml-100k/u2.test} test set. What is the minimum level of replication such that the MAE is still lower than the baseline predictor of Milestone 1 (MAE of 0.7604), when using $k=300$? Does this reduce the number of similarity computations compared to an exact k-NN? What is the ratio? Answer both questions in your report.
}
\newcommand{\AKThree}{
Measure and report the time required by your approximate \textit{k-NN} implementation, including both training on \texttt{data/ml-1m/rb.train} and computing the MAE on the test set \texttt{data/ml-1m/rb.test}, using $k=300$ on 8 partitions with a replication factor of 1 when using 1, 2, 4 workers. Perform each experiment 3 times and report the average and standard-deviation. Do you observe a speedup compared to the parallel (exact) k-NN with replicated ratings for the same number of workers?
}
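To illustrate the replication scheme referred to above, here is a sketch of a partitioner that assigns each user to a fixed number of distinct partitions at random. The signature is hypothetical and simplified; the template's \texttt{partitionUsers} may differ:

```scala
import scala.util.Random

object PartitionSketch {
  // Assign each of nbUsers users to `replication` distinct partitions,
  // returning, for each partition, the set of users it holds.
  def partitionUsers(nbUsers: Int, nbPartitions: Int,
                     replication: Int, seed: Long = 42L): Seq[Set[Int]] = {
    require(replication <= nbPartitions)
    val rng = new Random(seed)
    val builders = Array.fill(nbPartitions)(Set.newBuilder[Int])
    for (u <- 0 until nbUsers) {
      // pick `replication` distinct partitions for user u
      rng.shuffle((0 until nbPartitions).toList)
         .take(replication)
         .foreach(p => builders(p) += u)
    }
    builders.map(_.result()).toSeq
  }
}
```

With such a scheme, similarities are only computed between users that share a partition, which is why a low replication level reduces the number of similarity computations compared to the exact k-NN.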
% Economics
\newcommand{\EOne}{
What is the minimum number of days of renting to make buying the ICC.M7 less expensive, excluding any operating costs such as electricity and maintenance? Round up to the nearest integer.
}
\newcommand{\ETwoOne}{
After how many days of renting a container does the cost exceed that of buying and running 4 Raspberry Pis? To obtain a likely range, compute it (1) assuming, optimistically, no maintenance and minimum power usage for the RPis, and (2) assuming no maintenance and maximum power usage for the RPis. (Round up to the nearest integer in each case.)
}
\newcommand{\ETwoTwo}{
Assume the container has a single processor and the same total amount of RAM as the 4 Raspberry Pis. Also provide unrounded intermediary results for (1) Container Daily Cost, (2) 4 RPis (Idle) Daily Electricity Cost, (3) 4 RPis (Computing) Daily Electricity Cost.
}
\newcommand{\EThree}{
For the same buying price as an ICC.M7, how many Raspberry Pis can you get (floor the result to remove the decimal)? Assuming perfect scaling, would you obtain a larger overall throughput and RAM from these? If so, by how much? Compute the ratios using the previous floored number of RPis, but do not round the final results.
}
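The Economics answers reduce to simple break-even and daily-cost computations; a sketch with placeholder constants (the prices below are made up for illustration — use the actual constants given in the Milestone description):

```scala
object EconomicsSketch {
  // Placeholder constants, NOT the Milestone's actual prices.
  val BuyPriceICCM7 = 35000.0 // CHF, hypothetical
  val DailyRent     = 20.0    // CHF per day, hypothetical

  // E.1-style break-even: minimum whole number of renting days after
  // which buying is less expensive, rounded up.
  def minRentingDays(buyPrice: Double, dailyRent: Double): Int =
    math.ceil(buyPrice / dailyRent).toInt

  // E.2-style daily electricity cost: watts -> kWh per day -> cost per day.
  def dailyElectricityCost(watts: Double, costPerKwh: Double): Double =
    watts / 1000.0 * 24.0 * costPerKwh
}
```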
# Milestone Description
[Milestone-2.pdf](./Milestone-2.pdf)
Note: Section 'Updates' lists the updates since the original release of the Milestone.
Mu has prepared a report template for your convenience here: [Report Template](./Milestone-2-QA-template.tex).
# Dependencies
````
sbt >= 1.4.7
openjdk@8
````
Should be available by default on ````iccluster028.iccluster.epfl.ch````. Otherwise, refer to each project's installation instructions. Prefer working locally on your own machine: you will have less interference in your measurements from other students.
If you work on ````iccluster028.iccluster.epfl.ch````, you need to extend the default PATH by adding the following line to ````~/.bashrc````:
````
export PATH=$PATH:/opt/sbt/sbt/bin
````
If you have multiple installations of openjdk, you need to specify the one to use as JAVA_HOME, e.g. on OSX with
openjdk@8 installed through Homebrew, you would do:
````
export JAVA_HOME="/usr/local/Cellar/openjdk@8/1.8.0+282";
````
# Dataset
Download [data-m2.zip](https://gitlab.epfl.ch/sacs/cs-449-sds-public/project/dataset/-/raw/main/data-m2.zip).
Unzip:
````
> mkdir data
> cd data
> unzip data-m2.zip
````
It should unzip into ````data/```` by default. If not, manually move ````ml-100k```` and ````ml-1m```` into ````data/````.
# Repository Structure
````src/main/scala/shared/predictions.scala````:
All the functionalities of your code for all questions should be defined there.
This code should then be used in the following applications and tests.
## Applications
1. ````src/main/scala/optimizing/Optimizing.scala````: Output answers to questions **BR.X**.
2. ````src/main/scala/distributed/Exact.scala````: Output answers to questions **EK.X**.
3. ````src/main/scala/distributed/Approximate.scala````: Output answers to questions **AK.X**.
4. ````src/main/scala/economics/Economics.scala````: Output answers to questions **E.X**.
Applications are separate from tests to make it easier to test with different
inputs and permit outputting your answers and timings in JSON format for easier
grading.
## Unit Tests
Corresponding unit tests for each application (except Economics.scala):
````
src/test/scala/optimizing/OptimizingTests.scala
src/test/scala/distributed/ExactTests.scala
src/test/scala/distributed/ApproximateTests.scala
````
Your tests should demonstrate how to call your code to obtain the answers of
the applications, and should make exactly the same calls as the applications
above. This structure intentionally encourages you to put as little
functionality as possible in the applications themselves. It also gives the
TAs a clear and regular structure against which to check correctness.
# Usage
## Execute unit tests
````
sbt "testOnly test.AllTests"
````
You should fill all tests and ensure they all succeed prior to submission.
## Run applications
### Optimizing
````
sbt "runMain scaling.Optimizing --train data/ml-100k/u2.base --test data/ml-100k/u2.test --json optimizing-100k.json --master local[1] --users 943 --movies 1682"
````
### Parallel Exact KNN
````
sbt "runMain distributed.Exact --train data/ml-100k/u2.base --test data/ml-100k/u2.test --json exact-100k-4.json --k 10 --master local[4] --users 943 --movies 1682"
````
### Approximate KNN
````
sbt "runMain distributed.Approximate --train data/ml-100k/u2.base --test data/ml-100k/u2.test --json approximate-100k-4-k10-r2.json --k 10 --master local[4] --users 943 --movies 1682 --partitions 10 --replication 2"
````
### Economics
````
sbt "runMain economics.Economics --json economics.json"
````
## Time applications
For all the previous applications, you can set the number of measurements for timings by adding the following option ````--num_measurements X```` where X is an integer. The default value is ````0````.
## IC Cluster
Test your application locally as much as possible and only test on the iccluster
once everything works, to keep the cluster and the driver node maximally available
for other students.
### Assemble Application for Spark Submit
````sbt clean````: clean up temporary files and previous assembly packages.
````sbt assembly````: create a new jar
````target/scala-2.11/m2_yourid-assembly-1.0.jar```` that can be used with
````spark-submit````.
Prefer assembling your application locally and uploading the resulting jar
before running it on the cluster.
### Upload jar on Cluster
````
scp target/scala-2.11/m2_yourid-assembly-1.0.jar <username>@iccluster028.iccluster.epfl.ch:~
````
### Run on Cluster
See [config.sh](./config.sh) for HDFS paths to pre-uploaded train and test datasets to replace TRAIN and TEST, like in the example commands below:
#### When using ML-100k
````
spark-submit --class distributed.Exact --master yarn --conf "spark.dynamicAllocation.enabled=false" --num-executors 1 m2_yourid-assembly-1.0.jar --json exact-100k-1.json --train $ML100Ku2base --test $ML100Ku2test
````
#### When using ML-1m
````
spark-submit --class distributed.Exact --master yarn --conf "spark.dynamicAllocation.enabled=false" --num-executors 1 m2_yourid-assembly-1.0.jar --json exact-1m-1.json --train $ML1Mrbtrain --test $ML1Mrbtest --separator :: --k 300 --users 6040 --movies 3952
````
To keep results obtained with different parameters in different .json files, simply modify the value passed to the corresponding parameter (```--json```). For instance, with ```--num-executors 4```, use ```--json exact-1m-4.json```.
Note that when changing from ML-100k to ML-1M, the parameter ```--separator ::``` should be added, and the number of users and movies should be modified.
## Grading scripts
We will use the following scripts to grade your submission:
1. ````./test.sh````: Run all unit tests.
2. ````./run.sh````: Run all applications without timing measurements.
3. ````./time.sh````: Run all timing measurements.
All scripts will produce execution logs in the ````logs````
directory, including answers produced in the JSON format. Logs directories are
in the format ````logs/<scriptname>-<datetime>-<machine>/```` and include at
least an execution log ````log.txt```` as well as possible JSON outputs from
applications.
Ensure all scripts run correctly locally before submitting.
## Submission
Steps:
1. Update the ````name```` and ````maintainer```` fields of ````build.sbt```` with the correct Milestone number, your ID, and your email.
2. Ensure you only used the dependencies listed in ````build.sbt```` in this template, and did not add any others.
3. Test that all previous commands correctly produce a JSON file (after downloading/reinstalling dependencies).
4. Remove the ml-100k dataset (````data/ml-100k.zip```` and ````data/ml-100k````), as well as ````project/project````, ````project/target````, and ````target/````.
5. Add your report and any other necessary files listed in the Milestone description (see ````Deliverables````).
6. Create a zip archive with all your code within ````src/````, as well as your report: ````zip sciper1-sciper2.zip -r src/ report.pdf````
7. Submit ````sciper1-sciper2.zip```` to the TA for grading on https://cs449-submissions.epfl.ch:8083/m2, using the passcode you have previously received by email.
# References
Scallop Argument Parsing: https://github.com/scallop/scallop/wiki
Spark Resilient Distributed Dataset (RDD): https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/rdd/RDD.html
JSON Serialization: https://github.com/json4s/json4s#serialization
# Credits
Erick Lavoie (Design, Implementation, Tests)
name := "m2_yourid"
version := "1.0"
maintainer := "your.name@epfl.ch"
libraryDependencies += "org.rogach" %% "scallop" % "4.0.2"
libraryDependencies += "org.json4s" %% "json4s-jackson" % "3.6.10"
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.4.7"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.7"
libraryDependencies += "org.scalanlp" %% "breeze" % "0.13.2"
libraryDependencies += "org.scalanlp" %% "breeze-natives" % "0.13.2"
libraryDependencies += "org.scalatest" %% "scalatest" % "3.2.0" % Test
libraryDependencies += "com.lihaoyi" %% "ujson" % "1.5.0"
scalaVersion in ThisBuild := "2.11.12"
enablePlugins(JavaAppPackaging)
logBuffered in Test := false
test in assembly := {}
assemblyMergeStrategy in assembly := {
case PathList("META-INF", xs @ _*) => MergeStrategy.discard
case x => MergeStrategy.first
}
if [ $(hostname) == 'iccluster028' ];
then
ICCLUSTER=hdfs://iccluster028.iccluster.epfl.ch:8020
export ML100Ku2base=$ICCLUSTER/cs449/data/ml-100k/u2.base;
export ML100Ku2test=$ICCLUSTER/cs449/data/ml-100k/u2.test;
export ML100Kudata=$ICCLUSTER/cs449/data/ml-100k/u.data;
export ML1Mrbtrain=$ICCLUSTER/cs449/data/ml-1m/rb.train;
export ML1Mrbtest=$ICCLUSTER/cs449/data/ml-1m/rb.test;
export SPARKMASTER='yarn'
else
export ML100Ku2base=data/ml-100k/u2.base;
export ML100Ku2test=data/ml-100k/u2.test;
export ML100Kudata=data/ml-100k/u.data;
export ML1Mrbtrain=data/ml-1m/rb.train;
export ML1Mrbtest=data/ml-1m/rb.test;
export SPARKMASTER='local[4]'
fi;
id,title,rating
1,Toy Story (1995),5
2,GoldenEye (1995),3
3,Four Rooms (1995),
4,Get Shorty (1995),
5,Copycat (1995),
19,Antonia's Line (1995),
20,Angels and Insects (1995),
21,Muppet Treasure Island (1996),
22,Braveheart (1995),3
23,Taxi Driver (1976),
24,Rumble in the Bronx (1995),
25,Birdcage,
26,Brothers McMullen,
27,Bad Boys (1995),
28,Apollo 13 (1995),3
29,Batman Forever (1995),
30,Belle de jour (1967),
31,Crimson Tide (1995),
47,Ed Wood (1994),
48,Hoop Dreams (1994),
49,I.Q. (1994),
50,Star Wars (1977),4
51,Legends of the Fall (1994),
52,Madness of King George,
53,Natural Born Killers (1994),
54,Outbreak (1995),
55,Professional,
56,Pulp Fiction (1994),5
57,Priest (1994),
58,Quiz Show (1994),
59,Three Colors: Red (1994),
61,Three Colors: White (1994),
62,Stargate (1994),
63,Santa Clause,
64,Shawshank Redemption,5
65,What's Eating Gilbert Grape (1993),
66,While You Were Sleeping (1995),
67,Ace Ventura: Pet Detective (1994),
68,Crow,
69,Forrest Gump (1994),5
70,Four Weddings and a Funeral (1994),
71,Lion King,5
72,Mask,
73,Maverick (1994),
74,Faster Pussycat! Kill! Kill! (1965),
79,Fugitive,
80,Hot Shots! Part Deux (1993),
81,Hudsucker Proxy,
82,Jurassic Park (1993),3
83,Much Ado About Nothing (1993),
84,Robert A. Heinlein's The Puppet Masters (1994),
85,Ref,
86,Remains of the Day,
87,Searching for Bobby Fischer (1993),
88,Sleepless in Seattle (1993),
89,Blade Runner (1982),3
90,So I Married an Axe Murderer (1993),
91,Nightmare Before Christmas,
92,True Romance (1993),
93,Welcome to the Dollhouse (1995),
94,Home Alone (1990),
95,Aladdin (1992),4
96,Terminator 2: Judgment Day (1991),5
97,Dances with Wolves (1990),
98,Silence of the Lambs,
99,Snow White and the Seven Dwarfs (1937),1
100,Fargo (1996),
101,Heavy Metal (1981),
102,Aristocats,
118,Twister (1996),
119,Maya Lin: A Strong Clear Vision (1994),
120,Striptease (1996),
121,Independence Day (ID4) (1996),1
122,Cable Guy,1
123,Frighteners,
124,Lone Star (1996),
125,Phenomenon (1996),
126,Spitfire Grill,
127,Godfather,5
128,Supercop (1992),
129,Bound (1996),
130,Kansas City (1996),
135,2001: A Space Odyssey (1968),
136,Mr. Smith Goes to Washington (1939),
137,Big Night (1996),
138,D3: The Mighty Ducks (1996),2
139,Love Bug,
140,Homeward Bound: The Incredible Journey (1993),
141,20,
142,Bedknobs and Broomsticks (1971),
143,Sound of Music,4
144,Die Hard (1988),3
145,Lawnmower Man,
146,Unhook the Stars (1996),
147,Long Kiss Goodnight,
148,Ghost and the Darkness,
149,Jude (1996),
150,Swingers (1996),
151,Willy Wonka and the Chocolate Factory (1971),4
152,Sleeper (1973),
153,Fish Called Wanda,
154,Monty Python's Life of Brian (1979),5
155,Dirty Dancing (1987),3
156,Reservoir Dogs (1992),5
157,Platoon (1986),
158,Weekend at Bernie's (1989),
159,Basic Instinct (1992),
160,Glengarry Glen Ross (1992),
161,Top Gun (1986),3
162,On Golden Pond (1981),
163,Return of the Pink Panther,4
164,Abyss,2
165,Jean de Florette (1986),
166,Manon of the Spring (Manon des sources) (1986),
167,Private Benjamin (1980),
168,Monty Python and the Holy Grail (1974),5
169,Wrong Trousers,
170,Cinema Paradiso (1988),
171,Delicatessen (1991),
172,Empire Strikes Back,2
173,Princess Bride,
174,Raiders of the Lost Ark (1981),
175,Brazil (1985),
176,Aliens (1986),
177,"The Good the Bad and the Ugly",5
178,12 Angry Men (1957),
179,Clockwork Orange,4
180,Apocalypse Now (1979),3
181,Return of the Jedi (1983),3
182,GoodFellas (1990),
183,Alien (1979),5
184,Army of Darkness (1993),2
185,Psycho (1960),4
186,Blues Brothers,
187,Godfather: Part II,4
188,Full Metal Jacket (1987),5
189,Grand Day Out,
190,Henry V (1989),
191,Amadeus (1984),3
192,Raging Bull (1980),
193,Right Stuff,
194,Sting,
195,Terminator,
196,Dead Poets Society (1989),5
197,Graduate,
198,Nikita (La Femme Nikita) (1990),
199,Bridge on the River Kwai,
200,Shining,
201,Evil Dead II (1987),
202,Groundhog Day (1993),5
203,Unforgiven (1992),
204,Back to the Future (1985),2
205,Patton (1970),
206,Akira (1988),
207,Cyrano de Bergerac (1990),
211,M*A*S*H (1970),
212,Unbearable Lightness of Being,
213,Room with a View,
214,Pink Floyd - The Wall (1982),4
215,Field of Dreams (1989),
216,When Harry Met Sally... (1989),
217,Bram Stoker's Dracula (1992),
232,Young Guns (1988),
233,Under Siege (1992),
234,Jaws (1975),
235,Mars Attacks! (1996),2
236,Citizen Ruth (1996),
237,Jerry Maguire (1996),3
238,Raising Arizona (1987),
239,Sneakers (1992),
240,Beavis and Butt-head Do America (1996),
247,Turbo: A Power Rangers Movie (1997),
248,Grosse Pointe Blank (1997),
249,Austin Powers: International Man of Mystery (1997),
250,Fifth Element,2
251,Shall We Dance? (1996),
252,Lost World: Jurassic Park,
253,Pillow Book,
254,Batman & Robin (1997),
255,My Best Friend's Wedding (1997),
256,When the Cats Away (Chacun cherche son chat) (1996),
257,Men in Black (1997),2
258,Contact (1997),3
259,George of the Jungle (1997),3
260,Event Horizon (1997),
261,Air Bud (1997),
262,In the Company of Men (1997),
267,unknown,
268,Chasing Amy (1997),
269,Full Monty,
270,Gattaca (1997),3
271,Starship Troopers (1997),
272,Good Will Hunting (1997),5
273,Heat (1995),
274,Sabrina (1995),
275,Sense and Sensibility (1995),
291,Absolute Power (1997),
292,Rosewood (1997),
293,Donnie Brasco (1997),
294,Liar Liar (1997),2
295,Breakdown (1997),
296,Promesse,
297,Ulee's Gold (1997),
310,Rainmaker,
311,Wings of the Dove,
312,Midnight in the Garden of Good and Evil (1997),
313,Titanic (1997),3
314,3 Ninjas: High Noon At Mega Mountain (1998),
315,Apt Pupil (1998),
316,As Good As It Gets (1997),
359,Assignment,
360,Wonderland (1997),
361,Incognito (1997),
362,Blues Brothers 2000 (1998),1
363,Sudden Death (1995),
364,Ace Ventura: When Nature Calls (1995),1
365,Powder (1995),
366,Dangerous Minds (1995),
367,Clueless (1995),
371,Bridges of Madison County,
372,Jeffrey (1995),
373,Judge Dredd (1995),
374,Mighty Morphin Power Rangers: The Movie (1995),1
375,Showgirls (1995),
376,Houseguest (1994),
377,Heavyweights (1994),
381,Muriel's Wedding (1994),
382,Adventures of Priscilla,
383,Flintstones,
384,Naked Gun 33 1/3: The Final Insult (1994),3
385,True Lies (1994),2
386,Addams Family Values (1993),
387,Age of Innocence,
388,Beverly Hills Cop III (1994),
395,Robin Hood: Men in Tights (1993),
396,Serial Mom (1994),
397,Striking Distance (1993),
398,Super Mario Bros. (1993),1
399,Three Musketeers,
400,Little Rascals,
401,Brady Bunch Movie,
402,Ghost (1990),
403,Batman (1989),
404,Pinocchio (1940),
405,Mission: Impossible (1996),3
406,Thinner (1996),
407,Spy Hard (1996),
408,Close Shave,
428,Harold and Maude (1971),
429,Day the Earth Stood Still,
430,Duck Soup (1933),
431,Highlander (1986),2
432,Fantasia (1940),
433,Heathers (1989),
434,Forbidden Planet (1956),
498,African Queen,
499,Cat on a Hot Tin Roof (1958),
500,Fly Away Home (1996),
501,Dumbo (1941),3
502,Bananas (1971),
503,Candidate,
504,Bonnie and Clyde (1967),
538,Anastasia (1997),
539,Mouse Hunt (1997),
540,Money Train (1995),
541,Mortal Kombat (1995),1
542,Pocahontas (1995),
543,Misérables,
544,Things to Do in Denver when You're Dead (1995),
669,Body Parts (1991),
670,Body Snatchers (1993),
671,Bride of Frankenstein (1935),
672,Candyman (1992),1
673,Cape Fear (1962),
674,Cat People (1982),
675,Nosferatu (Nosferatu,
676,Crucible,
677,Fire on the Mountain (1996),
678,Volcano (1997),
679,Conan the Barbarian (1981),5
680,Kull the Conqueror (1997),
681,Wishmaster (1997),
682,I Know What You Did Last Summer (1997),
688,Leave It to Beaver (1997),
689,Jackal,
690,Seven Years in Tibet (1997),
691,Dark City (1998),3
692,American President,
693,Casino (1995),
694,Persuasion (1995),
735,Philadelphia (1993),
736,Shadowlands (1993),
737,Sirens (1994),
738,Threesome (1994),1
739,Pretty Woman (1990),
740,Jane Eyre (1996),
741,Last Supper,
747,Benny & Joon (1993),
748,Saint,
749,MatchMaker,
750,Amistad (1997),4
751,Tomorrow Never Dies (1997),
752,Replacement Killers,
753,Burnt By the Sun (1994),
765,Boomerang (1992),
766,Man of the Year (1995),
767,Addiction,
768,Casper (1995),1
769,Congo (1995),
770,Devil in a Blue Dress (1995),
771,Johnny Mnemonic (1995),2
772,Kids (1995),
773,Mute Witness (1994),
774,Prophecy,
899,Winter Guest,
900,Kundun (1997),
901,Mr. Magoo (1997),
902,Big Lebowski,3
903,Afterglow (1997),
904,Ma vie en rose (My Life in Pink) (1997),
905,Great Expectations (1998),
1062,Four Days in September (1997),
1063,Little Princess,
1064,Crossfire (1947),
1065,Koyaanisqatsi (1983),4
1066,Balto (1995),
1067,Bottle Rocket (1996),
1068,Star Maker,
1124,Farewell to Arms,
1125,Innocents,
1126,Old Man and the Sea,
1127,Truman Show,1
1128,Heidi Fleiss: Hollywood Madam (1995),
1129,Chungking Express (1994),
1130,Jupiter's Wife (1994),
1136,Ghosts of Mississippi (1996),
1137,Beautiful Thing (1996),
1138,Best Men (1997),
1139,Hackers (1995),2
1140,Road to Wellville,
1141,War Room,
1142,When We Were Kings (1996),
1232,Madonna: Truth or Dare (1991),
1233,Nénette et Boni (1996),
1234,Chairman of the Board (1998),
1235,Big Bang Theory,1
1236,Other Voices,
1237,Twisted (1996),
1238,Full Speed (1996),
1239,Cutthroat Island (1995),
1240,Ghost in the Shell (Kokaku kidotai) (1995),5
1241,Van,
1242,Old Lady Who Walked in the Sea,
1243,Night Flier (1997),
This directory should hold execution results.
addSbtPlugin("com.typesafe.sbt" % "sbt-native-packager" % "1.7.4")
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.15.0")
#!/usr/bin/env bash
# If your default java install does not work, explicitly
# provide the path to the JDK 1.8 installation. On OSX
# with homebrew:
# export JAVA_HOME=/usr/local/Cellar/openjdk@8/1.8.0+282; ./run.sh
export JAVA_OPTS="-Xmx8G";
RUN=./logs/run-$(date "+%Y-%m-%d-%H:%M:%S")-$(hostname)
mkdir -p $RUN
LOGS=$RUN/log.txt
source ./config.sh
echo "------------------- OPTIMIZING ---------------------" >> $LOGS
sbt "runMain scaling.Optimizing --train $ML100Ku2base --test $ML100Ku2test --json $RUN/optimizing-100k.json --users 943 --movies 1682 --master local[1]" 2>&1 >>$LOGS
echo "------------------- DISTRIBUTED EXACT ---------------------" >> $LOGS
sbt "runMain distributed.Exact --train $ML100Ku2base --test $ML100Ku2test --json $RUN/exact-100k-4.json --k 10 --master local[4] --users 943 --movies 1682" 2>&1 >>$LOGS
sbt "runMain distributed.Exact --train $ML1Mrbtrain --test $ML1Mrbtest --separator :: --json $RUN/exact-1m-4.json --k 300 --master local[4] --users 6040 --movies 3952" 2>&1 >>$LOGS
echo "------------------- DISTRIBUTED APPROXIMATE ---------------------" >> $LOGS
sbt "runMain distributed.Approximate --train $ML100Ku2base --test $ML100Ku2test --json $RUN/approximate-100k-4-k10-r2.json --k 10 --master local[4] --users 943 --movies 1682 --partitions 10 --replication 2" 2>&1 >>$LOGS;
for R in 1 2 3 4 6 8; do
sbt "runMain distributed.Approximate --train $ML100Ku2base --test $ML100Ku2test --json $RUN/approximate-100k-4-k300-r$R.json --k 300 --master local[4] --users 943 --movies 1682 --partitions 10 --replication $R" 2>&1 >>$LOGS;
done
sbt "runMain distributed.Approximate --train $ML1Mrbtrain --test $ML1Mrbtest --separator :: --json $RUN/approximate-1m-4.json --k 300 --master local[4] --users 6040 --movies 3952 --partitions 8 --replication 1" 2>&1 >>$LOGS
echo "------------------- ECONOMICS -----------------------------------" >> $LOGS
sbt "runMain economics.Economics --json $RUN/economics.json" 2>&1 >>$LOGS
import org.rogach.scallop._
import org.apache.log4j.Logger
import org.apache.log4j.Level
import breeze.linalg._
import breeze.numerics._
import scala.io.Source
import scala.collection.mutable.ArrayBuffer
import ujson._
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import shared.predictions._
package distributed {
class Conf(arguments: Seq[String]) extends ScallopConf(arguments) {
val train = opt[String](required = true)
val test = opt[String](required = true)
val k = opt[Int]()
val json = opt[String]()
val users = opt[Int]()
val movies = opt[Int]()
val separator = opt[String](default=Some("\t"))
val replication = opt[Int](default=Some(1))
val partitions = opt[Int](default=Some(1))
val master = opt[String]()
val num_measurements = opt[Int](default=Some(1))
verify()
}
object Approximate {
def main(args: Array[String]) {
var conf = new Conf(args)
// Remove these lines if you need Spark's own log output while debugging
Logger.getLogger("org").setLevel(Level.OFF)
Logger.getLogger("akka").setLevel(Level.OFF)
val spark = conf.master.toOption match {
case None => SparkSession.builder().getOrCreate();
case Some(master) => SparkSession.builder().master(master).getOrCreate();
}
val sc = spark.sparkContext
println("")
println("******************************************************")
// The conf object is not serializable; extract the values here so that only
// these plain values are captured by the closures of the parallel implementations
val conf_users = conf.users()
val conf_movies = conf.movies()
val conf_k = conf.k()
println("Loading training data")
val train = loadSpark(sc, conf.train(), conf.separator(), conf.users(), conf.movies())
val test = loadSpark(sc, conf.test(), conf.separator(), conf.users(), conf.movies())
var knn : CSCMatrix[Double] = null
println("Partitioning users")
var partitionedUsers : Seq[Set[Int]] = partitionUsers(
conf.users(),
conf.partitions(),
conf.replication()
)
val measurements = (1 to scala.math.max(1,conf.num_measurements()))
.map(_ => timingInMs( () => {
// Use partitionedUsers here
0.0
}))
val mae = measurements(0)._1
val timings = measurements.map(_._2)
// Save answers as JSON
def printToFile(content: String,
location: String = "./answers.json") =
Some(new java.io.PrintWriter(location)).foreach{
f => try{
f.write(content)
} finally{ f.close }
}
conf.json.toOption match {
case None => ;
case Some(jsonFile) => {
val answers = ujson.Obj(
"Meta" -> ujson.Obj(
"train" -> ujson.Str(conf.train()),
"test" -> ujson.Str(conf.test()),
"k" -> ujson.Num(conf.k()),
"users" -> ujson.Num(conf.users()),
"movies" -> ujson.Num(conf.movies()),
"master" -> ujson.Str(sc.getConf.get("spark.master")),
"num-executors" -> ujson.Str(if (sc.getConf.contains("spark.executor.instances"))
sc.getConf.get("spark.executor.instances")
else
""),
"num_measurements" -> ujson.Num(conf.num_measurements()),
"partitions" -> ujson.Num(conf.partitions()),
"replication" -> ujson.Num(conf.replication())
),
"AK.1" -> ujson.Obj(
"knn_u1v1" -> ujson.Num(0.0),
"knn_u1v864" -> ujson.Num(0.0),
"knn_u1v344" -> ujson.Num(0.0),
"knn_u1v16" -> ujson.Num(0.0),
"knn_u1v334" -> ujson.Num(0.0),
"knn_u1v2" -> ujson.Num(0.0)
),
"AK.2" -> ujson.Obj(
"mae" -> ujson.Num(mae)
),
"AK.3" -> ujson.Obj(
"average (ms)" -> ujson.Num(mean(timings)),
"stddev (ms)" -> ujson.Num(std(timings))
)
)
val json = write(answers, 4)
println(json)
println("Saving answers in: " + jsonFile)
printToFile(json, jsonFile)
}
}
println("")
spark.stop()
}
}
}
import org.rogach.scallop._
import org.apache.log4j.Logger
import org.apache.log4j.Level
import breeze.linalg._
import breeze.numerics._
import scala.io.Source
import scala.collection.mutable.ArrayBuffer
import ujson._
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import shared.predictions._
package distributed {
class ExactConf(arguments: Seq[String]) extends ScallopConf(arguments) {
val train = opt[String](required = true)
val test = opt[String](required = true)
val k = opt[Int](default=Some(10))
val json = opt[String]()
val users = opt[Int]()
val movies = opt[Int]()
val separator = opt[String](default=Some("\t"))
val master = opt[String]()
val num_measurements = opt[Int](default=Some(1))
verify()
}
object Exact {
def main(args: Array[String]) {
var conf = new ExactConf(args)
// Remove these lines if you need Spark's own log output while debugging
Logger.getLogger("org").setLevel(Level.OFF)
Logger.getLogger("akka").setLevel(Level.OFF)
val spark = conf.master.toOption match {
case None => SparkSession.builder().getOrCreate();
case Some(master) => SparkSession.builder().master(master).getOrCreate();
}
spark.sparkContext.setLogLevel("ERROR")
val sc = spark.sparkContext
println("")
println("******************************************************")
// The conf object is not serializable; extract the values here so that only
// these plain values are captured by the closures of the parallel implementations
val conf_users = conf.users()
val conf_movies = conf.movies()
val conf_k = conf.k()
println("Loading training data from: " + conf.train())
val train = loadSpark(sc, conf.train(), conf.separator(), conf.users(), conf.movies())
val test = loadSpark(sc, conf.test(), conf.separator(), conf.users(), conf.movies())
val measurements = (1 to scala.math.max(1,conf.num_measurements())).map(_ => timingInMs( () => {
0.0
}))
val timings = measurements.map(_._2)
// Save answers as JSON
def printToFile(content: String,
location: String = "./answers.json") =
Some(new java.io.PrintWriter(location)).foreach{
f => try{
f.write(content)
} finally{ f.close }
}
conf.json.toOption match {
case None => ;
case Some(jsonFile) => {
val answers = ujson.Obj(
"Meta" -> ujson.Obj(
"train" -> ujson.Str(conf.train()),
"test" -> ujson.Str(conf.test()),
"k" -> ujson.Num(conf.k()),
"users" -> ujson.Num(conf.users()),
"movies" -> ujson.Num(conf.movies()),
"master" -> ujson.Str(sc.getConf.get("spark.master")),
"num-executors" -> ujson.Str(if (sc.getConf.contains("spark.executor.instances"))
sc.getConf.get("spark.executor.instances")
else
""),
"num_measurements" -> ujson.Num(conf.num_measurements())
),
"EK.1" -> ujson.Obj(
"1.knn_u1v1" -> ujson.Num(0.0),
"2.knn_u1v864" -> ujson.Num(0.0),
"3.knn_u1v886" -> ujson.Num(0.0),
"4.PredUser1Item1" -> ujson.Num(0.0),
"5.PredUser327Item2" -> ujson.Num(0.0),
"6.Mae" -> ujson.Num(0.0)
),
"EK.2" -> ujson.Obj(
"average (ms)" -> ujson.Num(mean(timings)), // Datatype of answer: Double
"stddev (ms)" -> ujson.Num(std(timings)) // Datatype of answer: Double
)
)
val json = write(answers, 4)
println(json)
println("Saving answers in: " + jsonFile)
printToFile(json, jsonFile)
}
}
println("")
spark.stop()
}
}
}
import org.rogach.scallop._
import breeze.linalg._
import breeze.numerics._
import scala.io.Source
import scala.collection.mutable.ArrayBuffer
import ujson._
package economics {
class Conf(arguments: Seq[String]) extends ScallopConf(arguments) {
val json = opt[String]()
verify()
}
object Economics {
def main(args: Array[String]) {
println("")
println("******************************************************")
var conf = new Conf(args)
// Save answers as JSON
def printToFile(content: String,
location: String = "./answers.json") =
Some(new java.io.PrintWriter(location)).foreach{
f => try{
f.write(content)
} finally{ f.close }
}
conf.json.toOption match {
case None => ;
case Some(jsonFile) => {
val answers = ujson.Obj(
"E.1" -> ujson.Obj(
"MinRentingDays" -> ujson.Num(0.0) // Datatype of answer: Double
),
"E.2" -> ujson.Obj(
"ContainerDailyCost" -> ujson.Num(0.0),
"4RPisDailyCostIdle" -> ujson.Num(0.0),
"4RPisDailyCostComputing" -> ujson.Num(0.0),
"MinRentingDaysIdleRPiPower" -> ujson.Num(0.0),
"MinRentingDaysComputingRPiPower" -> ujson.Num(0.0)
),
"E.3" -> ujson.Obj(
"NbRPisEqBuyingICCM7" -> ujson.Num(0.0),
"RatioRAMRPisVsICCM7" -> ujson.Num(0.0),
"RatioComputeRPisVsICCM7" -> ujson.Num(0.0)
)
)
val json = write(answers, 4)
println(json)
println("Saving answers in: " + jsonFile)
printToFile(json, jsonFile)
}
}
println("")
}
}
}
import org.rogach.scallop._
import breeze.linalg._
import breeze.numerics._
import scala.io.Source
import scala.collection.mutable.ArrayBuffer
import ujson._
import shared.predictions._
import org.apache.spark.sql.SparkSession
import org.apache.log4j.Logger
import org.apache.log4j.Level
package scaling {
class Conf(arguments: Seq[String]) extends ScallopConf(arguments) {
val train = opt[String](required = true)
val test = opt[String](required = true)
val json = opt[String]()
val users = opt[Int]()
val movies = opt[Int]()
val separator = opt[String](default=Some("\t"))
val master = opt[String]()
val num_measurements = opt[Int](default=Some(1))
verify()
}
object Optimizing extends App {
var conf = new Conf(args)
// The conf object is not serializable; extract the values here so that only
// these plain values are captured by the closures of the parallel implementations
val conf_users = conf.users()
val conf_movies = conf.movies()
// Remove these lines if you need Spark's own log output while debugging
Logger.getLogger("org").setLevel(Level.OFF)
Logger.getLogger("akka").setLevel(Level.OFF)
val spark = conf.master.toOption match {
case None => SparkSession.builder().getOrCreate();
case Some(master) => SparkSession.builder().master(master).getOrCreate();
}
spark.sparkContext.setLogLevel("ERROR")
val sc = spark.sparkContext
println("Loading training data from: " + conf.train())
val train = loadSpark(sc, conf.train(), conf.separator(), conf.users(), conf.movies())
val test = loadSpark(sc, conf.test(), conf.separator(), conf.users(), conf.movies())
val measurements = (1 to scala.math.max(1, conf.num_measurements())).map(x => timingInMs(() => {
0.0
}))
val timings = measurements.map(t => t._2)
val mae = measurements(0)._1
// Save answers as JSON
def printToFile(content: String,
location: String = "./answers.json") =
Some(new java.io.PrintWriter(location)).foreach{
f => try{
f.write(content)
} finally{ f.close }
}
conf.json.toOption match {
case None => ;
case Some(jsonFile) => {
val answers = ujson.Obj(
"Meta" -> ujson.Obj(
"train" -> ujson.Str(conf.train()),
"test" -> ujson.Str(conf.test()),
"users" -> ujson.Num(conf.users()),
"movies" -> ujson.Num(conf.movies()),
"master" -> ujson.Str(conf.master()),
"num_measurements" -> ujson.Num(conf.num_measurements())
),
"BR.1" -> ujson.Obj(
"1.k10u1v1" -> ujson.Num(0.0),
"2.k10u1v864" -> ujson.Num(0.0),
"3.k10u1v886" -> ujson.Num(0.0),
"4.PredUser1Item1" -> ujson.Num(0.0),
"5.PredUser327Item2" -> ujson.Num(0.0),
"6.Mae" -> ujson.Num(0.0)
),
"BR.2" -> ujson.Obj(
"average (ms)" -> ujson.Num(mean(timings)), // Datatype of answer: Double
"stddev (ms)" -> ujson.Num(std(timings)) // Datatype of answer: Double
)
)
val json = write(answers, 4)
println(json)
println("Saving answers in: " + jsonFile)
printToFile(json, jsonFile)
}
}
println("")
}
}
package predict
import org.rogach.scallop._
import org.json4s.jackson.Serialization
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import org.apache.log4j.Logger
import org.apache.log4j.Level
class Conf(arguments: Seq[String]) extends ScallopConf(arguments) {
val train = opt[String](required = true)
val test = opt[String](required = true)
val json = opt[String]()
verify()
}
case class Rating(user: Int, item: Int, rating: Double)
object Predictor extends App {
// Remove these lines if you need Spark's own log output while debugging
Logger.getLogger("org").setLevel(Level.OFF)
Logger.getLogger("akka").setLevel(Level.OFF)
val spark = SparkSession.builder()
.master("local[1]")
.getOrCreate()
spark.sparkContext.setLogLevel("ERROR")
println("")
println("******************************************************")
var conf = new Conf(args)
println("Loading training data from: " + conf.train())
val trainFile = spark.sparkContext.textFile(conf.train())
val train = trainFile.map(l => {
val cols = l.split("\t").map(_.trim)
Rating(cols(0).toInt, cols(1).toInt, cols(2).toDouble)
})
assert(train.count == 80000, "Invalid training data")
println("Loading test data from: " + conf.test())
val testFile = spark.sparkContext.textFile(conf.test())
val test = testFile.map(l => {
val cols = l.split("\t").map(_.trim)
Rating(cols(0).toInt, cols(1).toInt, cols(2).toDouble)
})
assert(test.count == 20000, "Invalid test data")
val globalPred = 3.0
val globalMae = test.map(r => scala.math.abs(r.rating - globalPred)).reduce(_+_) / test.count.toDouble
// Save answers as JSON
def printToFile(content: String,
location: String = "./answers.json") =
Some(new java.io.PrintWriter(location)).foreach{
f => try{
f.write(content)
} finally{ f.close }
}
conf.json.toOption match {
case None => ;
case Some(jsonFile) => {
var json = "";
{
// Limiting the scope of implicit formats with {}
implicit val formats = org.json4s.DefaultFormats
val answers: Map[String, Any] = Map(
"3.1.4" -> Map(
"global-mae" -> globalMae
)
)
json = Serialization.writePretty(answers)
}
println(json)
println("Saving answers in: " + jsonFile)
printToFile(json, jsonFile)
}
}
println("")
spark.close()
}
package recommend
import org.rogach.scallop._
import org.json4s.jackson.Serialization
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import org.apache.log4j.Logger
import org.apache.log4j.Level
class Conf(arguments: Seq[String]) extends ScallopConf(arguments) {
val data = opt[String](required = true)
val personal = opt[String](required = true)
val json = opt[String]()
verify()
}
case class Rating(user: Int, item: Int, rating: Double)
object Recommender extends App {
// Remove these lines if you need Spark's own log output while debugging
Logger.getLogger("org").setLevel(Level.OFF)
Logger.getLogger("akka").setLevel(Level.OFF)
val spark = SparkSession.builder()
.master("local[1]")
.getOrCreate()
spark.sparkContext.setLogLevel("ERROR")
println("")
println("******************************************************")
var conf = new Conf(args)
println("Loading data from: " + conf.data())
val dataFile = spark.sparkContext.textFile(conf.data())
val data = dataFile.map(l => {
val cols = l.split("\t").map(_.trim)
Rating(cols(0).toInt, cols(1).toInt, cols(2).toDouble)
})
assert(data.count == 100000, "Invalid data")
println("Loading personal data from: " + conf.personal())
val personalFile = spark.sparkContext.textFile(conf.personal())
// TODO: Extract ratings and movie titles
assert(personalFile.count == 1682, "Invalid personal data")
// Save answers as JSON
def printToFile(content: String,
location: String = "./answers.json") =
Some(new java.io.PrintWriter(location)).foreach{
f => try{
f.write(content)
} finally{ f.close }
}
conf.json.toOption match {
case None => ;
case Some(jsonFile) => {
var json = "";
{
// Limiting the scope of implicit formats with {}
implicit val formats = org.json4s.DefaultFormats
val answers: Map[String, Any] = Map(
"4.1.1" -> List[Any](
List(0,"Tron", 5.0),
List(0,"Tron", 5.0),
List(0,"Tron", 5.0),
List(0,"Tron", 5.0),
List(0,"Tron", 5.0)
)
)
json = Serialization.writePretty(answers)
}
println(json)
println("Saving answers in: " + jsonFile)
printToFile(json, jsonFile)
}
}
println("")
spark.close()
}
package shared
import breeze.linalg._
import breeze.numerics._
import scala.io.Source
import scala.collection.mutable.ArrayBuffer
import org.apache.spark.SparkContext
package object predictions
{
// ------------------------ For template
case class Rating(user: Int, item: Int, rating: Double)
def timingInMs(f : ()=>Double ) : (Double, Double) = {
val start = System.nanoTime()
val output = f()
val end = System.nanoTime()
return (output, (end-start)/1000000.0)
}
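// Illustrative helper (hypothetical, not referenced by the template): repeat a
// timed computation and collect the per-run wall-clock times in milliseconds,
// ready to summarize with the mean and std helpers defined below.
def repeatTimingsInMs(n: Int, f: () => Double): (Double, Seq[Double]) = {
val runs = (1 to scala.math.max(1, n)).map(_ => timingInMs(f))
(runs.last._1, runs.map(_._2))
}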
def toInt(s: String): Option[Int] = {
try {
Some(s.toInt)
} catch {
case e: Exception => None
}
}
def mean(s: Seq[Double]): Double = if (s.size > 0) s.sum / s.length else 0.0
def std(s: Seq[Double]): Double = {
if (s.size == 0) 0.0
else {
val m = mean(s)
scala.math.sqrt(s.map(x => scala.math.pow(m - x, 2)).sum / s.length.toDouble)
}
}
def load(path : String, sep : String, nbUsers : Int, nbMovies : Int) : CSCMatrix[Double] = {
val file = Source.fromFile(path)
val builder = new CSCMatrix.Builder[Double](rows=nbUsers, cols=nbMovies)
for (line <- file.getLines) {
val cols = line.split(sep).map(_.trim)
toInt(cols(0)) match {
case Some(_) => builder.add(cols(0).toInt-1, cols(1).toInt-1, cols(2).toDouble)
case None => None
}
}
file.close
builder.result()
}
def loadSpark(sc : org.apache.spark.SparkContext, path : String, sep : String, nbUsers : Int, nbMovies : Int) : CSCMatrix[Double] = {
val file = sc.textFile(path)
val ratings = file
.flatMap(l => {
val cols = l.split(sep).map(_.trim)
toInt(cols(0)) match {
case Some(_) => Some(((cols(0).toInt-1, cols(1).toInt-1), cols(2).toDouble))
case None => None
}
})
.collect()
val builder = new CSCMatrix.Builder[Double](rows=nbUsers, cols=nbMovies)
for (((u, i), rating) <- ratings) {
builder.add(u, i, rating)
}
return builder.result
}
def partitionUsers (nbUsers : Int, nbPartitions : Int, replication : Int) : Seq[Set[Int]] = {
val r = new scala.util.Random(1337)
val bins : Map[Int, collection.mutable.ListBuffer[Int]] = (0 to (nbPartitions-1))
.map(p => (p -> collection.mutable.ListBuffer[Int]())).toMap
(0 to (nbUsers-1)).foreach(u => {
val assignedBins = r.shuffle(0 to (nbPartitions-1)).take(replication)
for (b <- assignedBins) {
bins(b) += u
}
})
bins.values.toSeq.map(_.toSet)
}
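// Illustrative check (hypothetical helper, not referenced by the template):
// with replication r (and r <= nbPartitions), each user id appears in exactly
// r of the returned partitions, so total assignments sum to nbUsers * r.
def replicationOf(partitions: Seq[Set[Int]], user: Int): Int =
partitions.count(_.contains(user))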
}
package stats
import org.rogach.scallop._
import org.json4s.jackson.Serialization
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import org.apache.log4j.Logger
import org.apache.log4j.Level
class Conf(arguments: Seq[String]) extends ScallopConf(arguments) {
val data = opt[String](required = true)
val json = opt[String]()
verify()
}
case class Rating(user: Int, item: Int, rating: Double)
object Analyzer extends App {
// Remove these lines if you need Spark's own log output while debugging
Logger.getLogger("org").setLevel(Level.OFF)
Logger.getLogger("akka").setLevel(Level.OFF)
val spark = SparkSession.builder()
.master("local[1]")
.getOrCreate()
spark.sparkContext.setLogLevel("ERROR")
println("")
println("******************************************************")
var conf = new Conf(args)
println("Loading data from: " + conf.data())
val dataFile = spark.sparkContext.textFile(conf.data())
val data = dataFile.map(l => {
val cols = l.split("\t").map(_.trim)
Rating(cols(0).toInt, cols(1).toInt, cols(2).toDouble)
})
assert(data.count == 100000, "Invalid data")
// Save answers as JSON
def printToFile(content: String,
location: String = "./answers.json") =
Some(new java.io.PrintWriter(location)).foreach{
f => try{
f.write(content)
} finally{ f.close }
}
conf.json.toOption match {
case None => ;
case Some(jsonFile) => {
var json = "";
{
// Limiting the scope of implicit formats with {}
implicit val formats = org.json4s.DefaultFormats
val answers: Map[String, Any] = Map(
"3.1.1" -> Map(
"global-avg-rating" -> 3.0
)
)
json = Serialization.writePretty(answers)
}
println(json)
println("Saving answers in: " + jsonFile)
printToFile(json, jsonFile)
}
}
println("")
spark.close()
}
package test
import org.scalatest._
import funsuite._
import test.optimizing._
import test.distributed._
class AllTests extends Sequential(
new OptimizingTests,
new ExactTests,
new ApproximateTests
)