This is a demo Python application combining a web scraper/crawler, Kafka, CouchDB, and Spark ML, using the famous MovieLens dataset. It was developed on a Raspberry Pi cluster running Hadoop, Spark, and other technologies.
In this project you will find two pipelines:
- The Recommendation Pipeline: the standard movie recommendation example, using the MovieLens dataset with Spark ML.
- The Budget Prediction Pipeline: uses the links.csv file to crawl additional movie data (director, budget, etc.) from the IMDb website. In this pipeline you will find:
- A Python scraper that collects this data from the IMDb website and stores each message in a Kafka queue, in the following format (a producer sketch follows this list):
```
{'Director': 'Hiner Saleem', 'Writers': 'Hiner Saleem,Antoine Lacomblez', 'Reviews': '14', 'Critic': '37', 'Country': 'Iraq|France|Germany', 'Language': 'Kurdish|Arabic|Turkish', 'BudgetCurrency': 'EUR', 'BudgetValue': '2,600,000', 'Runtime': '100', 'Actors': '', 'IdMovie': '127244', 'IdIMDB': '2875926', 'GrossCurrency': 'NA', 'GrossValue': 0, 'OpeningWeekendCurrency': 'NA', 'OpeningWeekendValue': 0}
```
- A Kafka consumer (Python script) that sinks the data into a CouchDB database, sketched below;
- A Spark process that consumes this data, applies data preparation, and creates an ABT (Analytical Base Table), also sketched below;
- A Spark ML pipeline that trains GBTRegressor, RandomForestRegressor, and DecisionTreeRegressor models for comparison (see the sketch after this list);
- An example of how to expose a Spark model as a REST API, in this case using Spark itself to generate the responses (a sketch appears further below).
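For illustration, here is a minimal sketch of the producer side of the scraper, using the kafka-python package. The broker address and topic name are assumptions, not the repo's actual configuration:

```python
# Minimal producer sketch (kafka-python); broker address and topic
# name are illustrative, not this project's actual configuration.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",  # assumed broker address
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

message = {
    "Director": "Hiner Saleem",
    "IdMovie": "127244",
    "IdIMDB": "2875926",
    "BudgetCurrency": "EUR",
    "BudgetValue": "2,600,000",
    # ... remaining scraped fields as in the example above
}

producer.send("imdb-movies", message)  # assumed topic name
producer.flush()
```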
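The consumer side can be sketched the same way, pairing kafka-python with the couchdb package. Again, the broker, topic, database name, and credentials are illustrative:

```python
# Minimal consumer sketch: reads scraped messages from Kafka and sinks
# them into CouchDB. Broker, topic, database name, and credentials are
# assumptions for illustration only.
import json
import couchdb
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "imdb-movies",                        # assumed topic name
    bootstrap_servers="localhost:9092",   # assumed broker address
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    auto_offset_reset="earliest",
)

server = couchdb.Server("http://admin:admin@localhost:5984/")  # assumed credentials
db = server["movies"] if "movies" in server else server.create("movies")

for record in consumer:
    doc = record.value
    doc["_id"] = doc["IdMovie"]  # use the MovieLens id as the document id
    db.save(doc)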
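The data-preparation step might look like the sketch below. It assumes the CouchDB documents have been exported to JSON files readable by Spark (the actual ingestion path in the project may differ), and the paths and column choices are illustrative:

```python
# Data-preparation sketch building a simple ABT with PySpark.
# Assumes the CouchDB documents were exported as JSON files;
# paths and column choices are illustrative.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("budget-abt").getOrCreate()

raw = spark.read.json("hdfs:///data/imdb_docs/*.json")  # assumed export location

abt = (
    raw
    # 'BudgetValue' arrives as a string like '2,600,000': strip commas and cast
    .withColumn("Budget", F.regexp_replace("BudgetValue", ",", "").cast("double"))
    .withColumn("Runtime", F.col("Runtime").cast("double"))
    .withColumn("Reviews", F.col("Reviews").cast("double"))
    .withColumn("Critic", F.col("Critic").cast("double"))
    # keep only rows where the prediction target is known
    .filter(F.col("Budget").isNotNull())
    .select("IdMovie", "Runtime", "Reviews", "Critic", "Country", "Language", "Budget")
)

abt.write.mode("overwrite").parquet("hdfs:///data/budget_abt")  # assumed output path
```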
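The model comparison can be sketched by training the three regressors and evaluating each by RMSE. The feature columns and paths follow the ABT sketch above and are assumptions:

```python
# Model-comparison sketch with Spark ML: trains the three regressors the
# project mentions and compares them by RMSE on a held-out split.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import (
    GBTRegressor, RandomForestRegressor, DecisionTreeRegressor,
)
from pyspark.ml.evaluation import RegressionEvaluator

spark = SparkSession.builder.appName("budget-models").getOrCreate()

abt = spark.read.parquet("hdfs:///data/budget_abt")  # assumed ABT location
assembler = VectorAssembler(
    inputCols=["Runtime", "Reviews", "Critic"],  # assumed feature columns
    outputCol="features",
    handleInvalid="skip",
)
data = assembler.transform(abt).select("features", "Budget")
train, test = data.randomSplit([0.8, 0.2], seed=42)

evaluator = RegressionEvaluator(labelCol="Budget", metricName="rmse")
for model_cls in (GBTRegressor, RandomForestRegressor, DecisionTreeRegressor):
    model = model_cls(labelCol="Budget", featuresCol="features").fit(train)
    rmse = evaluator.evaluate(model.transform(test))
    print(f"{model_cls.__name__}: RMSE = {rmse:,.0f}")
```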
Important: All the references used as support are mentioned in the presentation (ppt folder).
You can download it here.
The API uses Spark to generate the predictions, so it needs to be launched via spark-submit.
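The pattern looks roughly like the sketch below, combining Flask with a persisted Spark ML model. The actual srvserver.py may differ; the model path, feature fields, and port are assumptions:

```python
# srvserver.py-style sketch: a Flask endpoint that uses a persisted Spark
# ML model to answer predictions. Model path, feature fields, and port are
# assumptions; launch with spark-submit so a Spark session is available.
from flask import Flask, jsonify, request
from pyspark.sql import Row, SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import GBTRegressionModel

spark = SparkSession.builder.appName("budget-api").getOrCreate()
model = GBTRegressionModel.load("hdfs:///models/budget_gbt")  # assumed model path
assembler = VectorAssembler(inputCols=["Runtime", "Reviews", "Critic"],
                            outputCol="features")

app = Flask(__name__)

@app.route("/predict", methods=["POST"])
def predict():
    # expects a JSON body like {"Runtime": 100, "Reviews": 14, "Critic": 37}
    row = spark.createDataFrame([Row(**request.get_json())])
    features = assembler.transform(row)
    prediction = model.transform(features).first()["prediction"]
    return jsonify({"predicted_budget": prediction})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)  # assumed port
```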
Reference:
- You can find more details in this very good example, which was used as the baseline: Building a Movie Recommendation Service with Apache Spark & Flask - Part 2
How to execute:

```
spark-submit --master yarn --deploy-mode client --num-executors 3 srvserver.py
```

(The spark-submit options must come before the application script; placed after srvserver.py, they would be passed to the script as application arguments instead.)