This repository has been archived by the owner on Sep 1, 2022. It is now read-only.

Reproducing HybridSVD paper

This repository contains the full source code for reproducing results from the HybridSVD paper. If you want to run it on your own machine, make sure to prepare a conda environment according to this configuration file, which lists all required packages (including their versions).

You can also interactively run experiments directly in your browser with the help of Binder cloud technologies. Simply click on the badge below to get started:

badge

This will launch an interactive JupyterLab environment with access to all repository files. By default it starts with the HybridSVD.ipynb notebook, which contains the code for the HybridSVD model evaluated on the Movielens and Bookcrossing datasets.

Mind cloud environment restrictions

Due to restrictions on Binder's cloud resources, only small datasets, e.g., Movielens-1M or Amazon Video Games, allow performing full experiments without interruption. Attempts to work with larger files will likely crash the environment. Originally, all experiments were conducted on HPC servers with far more hardware resources. It is therefore advised to make the following modifications to run the Jupyter notebooks safely in the Binder cloud:

Working with Movielens-1M data

Experiments with this dataset are available in the following files:

  • Baselines.ipynb
  • HybridSVD.ipynb
  • FactorizationMachines.ipynb
  • LCE.ipynb
  • ScaledSVD.ipynb
  • ScaledHybridSVD.ipynb

You need to change the data_labels variable in the Experiment setup section of each notebook from

data_labels = ['ML1M', 'ML10M', 'BX']

to

data_labels = ['ML1M']

Accordingly, do not run the cells under the Movielens10M and BookCrossing headers (these datasets are not provided in the cloud environment). Also make sure that the first argument to get_movielens_data is ../datasets/movielens/ml-1m.zip. The notebooks were originally executed on several machines, so the path in them may vary. The call should start as:

data_dict[lbl], meta_dict[lbl] = get_movielens_data('../datasets/movielens/ml-1m.zip',
                                                     <other arguments>
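Before running the data-loading cell, it can help to confirm that the archive is actually where the notebook expects it. The following is a minimal sketch, not part of the repository: the missing_files helper is a hypothetical name introduced here for illustration, and only the path itself comes from the instructions above.

```python
from pathlib import Path

def missing_files(paths):
    """Return the subset of `paths` that do not point to an existing file."""
    return [p for p in paths if not Path(p).is_file()]

# Path taken from the instructions above; resolved relative to the
# notebook's working directory.
required = ['../datasets/movielens/ml-1m.zip']

missing = missing_files(required)
if missing:
    print('Missing dataset files:', missing)
else:
    print('All dataset files found.')
```

If the file is reported as missing, check the working directory of the notebook before editing the path.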

Working with Amazon Video Games data

Experiments with this dataset are available in the following files:

  • Baselines_AMZ.ipynb
  • HybridSVD_AMZ.ipynb
  • FactorizationMachines_AMZ.ipynb
  • LCE_AMZ.ipynb
  • ScaledSVD_AMZ.ipynb
  • ScaledHybridSVD_AMZ.ipynb

You need to change the data_labels variable in the Experiment setup section from

data_labels = ['AMZe', 'AMZvg']

to

data_labels = ['AMZvg']

Accordingly, do not run the cells under the AMZe header. Again, make sure to provide the correct input arguments to get_amazon_data. In this case they are:

data_dict[lbl], meta_dict[lbl] = get_amazon_data('../datasets/amazon/ratings_Video_Games.csv',
                                                 meta_path='../datasets/amazon/meta/meta_Video_Games.json.gz',
                                                 <other arguments>
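As an alternative to checking by hand, the list of labels could be derived from the files that are present. This is only a sketch under stated assumptions: available_labels and required_files are hypothetical names that do not exist in the notebooks, and only the two AMZvg paths are taken from the call above. The AMZe file names are not given here, so that label has no entry and is dropped.

```python
from pathlib import Path

# Hypothetical mapping from dataset label to the files it needs.
# Only the AMZvg paths come from the instructions above.
required_files = {
    'AMZvg': ['../datasets/amazon/ratings_Video_Games.csv',
              '../datasets/amazon/meta/meta_Video_Games.json.gz'],
}

def available_labels(labels, required_files):
    """Keep labels that have a known file list with every file present."""
    keep = []
    for lbl in labels:
        files = required_files.get(lbl)
        if files and all(Path(f).is_file() for f in files):
            keep.append(lbl)
    return keep

data_labels = available_labels(['AMZe', 'AMZvg'], required_files)
print(data_labels)
```

Labels without a file list are treated as unavailable, which errs on the side of skipping a dataset rather than crashing the session while loading it.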

Reducing training time

Keep in mind that some models take much longer to train than others. For example, the whole HybridSVD experiment on the Movielens-1M dataset, in both the standard and cold start scenarios, completes before the initial tuning of Factorization Machines is done for the standard scenario. Because Binder automatically shuts down long-running tasks, you may not be able to perform all computations before the timeout. To reduce the risk of such a shutdown, you may want to run different notebooks (different models) in independent Binder sessions.

You may also want to reduce the number of points considered in the random grid search for tuning non-SVD-based models. For example, in the FM case you can change the ntrials=60 input to ntrials=30 in the fine_tune_fm(model, params, label, ntrials=60) function calls. This may, however, slightly decrease the resulting quality of FM.
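To illustrate why a smaller ntrials value shortens tuning, here is a minimal sketch of random grid search sampling. It is not the repository's fine_tune_fm implementation: sample_grid is a hypothetical helper, and the parameter names and values in the grid are made up for illustration. The point is that the number of model fits is bounded by ntrials rather than by the size of the full grid.

```python
import random
from itertools import product

def sample_grid(param_grid, ntrials, seed=0):
    """Pick at most `ntrials` distinct points from the full parameter grid."""
    names = list(param_grid)
    all_points = list(product(*(param_grid[name] for name in names)))
    rng = random.Random(seed)
    chosen = rng.sample(all_points, min(ntrials, len(all_points)))
    return [dict(zip(names, point)) for point in chosen]

# Hypothetical grid with 5 * 4 * 4 = 80 points; names and values are
# illustrative only and are not taken from the notebooks.
param_grid = {
    'rank': [8, 16, 32, 64, 128],
    'learning_rate': [0.001, 0.003, 0.01, 0.03],
    'regularization': [0.0, 0.01, 0.1, 1.0],
}

full = sample_grid(param_grid, ntrials=60)
half = sample_grid(param_grid, ntrials=30)
print(len(full), len(half))  # 60 30
```

Each sampled point corresponds to one training run, so halving ntrials roughly halves the tuning time, at the cost of exploring fewer configurations.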

Alternatively, you can skip the parameter tuning sections for long-running models and reuse the previously found sets of nearly optimal hyper-parameters. They are printed at the end of each model tuning section. You can also find them in the View optimal parameters notebook.