Movielens Dataset

This folder contains scripts and tasks to import the movies-100k dataset, which contains 100,000 ratings from nearly 1,000 users for about 1,700 different movies, all from the Movielens.org website. It can also download and import the movies-1M dataset, with 1 million ratings from 6,000 users on 4,000 movies.

The repository should be available at /vagrant inside the VM. First connect to the VM, then set up the data set.

To connect to the VM, run:

$ vagrant ssh

The following commands are run inside the VM.

$ cd /vagrant/dataset/movies-100k
$ gem install bundler
$ bundle install

This installs Bundler first, then all required gem dependencies.

Inside the folder dataset/movies-100k there is a Rakefile that provides a number of tasks. To display a list of all available rake tasks run:

$ bundle exec rake -T

Dataset

Go to the data set folder and run the following commands to upload the data set to Elasticsearch:

$ cd /vagrant/dataset/movies-100k

First we create an Elasticsearch index to store the data set and define the mappings for all types. We use the elasticsearch-rake-tasks gem and run the following command:

$ bundle exec rake es:movies:create[http://localhost:9200,movies]

This creates a new index named "movies" at the local Elasticsearch instance and applies the template with the same name.

Then run the rake task for the movies-100k data set:

$ bundle exec rake create_100k_data_set

For the data set containing 1M ratings, use:

$ bundle exec rake create_1m_data_set

This first downloads the movies-100k or movies-1M data set to a tmp folder, then extracts and transforms all the users, genres, movies and ratings from the data set and creates a JSON file compatible with the Elasticsearch Bulk API.
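To illustrate what "compatible with the Bulk API" means, here is a minimal sketch that turns one line of the MovieLens 100k ratings file (tab-separated user id, item id, rating, timestamp) into the action line and document line the Bulk API expects. It is not the code the rake task runs: the type name `rating`, the `_id` scheme and the field names are assumptions made for illustration, and the actual seed file may look different.

```ruby
require 'json'

# Convert one tab-separated ratings line into two newline-terminated JSON
# lines: an action line followed by the document itself.
# ASSUMPTION: the type name "rating", the _id scheme and the field names
# are illustrative only and not taken from the repository.
def rating_to_bulk(line)
  user_id, movie_id, rating, timestamp = line.strip.split("\t")
  action = { index: { _type: 'rating', _id: "#{user_id}_#{movie_id}" } }
  source = {
    user_id: user_id.to_i,
    movie_id: movie_id.to_i,
    rating: rating.to_i,
    rated_at: timestamp.to_i
  }
  "#{JSON.generate(action)}\n#{JSON.generate(source)}\n"
end

# A ratings line in the 100k format: user, movie, rating, Unix timestamp.
sample = "196\t242\t3\t881250949"
puts rating_to_bulk(sample)
```

Every document takes exactly two lines, and the file must end with a newline, which is why each generated pair is newline-terminated.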

The last step is to bulk upload the generated seed file to Elasticsearch, which is done by:

$ curl -X POST 'http://localhost:9200/movies/_bulk' --data-binary @item_seed.json > /dev/null
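Because the curl command above sends the response to /dev/null, failures of individual documents go unnoticed. A Bulk API response carries a top-level `errors` flag and one entry per document in `items`, so if you save the response instead, it can be checked with a sketch like the following. The sample response is hand-written for illustration, not real output, and the helper name `bulk_failures` is hypothetical.

```ruby
require 'json'

# Return the per-document entries of a Bulk API response that report an
# error. The top-level "errors" flag says whether any item failed at all.
def bulk_failures(response_body)
  response = JSON.parse(response_body)
  return [] unless response['errors']

  # Each item looks like {"index" => {...}}; unwrap it and keep failures.
  response['items'].flat_map(&:values).select { |item| item.key?('error') }
end

# ASSUMPTION: hand-written sample response with one failed document.
sample_response = <<~RESPONSE
  {"took": 5, "errors": true, "items": [
    {"index": {"_id": "1", "status": 201}},
    {"index": {"_id": "2", "status": 400, "error": "failed to parse"}}
  ]}
RESPONSE

bulk_failures(sample_response).each do |item|
  puts "document #{item['_id']} failed with status #{item['status']}"
end
```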

The single curl request might fail for the bulk file of the 1M data set. Alternatively, use the rake command to bulk upload, which takes a bit longer:

$ bundle exec rake upload_bulk

This uploads all entries from the seed.json file to the Elasticsearch index named movies.
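One common way to get a large bulk file into Elasticsearch is to send it in several smaller requests. The sketch below shows that idea; it is not the implementation of the `upload_bulk` task, and the helper names `bulk_chunks` and `post_bulk` as well as the chunk size are hypothetical. It assumes the file contains only index actions, so every document occupies exactly two lines.

```ruby
require 'net/http'
require 'uri'

# Group bulk file lines into request bodies of at most docs_per_request
# documents. Slicing in multiples of two keeps each action line together
# with its document line.
# ASSUMPTION: every action is followed by exactly one document line.
def bulk_chunks(lines, docs_per_request)
  lines.each_slice(2 * docs_per_request).map(&:join)
end

# POST one chunk to a _bulk endpoint and return the HTTP response.
def post_bulk(endpoint, body)
  uri = URI(endpoint)
  request = Net::HTTP::Post.new(uri)
  request['Content-Type'] = 'application/x-ndjson'
  request.body = body
  Net::HTTP.start(uri.hostname, uri.port) { |http| http.request(request) }
end

# Usage (needs a running Elasticsearch, so it is left commented out):
# lines = File.readlines('item_seed.json')
# bulk_chunks(lines, 1000).each do |chunk|
#   post_bulk('http://localhost:9200/movies/_bulk', chunk)
# end
```

Smaller requests take more round trips, which fits the note above that this route is slower than a single curl call.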
