This folder contains scripts and tasks to import the movies-100k dataset, which contains 100,000 ratings from nearly 1,000 users for about 1,700 different movies, all from the MovieLens.org website. It can also download and import the movies-1M dataset, with 1 million ratings from 6,000 users on 4,000 movies.
The repository should be available at /vagrant inside the VM. First connect to the VM and set up the data set.
To connect to the VM, run:
$ vagrant ssh
The following commands are run inside the VM.
$ cd /vagrant/dataset/movies-100k
$ gem install bundler
$ bundle install
First bundler is installed, then all required gem dependencies.
Inside the folder dataset/movies-100k there is a Rakefile that provides a number of tasks. To display a list of all available rake tasks, run:
$ bundle exec rake -T
Go to the data set folder and run the following commands to upload the data set to Elasticsearch:
$ cd /vagrant/dataset/movies-100k
First we create an Elasticsearch index to store the data set and define the mappings for all types. We use the elasticsearch-rake-tasks gem and run the following command:
$ bundle exec rake es:movies:create[http://localhost:9200,movies]
This creates a new index named "movies" at the local Elasticsearch instance and applies the template with the same name.
Then run the rake task for the movies-100k data set:
$ bundle exec rake create_100k_data_set
For the data set containing 1M ratings, use:
$ bundle exec rake create_1m_data_set
This first downloads the movies-100k or movies-1M data set to a tmp folder, then extracts and transforms all the users, genres, movies and ratings from the data set, and finally creates a JSON file compatible with the Elasticsearch Bulk API.
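To illustrate the transformation step, here is a minimal Ruby sketch, not the code from the Rakefile. It relies on the MovieLens 100k ratings file (u.data) being tab-separated with the columns user id, item id, rating and timestamp. The method names, the JSON field names, the "rating" type and the id scheme are all assumptions made for this example.

```ruby
require 'json'

# Parse one line of the MovieLens 100k ratings file (u.data):
# user id, item id, rating and Unix timestamp, separated by tabs.
def parse_rating(line)
  user_id, movie_id, rating, timestamp = line.chomp.split("\t")
  {
    'user_id'   => user_id.to_i,
    'movie_id'  => movie_id.to_i,
    'rating'    => rating.to_i,
    'timestamp' => timestamp.to_i
  }
end

# Build the two lines the Bulk API expects per document:
# an action line followed by the document source.
# The "rating" type and the "<user>-<movie>" id are assumptions for illustration.
def bulk_lines(rating)
  id = "#{rating['user_id']}-#{rating['movie_id']}"
  action = { 'index' => { '_type' => 'rating', '_id' => id } }
  [JSON.generate(action), JSON.generate(rating)]
end

# Join all lines with newlines; the Bulk API requires a trailing newline.
def bulk_body(lines)
  lines.flat_map { |line| bulk_lines(parse_rating(line)) }.join("\n") + "\n"
end
```

Calling `bulk_body(File.readlines('u.data'))` would return a string that can be written to a seed file. The index name is not part of the action line here because it is already given in the `/movies/_bulk` URL.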
The last step is to bulk upload the generated seed file to Elasticsearch, which is done by:
$ curl -X POST 'http://localhost:9200/movies/_bulk' --data-binary @item_seed.json > /dev/null
This might fail for the bulk file with 1M documents. Alternatively, use the rake task for the bulk upload, which takes a bit longer:
$ bundle exec rake upload_bulk
This uploads all entries from the seed.json file to the Elasticsearch index named movies.
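If a single request is too large for the 1M bulk file, one possible workaround is to send the file in smaller batches. The following Ruby sketch shows the idea; it is not necessarily what the upload_bulk task does. It assumes the seed file contains only index actions, so that every document takes exactly two lines. The method names and the default batch size are made up for this example.

```ruby
require 'net/http'
require 'uri'

# Group bulk lines into request bodies holding whole documents.
# Each document occupies two lines (action + source), so a batch
# must never split such a pair. Every line is newline-terminated,
# which also gives each body the trailing newline the Bulk API needs.
def bulk_batches(lines, docs_per_batch)
  lines.each_slice(2 * docs_per_batch).map do |slice|
    slice.map { |line| line.chomp + "\n" }.join
  end
end

# POST every batch to the given _bulk URL and return the HTTP status codes.
def upload_in_batches(path, url, docs_per_batch = 1000)
  uri = URI.parse(url)
  bulk_batches(File.foreach(path), docs_per_batch).map do |body|
    Net::HTTP.post(uri, body, 'Content-Type' => 'application/x-ndjson').code
  end
end
```

A call such as `upload_in_batches('item_seed.json', 'http://localhost:9200/movies/_bulk')` would then send the file 1,000 documents at a time. Smaller batches mean more requests but keep each request body small.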