
(Ongoing notebook) This is a Kaggle competition. With more than 37 million user log records, we want to predict the hotel cluster given features such as hotel country, user country, and check-in and check-out dates. The data is big enough that the classic pandas and scikit-learn will not work for analysis and building ML models; instead, I will be usi…


hamedrazavi/expedia_hotel_recommendations


This is an ongoing notebook for a Kaggle competition (https://www.kaggle.com/c/expedia-hotel-recommendations/data). The goal is to predict each user's preferred hotel cluster. I keep updating this notebook.

The training data is a 3.8 GB dataset of more than 37 million samples (user log records). It includes information such as the user country, the hotel country, and the check-in and check-out dates of each search.

The test set includes more than 250k user log records.

Because of the size of the training data, in-memory Python libraries such as pandas and scikit-learn are impractical on a single machine, so here I use PySpark, the Python API for Apache Spark. I will also use Amazon's distributed computing service, Elastic MapReduce (EMR), to speed up the analysis and the building of machine learning models.
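To give a feel for why a streaming or distributed approach suits data of this size, here is a minimal sketch using only the Python standard library. It is not the notebook's code: it reads rows one at a time and counts hotel clusters per hotel country, which is the same group-and-count shape that a Spark job would spread over a cluster. The column names `hotel_country` and `hotel_cluster`, the helper `top_clusters_by_country`, and the tiny inline sample are assumptions made for illustration.

```python
import csv
import io
from collections import Counter, defaultdict


def top_clusters_by_country(rows, k=5):
    """Return the k most frequent hotel clusters for each hotel country.

    `rows` is any iterable of dicts, so a csv.DictReader over a large file
    can be consumed one row at a time without loading the file into memory.
    """
    counts = defaultdict(Counter)
    for row in rows:
        # Column names are assumed for this sketch.
        counts[row["hotel_country"]][row["hotel_cluster"]] += 1
    return {
        country: [cluster for cluster, _ in counter.most_common(k)]
        for country, counter in counts.items()
    }


# Tiny in-memory stand-in for the real training file (hypothetical values).
sample = io.StringIO(
    "hotel_country,hotel_cluster\n"
    "50,91\n"
    "50,91\n"
    "50,41\n"
    "8,1\n"
)
result = top_clusters_by_country(csv.DictReader(sample), k=2)
print(result)
```

With a real file, the `io.StringIO` object would be replaced by an open file handle. Memory use then grows with the number of distinct (country, cluster) pairs rather than with the number of rows, although a single process would still be slow on 37 million rows, which is where Spark on EMR comes in.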
