Note: The Recommendation System will utilize the data from yelp.com
- train_review.json – the main file containing the review data; the RS works primarily with this file.
- test_review.json – contains only the target user and business pairs for the prediction tasks.
- test_review_ratings.json – contains the ground-truth ratings for the testing pairs.
- stopwords – contains common stopwords used when calculating TF-IDF scores.
- The file is preprocessed first using Apache Spark
The Recommendation System is divided into four subfolders, each of which uses a different algorithm to produce recommendations.
- Collaborative Filtering: a collaborative filtering RS with two cases, item-based CF and user-based CF.
  - Item-based CF: the RS computes the Pearson correlation for business pairs with at least three co-rated users, then predicts using the 3 or 5 neighbors most similar to the target business.
  - User-based CF: MinHash and LSH are used first to identify similar users, which reduces the number of pairs that need a Pearson correlation. After the similar users are identified by their Jaccard similarity, the RS computes the Pearson correlation for all candidate user pairs and makes the prediction.
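The Pearson step and the neighbor-based prediction described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names, the choice to compute means over co-rated entries only, and the positive-similarity weighted average are all assumptions. Only the three co-rated minimum and the 3 or 5 neighbors come from the description.

```python
from math import sqrt


def pearson(ratings_a, ratings_b, min_co_rated=3):
    """Pearson correlation over the keys two {rater: stars} dicts share.

    Returns None when fewer than `min_co_rated` raters are shared.
    Means are taken over the co-rated entries only (an assumed variant).
    """
    common = ratings_a.keys() & ratings_b.keys()
    if len(common) < min_co_rated:
        return None
    mean_a = sum(ratings_a[k] for k in common) / len(common)
    mean_b = sum(ratings_b[k] for k in common) / len(common)
    num = sum((ratings_a[k] - mean_a) * (ratings_b[k] - mean_b) for k in common)
    den = sqrt(sum((ratings_a[k] - mean_a) ** 2 for k in common)) * sqrt(
        sum((ratings_b[k] - mean_b) ** 2 for k in common)
    )
    return num / den if den else 0.0


def predict_item_based(user_ratings, target, sims, n_neighbors=3):
    """Weighted average of the user's stars on the N most similar businesses.

    `sims` maps a sorted (b1, b2) tuple to a similarity; only positive
    similarities are used (an assumption, not stated in the README).
    """
    scored = []
    for business, stars in user_ratings.items():
        weight = sims.get(tuple(sorted((target, business))))
        if weight is not None and weight > 0:
            scored.append((weight, stars))
    top = sorted(scored, reverse=True)[:n_neighbors]
    total = sum(weight for weight, _ in top)
    return sum(weight * stars for weight, stars in top) / total if total else None


# Hypothetical ratings: business -> {user: stars}
b1 = {"u1": 5, "u2": 3, "u3": 4}
b2 = {"u1": 4, "u2": 2, "u3": 3, "u4": 5}
sims = {("b1", "b2"): pearson(b1, b2)}
prediction = predict_item_based({"b2": 4.0}, "b1", sims)
```

The same `pearson` function applies to the user-based case by passing two users' `{business: stars}` dicts instead.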
- Content-Based Recommendation Sys: a content-based RS that generates profiles for users and businesses from the review texts in train_review.json. Algorithms used: TF-IDF scoring and cosine similarity.
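As a rough sketch of how TF-IDF profiles and cosine similarity fit together, assuming one concatenated review text per business: the tokenization, the TF normalization by the most frequent term, and the base-2 IDF are assumed variants, and all names and sample texts here are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs, stopwords=frozenset()):
    """Map each document id to a sparse {term: tf-idf} dict.

    tf  = term count / count of the document's most frequent term
    idf = log2(number of documents / documents containing the term)
    """
    tokens = {
        doc_id: [w for w in text.lower().split() if w not in stopwords]
        for doc_id, text in docs.items()
    }
    doc_freq = Counter(w for words in tokens.values() for w in set(words))
    n_docs = len(docs)
    vectors = {}
    for doc_id, words in tokens.items():
        counts = Counter(words)
        max_count = max(counts.values(), default=1)
        vectors[doc_id] = {
            w: (c / max_count) * math.log2(n_docs / doc_freq[w])
            for w, c in counts.items()
        }
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values())
    )
    return dot / norm if norm else 0.0


# Hypothetical review texts keyed by business id
docs = {
    "b1": "great tacos and great salsa",
    "b2": "great coffee",
    "b3": "slow service and cold coffee",
}
vecs = tfidf_vectors(docs, stopwords={"and"})
sim_b1_b2 = cosine(vecs["b1"], vecs["b2"])  # share the term "great"
sim_b1_b3 = cosine(vecs["b1"], vecs["b3"])  # share no terms
```

A user profile could then be built by aggregating the vectors of the businesses that user reviewed; how the project aggregates them is not stated here.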
- Finding Similar Items: finds similar business pairs in train_review.json. Algorithms used: MinHash, Locality Sensitive Hashing, and Jaccard similarity.
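The MinHash, LSH, and Jaccard pipeline can be sketched as below, assuming each business is represented by the non-empty set of users who reviewed it. The number of hash functions, the band count, the hash family, and the similarity threshold are hypothetical parameters chosen for illustration, not values taken from the project.

```python
import random
from collections import defaultdict
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two sets."""
    return len(a & b) / len(a | b) if (a or b) else 0.0


def minhash_signatures(sets, num_hashes=60, seed=42):
    """One signature (list of `num_hashes` minima) per key in `sets`.

    Assumes every set is non-empty. Hashes are (a * x + b) mod prime.
    """
    universe = sorted({x for s in sets.values() for x in s})
    index = {x: i for i, x in enumerate(universe)}
    prime = 2_147_483_647
    rng = random.Random(seed)
    params = [
        (rng.randrange(1, prime), rng.randrange(0, prime))
        for _ in range(num_hashes)
    ]
    return {
        key: [min((a * index[x] + b) % prime for x in s) for a, b in params]
        for key, s in sets.items()
    }


def lsh_candidates(signatures, bands=30):
    """Pairs whose signatures agree on every row of at least one band."""
    rows = len(next(iter(signatures.values()))) // bands
    candidates = set()
    for band in range(bands):
        buckets = defaultdict(list)
        for key, sig in signatures.items():
            buckets[tuple(sig[band * rows:(band + 1) * rows])].append(key)
        for keys in buckets.values():
            candidates.update(combinations(sorted(keys), 2))
    return candidates


def similar_pairs(sets, threshold):
    """Verify LSH candidates with the exact Jaccard similarity."""
    candidates = lsh_candidates(minhash_signatures(sets))
    result = {}
    for b1, b2 in candidates:
        sim = jaccard(sets[b1], sets[b2])
        if sim >= threshold:
            result[(b1, b2)] = sim
    return result


# Hypothetical data: business id -> set of reviewer ids
businesses = {
    "b1": {"u1", "u2", "u3"},
    "b2": {"u1", "u2", "u3"},
    "b3": {"u9"},
}
pairs = similar_pairs(businesses, threshold=0.5)
```

Because every candidate is checked against the exact Jaccard similarity, this design can miss true pairs but never reports a false one, which is one plausible reading of a precision of 1.0 paired with a recall slightly below 1.0.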
- Hybrid Recommendation Sys: a hybrid recommendation system that combines several different models to jointly produce the best result. This project placed third in the USC Data Mining (Recommendation System) Competition 2021, with a final score of 2709 and an RMSE of 1.1498.
- Similar Items:
  - b1 and b2 are the business IDs
  - sim is the Jaccard similarity of b1 and b2
- Content-based RS:
  - each user_id and business_id pair means 'this user would likely prefer to review this business'
  - sim is the calculated (predicted) cosine similarity between the profile vectors
- User-based CF Pearson Correlation Model:
  - u1 and u2 are the user IDs
  - sim is the Pearson correlation between the two users
- Item-based CF Pearson Correlation Model:
  - b1 and b2 are the business IDs
  - sim is the Pearson correlation between the two businesses
- CF prediction result:
  - each user_id and business_id pair means 'this user will likely rate this business with this many stars'
  - stars is the predicted rating
- Similar business pairs
  - precision: 1.0
  - recall: 0.9582400942205771
- Content-based RS
  - precision (test set): 1.0
  - recall (test set): 0.999469477863536
- CF model
  - item-based CF model
    - precision: 0.9641450981844213
    - recall: 0.9805068470797926
  - user-based CF model
    - precision: 0.9573746593617223
    - recall: 0.8276633759390503
- CF prediction
  - item-based RMSE (test set): 0.9023539405054186
  - user-based RMSE (test set): 0.9901023647008427
- Hybrid Recommendation System:
  - Blind test set RMSE: 1.1498
  - Test set RMSE: 1.14166
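For reference, the metrics reported above can be computed along these lines. This is a sketch under assumptions: precision and recall are taken here as set overlap between the produced pairs and a ground-truth set of pairs, which may differ from how the project's grader defines them, and the function names are hypothetical.

```python
from math import sqrt


def rmse(predicted, truth):
    """RMSE over the (user_id, business_id) keys present in both dicts."""
    keys = predicted.keys() & truth.keys()
    if not keys:
        return None
    return sqrt(sum((predicted[k] - truth[k]) ** 2 for k in keys) / len(keys))


def precision_recall(found, expected):
    """Precision and recall of a found set of pairs against the expected set."""
    found, expected = set(found), set(expected)
    true_positives = len(found & expected)
    precision = true_positives / len(found) if found else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    return precision, recall


# Hypothetical predictions and ground truth
predicted = {("u1", "b1"): 4.0, ("u2", "b2"): 2.0}
truth = {("u1", "b1"): 5.0, ("u2", "b2"): 2.0}
error = rmse(predicted, truth)  # sqrt((1 + 0) / 2)

p, r = precision_recall({("b1", "b2")}, {("b1", "b2"), ("b3", "b4")})
```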