- 2024/08-09: Implementation of Multimodal Embeddings and Two-Tower Model
- 2024/07: Design of Migrating to Multimodal Top-K Personalized Ranking Recommendation
- 2024/03-05: Legacy Score Prediction Hybrid Model
Implemented the notebook `multimodalRanking.ipynb`, which covers several aspects mentioned in `TODO_EN.md`.
Showcased the usage of multimodal methods (such as BERT/USE/CLIP) to handle various data sources and use these features for downstream tasks like rank-based recommendations and classification.
- Review Embeddings: extracted using the BERT model. Each review is processed by the `get_text_embedding` function, generating a 512-dimensional vector.
- Image Embeddings: extracted using the CLIP model. Each image is processed by the `get_image_embedding` function, generating a 512-dimensional vector.
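As a minimal sketch of this extraction step (assuming Hugging Face `transformers` checkpoints; note that BERT's hidden size is natively 768, so a 512-dimensional review vector implies some projection or truncation — the linear layer below is an assumption):

```python
# Minimal sketch of the two extractors; the model names and the 768->512
# projection are assumptions, since BERT output is natively 768-dimensional.
import torch
from PIL import Image
from transformers import BertModel, BertTokenizer, CLIPModel, CLIPProcessor

bert_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased").eval()
proj = torch.nn.Linear(768, 512)  # assumed projection to match CLIP's 512-d space

clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()

@torch.no_grad()
def get_text_embedding(text: str) -> torch.Tensor:
    tokens = bert_tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    cls_vec = bert(**tokens).last_hidden_state[:, 0]  # [CLS] token, 768-d
    return proj(cls_vec).squeeze(0)                   # 512-d review embedding

@torch.no_grad()
def get_image_embedding(image_path: str) -> torch.Tensor:
    image = Image.open(image_path).convert("RGB")
    inputs = clip_processor(images=image, return_tensors="pt")
    return clip.get_image_features(**inputs).squeeze(0)  # 512-d natively
```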
During training, review and image embeddings are combined with other user and business features. The specific steps are as follows:
- User Features: Extracted from user data (e.g., behavioral characteristics).
- Business Features: Extracted from business data (e.g., geographic location, categories).
The `YelpDataset` class builds the dataset by combining user indices, positive business indices, and negative business indices. Each sample consists of:
- User Features: Retrieved from the user feature dictionary.
- Positive Business Features: Retrieved from the business feature dictionary.
- Negative Business Features: Retrieved from the business feature dictionary.
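A minimal sketch of how such a dataset could be wired up (the triplet list and feature dictionaries below are assumed shapes, not the notebook's exact code):

```python
# Sketch: each sample is a (user, positive business, negative business) triplet
# looked up from the feature dictionaries described above.
import torch
from torch.utils.data import Dataset

class YelpDataset(Dataset):
    def __init__(self, triplets, user_features, business_features):
        # triplets: list of (user_idx, pos_business_idx, neg_business_idx)
        self.triplets = triplets
        self.user_features = user_features          # dict: user_idx -> feature vector
        self.business_features = business_features  # dict: business_idx -> feature vector

    def __len__(self):
        return len(self.triplets)

    def __getitem__(self, i):
        u, pos, neg = self.triplets[i]
        return (torch.as_tensor(self.user_features[u], dtype=torch.float32),
                torch.as_tensor(self.business_features[pos], dtype=torch.float32),
                torch.as_tensor(self.business_features[neg], dtype=torch.float32))
```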
During training, the `get_user_features` and `get_business_features` functions are called to retrieve user and business features, including the review and image embeddings. The steps are:
- User Features: retrieved using the `get_user_features` function, which returns a feature vector for the user.
- Business Features: retrieved using the `get_business_features` function, which returns business features together with review embeddings and image embeddings.
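The business-side lookup could plausibly take the following shape; the mean pooling over each business's review/image embeddings is an assumption (max/avg pooling is also mentioned as future work below):

```python
# Assumed shape of the business feature lookup: base business features
# concatenated with mean-pooled review and image embeddings.
import numpy as np

def get_business_features(biz_id, biz_feats, review_embs, image_embs):
    base = biz_feats[biz_id]  # e.g., location, categories (1-d feature vector)
    reviews = review_embs.get(biz_id) or [np.zeros(512)]  # list of 512-d vectors
    images = image_embs.get(biz_id) or [np.zeros(512)]
    return np.concatenate([base, np.mean(reviews, axis=0), np.mean(images, axis=0)])
```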
In the `forward` method of the `TwoTowerModel`, user and business features are fed into the model, passing through the user tower and business tower respectively. These towers produce user vectors and business vectors, which are then used to compute the contrastive loss for training.
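A sketch of the towers and a triplet-style contrastive loss (layer sizes and the margin are illustrative assumptions):

```python
# Sketch of the two-tower architecture with a triplet contrastive loss:
# the positive business should score higher against the user than the negative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoTowerModel(nn.Module):
    def __init__(self, user_dim: int, business_dim: int, embed_dim: int = 128):
        super().__init__()
        self.user_tower = nn.Sequential(
            nn.Linear(user_dim, 256), nn.ReLU(), nn.Linear(256, embed_dim))
        self.business_tower = nn.Sequential(
            nn.Linear(business_dim, 256), nn.ReLU(), nn.Linear(256, embed_dim))

    def forward(self, user_feats, business_feats):
        u = F.normalize(self.user_tower(user_feats), dim=-1)
        b = F.normalize(self.business_tower(business_feats), dim=-1)
        return u, b

def triplet_contrastive_loss(u, pos, neg, margin: float = 0.3):
    pos_sim = (u * pos).sum(-1)  # cosine similarity (vectors are normalized)
    neg_sim = (u * neg).sum(-1)
    return F.relu(margin - pos_sim + neg_sim).mean()
```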
The review and image embeddings are fed into the recommendation model through a multi-step process of extraction, feature combination, and dataset construction. These embeddings, along with other user and business features, serve as input features for model training and inference.
This notebook covers several aspects, including recommendation systems, text analysis, and image classification. It demonstrates how to leverage deep learning models (such as BERT and CLIP) to handle various data types and use these features for downstream tasks like recommendation and classification.
- Define the Objectives
- Aim: personalized ranked order of businesses for each user
- Metric: NDCG, with relevance levels derived from interaction strength (e.g., 2.0 > 1.0)
- Group Definition: based on both user and location
- ISSUE: lack of geo/implicit interaction data;
- Negative Sampling for the hybrid user-business model
- ISSUE: all non-interacted businesses should not be treated as negatives; a recall step with selection criteria is missing, so that only recalled candidates are used for model training.
- ISSUE: intensive analysis (e.g., location radius, user preferences) is required.
- Scaling predictions
- BLAS, file-based processing, etc.
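For scale, the scoring itself can be reduced to one BLAS-backed matrix multiply per chunk, spilled to disk; a sketch under those assumptions:

```python
# Illustrative only: score every user against every business with numpy's @
# (which dispatches to BLAS GEMM), chunked so the full matrix never sits in RAM.
import numpy as np

def score_in_chunks(user_vecs: np.ndarray, biz_vecs: np.ndarray,
                    out_path: str, chunk: int = 10_000) -> None:
    with open(out_path, "wb") as f:
        for start in range(0, user_vecs.shape[0], chunk):
            block = user_vecs[start:start + chunk] @ biz_vecs.T  # BLAS GEMM
            np.save(f, block.astype(np.float32))  # file-based processing
```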
Further work will focus on integrating with the existing ETL feature-engineering scripts using max/avg pooling, evaluating the XGBoost LambdaMART approach, refining negative sampling with a recall step (pending more interaction data released by Yelp), and model fine-tuning or comparison.
Others:
- Features & Negative sampling: consider Word2Vec (https://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/)
- Text Similarity: Contextual / Semantic-Matching / Attention-based Model
- Image Classification: EfficientNet/ImageNet
Graph 1: Cosine Similarity Heatmap of Review Embeddings across various categories, applied to a sample of Yelp review data
This plot showcases the cosine similarity between review embeddings from various categories. By computing the similarities between these embeddings and visualizing them as a heatmap, we can observe the following:
- Same-category reviews: Reviews within the same category tend to be closer to each other in vector space, reflected by brighter areas in the heatmap.
- Different-category reviews: Reviews from different categories are relatively farther apart in vector space, shown as darker areas in the heatmap. This indicates that the model successfully captures the similarity between reviews, especially within the same category.
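For reference, such a heatmap can be produced along these lines (the random embeddings below stand in for the notebook's BERT vectors):

```python
# Sketch: pairwise cosine similarity of review embeddings, rendered as a heatmap.
import numpy as np
import matplotlib.pyplot as plt

embeddings = np.random.rand(20, 512)  # placeholder for real review embeddings
normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
similarity = normed @ normed.T        # cosine similarity matrix

plt.imshow(similarity, cmap="viridis")
plt.colorbar(label="cosine similarity")
plt.title("Cosine Similarity of Review Embeddings")
plt.show()
```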
Graph 2: CLIP Photo Classification Precision Table
Content: This plot is a confusion matrix. It shows the precision with which the CLIP model predicted labels on the hand-labeled Yelp dataset. More importantly, it shows which categories get confused with one another.
- X-axis: Represents the predicted labels by the model.
- Y-axis: Represents the true labels.
Graph 3: Home Services Contractor Classifier ('Inside' and 'Outside')
Assesses how accurately the model recognizes images related to home services. CLIP removes the constraint of needing a large number of labeled examples for each category the model infers. Reviewing the possible output classes from CLIP could lead to more diversified content tags on Yelp.
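Zero-shot CLIP classification for this task could be sketched as follows (the prompt wording is an assumption; prompt choice materially affects accuracy):

```python
# Sketch of a zero-shot 'Inside'/'Outside' classifier built on CLIP.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
labels = ["a photo taken inside a home", "a photo taken outside a home"]

@torch.no_grad()
def classify(image_path: str) -> str:
    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
    logits = model(**inputs).logits_per_image  # similarity of the image to each prompt
    return labels[logits.argmax().item()]
```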
Graph 4: (BERT) A more complete overview of text-based cosine similarity of Yelp's review embeddings, similar to Graph 1.
Graph 5: (USE) A more complete overview of text-based cosine similarity of Yelp's review embeddings
The new document `TODO_EN.md` introduces a ranking-focused hybrid recommendation system that utilizes multimodal data (text and images) to generate embeddings and employs XGBoost's LambdaMART for ranking optimization. The current version is a regression model; the new version will be a ranking model. `README.md` describes the previous version, and `TODO_EN.md` describes the new version. Here is a snapshot of the new version:
The project aims to implement a ranking-focused hybrid recommendation system that utilizes multimodal data (text and images) to generate embeddings and employs XGBoost's LambdaMART for ranking optimization. The model's task is to recommend a personalized ranked order of businesses for each user. The main optimization goal is to enhance recommendation accuracy with multimodal content-based embeddings, especially in cold-start user scenarios and visually-driven business contexts.
- Python, Spark: For data processing and model training
- XGBoost (LambdaMART): For ranking optimization
- Universal Sentence Encoder (USE): For generating text embeddings
- CLIP: For generating image embeddings
- ALS (Alternating Least Squares): For implicit matrix factorization
Switch the existing regression target to a ranking optimization target, using XGBoost's LambdaMART to implement a ranking-based recommendation system.
- Evaluation Metrics: Use NDCG and MAP as the main evaluation metrics, replacing the original RMSE (Root Mean Square Error).
- Grouped Processing: Define a group of businesses for each user and ensure optimal sorting of business results generated for the same user during training.
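A hedged sketch of that setup with `xgboost`'s ranking API (the synthetic arrays stand in for the real per-user grouped feature matrix):

```python
# Sketch of LambdaMART-style ranking with per-user groups and an NDCG objective.
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 16))         # candidate features (placeholder)
y = rng.integers(0, 3, size=1000)       # graded relevance: 0 / 1 / 2
groups = [10] * 100                     # 100 user groups x 10 candidates each

ranker = xgb.XGBRanker(objective="rank:ndcg", n_estimators=200)
ranker.fit(X, y, group=groups)          # listwise training within each group
scores = ranker.predict(X)              # sort candidates within a group by score
```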
Regression-based Hybrid Recommendation Model: an offline, scalable, supervised recommendation model for user-to-business ratings.
Implemented mainly with Spark RDDs; pandas DataFrames are used for feature combination and as model input.
Hybrid: Supervised learning to combine both approaches:
- Content-based
- Collaborative
Terms:
- Models: ALS Matrix Factorization approach, K-means, XGBRegressor, (Graph Node2Vec, Universal Sentence Encoder)
- Packages/framework: SparkRDD, spark.ml, (gensim, spacy, tensorflow, ... )
- Scaling: K-means Clusters identifying localized user-business groups
- Fine-Tuning: Bayesian optimization search (the Spark-based version of hyperopt)
- Evaluation Metric: RMSE
This script implements ALS (Alternating Least Squares) for matrix factorization to derive feature vectors for users and businesses based on implicit interactions from `tips.json`. Additionally, it includes a logarithmic transformation to standardize raw interaction counts into a more normalized rating scale (from 1 to 5).
- Matrix Factorization:
  - Uses `ALS.trainImplicit` to perform matrix decomposition and obtain feature vectors for users and businesses.
  - Default scores are used for user-business pairs where feature vectors are not available.
- Logarithmic Transformation:
  - Standardizes raw interaction counts into a 1-5 rating scale (see the sketch below).
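A sketch of this script's core step, assuming `pyspark.mllib` and a `log1p`-based 1-5 mapping (the sample data and parameters are illustrative):

```python
# Sketch of implicit ALS over tip counts, with a log transform onto a 1-5 scale.
import math
from pyspark import SparkContext
from pyspark.mllib.recommendation import ALS, Rating

sc = SparkContext.getOrCreate()

def log_transform(count: int) -> float:
    # One way to squash raw tip counts into a 1-5 rating scale (assumption).
    return min(5.0, 1.0 + math.log1p(count))

# interactions: RDD of (user_idx, business_idx, tip_count) built from tips.json
interactions = sc.parallelize([(0, 0, 3), (0, 1, 1), (1, 0, 7)])
ratings = interactions.map(lambda t: Rating(t[0], t[1], log_transform(t[2])))

model = ALS.trainImplicit(ratings, rank=16, iterations=10, alpha=0.4)
user_vecs = dict(model.userFeatures().collect())    # user feature vectors
biz_vecs = dict(model.productFeatures().collect())  # business feature vectors
```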
This script enhances feature extraction by integrating category information into the user-business scoring data, preparing it for clustering algorithms like KMeans.
- Category Processing:
  - The `process_unique_category` function handles and integrates unique category and city data.
  - Merges review and category data for further processing.
- Inverse Logarithmic Transformation:
To enhance model sensitivity and adjust the scoring scale for specific applications, scores are magnified post-transformation. For instance:
```python
score = float(50 * inverse_log_transform(x[1][1])) if x[1][1] else 0
```
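The definition of `inverse_log_transform` is not shown in this document; one plausible form, assuming the forward transform was `log1p`-based, is:

```python
import math

def inverse_log_transform(y: float) -> float:
    # Assumed inverse of a log1p-style forward transform; the actual
    # definition lives in the script and may differ.
    return math.expm1(y)
```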
In this project, I employ KMeans clustering to understand and segment users based on their geographical interaction patterns with businesses. This clustering approach helps in tailoring marketing strategies and enhancing model accuracy by adapting to localized user behaviors.
This script performs clustering based on the cities visited by the users. By analyzing these patterns, the model can group users with similar geographical preferences, which is valuable for localized marketing and personalized recommendation systems.
Steps Involved:
- City Visits Extraction: Identify each city visited by users for each business interaction.
- Visit Count Calculation: Compute how often each city is visited and create a distribution vector of city visits.
- Data Standardization: Standardize the city visit distribution using `StandardScaler` to prepare for clustering.
- Cluster Formation: Implement KMeans clustering with a predefined number of clusters (in this case, 9) to segment users based on their city visit patterns.
- Cluster Analysis: Analyze mean visit frequencies per cluster to understand geographical tendencies of user segments.
This complements the clustering by integrating categorical data from user interactions with businesses, providing a richer set of features that can be used to refine user profiles and personalize experiences further.
Scaling Models
- Clustering allows for the segmentation of users into meaningful groups that can have models tailored to their specific characteristics.
- By adjusting model parameters or strategies for different clusters, performance can be optimized on a per-group basis, enhancing overall efficiency.
Model Ensembling
- Combining predictions from multiple models can lead to more accurate and robust predictions.
- Clusters can serve as a meta-feature in larger machine learning frameworks (e.g., input to neural networks or decision trees), helping to predict user behaviors or preferences more effectively.
- Alternatively, different predictive models can be employed for each cluster, and their outputs merged to form a composite prediction, thereby leveraging strengths of various approaches.
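The per-cluster variant could be sketched as follows (frame and column names are placeholders, not the project's actual API):

```python
# Sketch of per-cluster ensembling: fit one regressor per KMeans cluster,
# then merge per-cluster predictions back into a single series.
import pandas as pd
from xgboost import XGBRegressor

def fit_per_cluster(df: pd.DataFrame, feature_cols: list, target_col: str,
                    cluster_col: str = "Cluster") -> dict:
    models = {}
    for c, part in df.groupby(cluster_col):
        model = XGBRegressor(n_estimators=200, max_depth=6)
        model.fit(part[feature_cols], part[target_col])
        models[c] = model
    return models

def predict_per_cluster(models: dict, df: pd.DataFrame, feature_cols: list,
                        cluster_col: str = "Cluster") -> pd.Series:
    preds = pd.Series(index=df.index, dtype=float)
    for c, part in df.groupby(cluster_col):
        preds.loc[part.index] = models[c].predict(part[feature_cols])
    return preds
```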
```python
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Scaling city visit data (all columns except the user id)
scaler = StandardScaler()
city_matrix_scaled = scaler.fit_transform(city_matrix.drop('user_id', axis=1))

# Applying KMeans
kmeans = KMeans(n_clusters=9, random_state=42)
clusters = kmeans.fit_predict(city_matrix_scaled)
city_matrix['Cluster'] = clusters  # attach cluster labels before grouping

# Analyzing clusters: mean visit frequency per city within each cluster
# (`all_cities` is the list of city columns built in the extraction step)
cluster_city_means = city_matrix.groupby('Cluster')[all_cities].mean()
```
- `better_XGB.py` is the end-to-end script that runs everything and outputs `prediction.csv`; within it, `KMeans_user_cluster.py` further processes the features from `better_features.py`, preparing them for model training and evaluation.
- `better_features.py` is a class for user-business interaction data ETL and provides an interface for train and test data.
- `utils.py` and `KMeans_user_cluster.py` are encapsulated and served as helper functions.
Deploy in your local environment using the provided Dockerfile.
My workstation is an M3-chip MacBook; you may want to uncomment the first line of the Dockerfile to build on an AMD64-based architecture.
```dockerfile
# FROM --platform=linux/amd64 ubuntu:20.04 AS builder
# If on an M-chip Mac, use arm64; on an Intel chip, use amd64
ENV JAVA_HOME=/usr/lib/jvm/java-8-openjdk-arm64
# ENV JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
```