The goal of the challenge is to develop a system for the task of automatic playlist continuation. Given a set of playlist features, participants’ systems shall generate a list of recommended tracks that can be added to that playlist, thereby ‘continuing’ the playlist. We define the task formally as follows:
Input
A user-created playlist, represented by:
- Playlist metadata
- K seed tracks: a list of the K tracks in the playlist, where K can equal 0, 1, 5, 10, 25, or 100.
Output
- A list of 500 recommended candidate tracks, ordered by relevance in decreasing order.
The Million Playlist Dataset (MPD) contains 1,000,000 playlists created by users on the Spotify platform between January 2010 and October 2017. It can be used by researchers interested in exploring how to improve the music listening experience.
Each playlist in the MPD contains a playlist title, the track list (including track metadata), editing information (last edit time, number of playlist edits), and other miscellaneous information about the playlist.
Detailed description
The Million Playlist Dataset consists of 1,000 slice files. These files have the naming convention of:
mpd.slice.STARTING_PLAYLIST_ID_-_ENDING_PLAYLIST_ID.json
For example, the first 1,000 playlists in the MPD are in a file called mpd.slice.0-999.json and the last 1,000 playlists are in a file called mpd.slice.999000-999999.json.
Each slice file is a JSON dictionary with two fields: info and playlists.
The info field is a dictionary that contains general information about the particular slice:
- slice - the range of playlist IDs in this particular slice file, such as 0-999
- version - the current version of the MPD (which should be v1)
- generated_on - a timestamp indicating when the slice was generated.
The playlists field is an array that typically contains 1,000 playlists. Each playlist is a dictionary that contains the following fields:
- pid - integer - playlist id - the MPD ID of this playlist. This is an integer between 0 and 999,999.
- name - string - the name of the playlist
- description - optional string - if present, the description given to the playlist. Note that user-provided playlist descriptions are a relatively new feature of Spotify, so most playlists do not have descriptions.
- modified_at - seconds - timestamp (in seconds since the epoch) when this playlist was last updated. Times are rounded to midnight GMT of the date when the playlist was last updated.
- num_artists - the total number of unique artists for the tracks in the playlist.
- num_albums - the number of unique albums for the tracks in the playlist
- num_tracks - the number of tracks in the playlist
- num_followers - the number of followers this playlist had at the time the MPD was created. (Note that the follower count does not include the playlist creator.)
- num_edits - the number of separate editing sessions. Tracks added in a two hour window are considered to be added in a single editing session.
- duration_ms - the total duration of all the tracks in the playlist (in milliseconds)
- collaborative - boolean - if true, the playlist is a collaborative playlist. Multiple users may contribute tracks to a collaborative playlist.
- tracks - an array of information about each track in the playlist. Each element in the array is a dictionary with the following fields:
- track_name - the name of the track
- track_uri - the Spotify URI of the track
- album_name - the name of the track's album
- album_uri - the Spotify URI of the album
- artist_name - the name of the track's primary artist
- artist_uri - the Spotify URI of the track's primary artist
- duration_ms - the duration of the track in milliseconds
- pos - the position of the track in the playlist (zero-based)
Here's an example of a typical playlist entry:
{
"name": "musical",
"collaborative": "false",
"pid": 5,
"modified_at": 1493424000,
"num_albums": 7,
"num_tracks": 12,
"num_followers": 1,
"num_edits": 2,
"duration_ms": 2657366,
"num_artists": 6,
"tracks": [
{
"pos": 0,
"artist_name": "Degiheugi",
"track_uri": "spotify:track:7vqa3sDmtEaVJ2gcvxtRID",
"artist_uri": "spotify:artist:3V2paBXEoZIAhfZRJmo2jL",
"track_name": "Finalement",
"album_uri": "spotify:album:2KrRMJ9z7Xjoz1Az4O6UML",
"duration_ms": 166264,
"album_name": "Dancing Chords and Fireflies"
},
{
"pos": 1,
"artist_name": "Degiheugi",
"track_uri": "spotify:track:23EOmJivOZ88WJPUbIPjh6",
"artist_uri": "spotify:artist:3V2paBXEoZIAhfZRJmo2jL",
"track_name": "Betty",
"album_uri": "spotify:album:3lUSlvjUoHNA8IkNTqURqd",
"duration_ms": 235534,
"album_name": "Endless Smile"
},
{
"pos": 2,
"artist_name": "Degiheugi",
"track_uri": "spotify:track:1vaffTCJxkyqeJY7zF9a55",
"artist_uri": "spotify:artist:3V2paBXEoZIAhfZRJmo2jL",
"track_name": "Some Beat in My Head",
"album_uri": "spotify:album:2KrRMJ9z7Xjoz1Az4O6UML",
"duration_ms": 268050,
"album_name": "Dancing Chords and Fireflies"
},
// 8 tracks omitted
{
"pos": 11,
"artist_name": "Mo' Horizons",
"track_uri": "spotify:track:7iwx00eBzeSSSy6xfESyWN",
"artist_uri": "spotify:artist:3tuX54dqgS8LsGUvNzgrpP",
"track_name": "Fever 99\u00b0",
"album_uri": "spotify:album:2Fg1t2tyOSGWkVYHlFfXVf",
"duration_ms": 364320,
"album_name": "Come Touch The Sun"
}
]
}
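A slice can be loaded with plain Python; here is a minimal sketch (the path is illustrative):

```python
import json

# Load one slice of the MPD (adjust the path to wherever the dataset lives)
with open("data/mpd.slice.0-999.json") as f:
    mpd_slice = json.load(f)

print(mpd_slice["info"]["slice"])          # e.g. "0-999"
for playlist in mpd_slice["playlists"]:
    print(playlist["pid"], playlist["name"], playlist["num_tracks"])
```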
I built my own challenge set based on the criteria of the official one, but with some modifications.
Test Set Format
The challenge set consists of a single JSON dictionary with three fields:
- date - the date the challenge set was generated. This should be "2018-01-16 08:47:28.198015"
- version - the version of the challenge set. This should be "v1"
- playlists - an array of 10,000 incomplete playlists. Each element in this array contains the following fields:
- pid - the playlist ID
- name - (optional) - the name of the playlist. For some challenge playlists, the name will be missing.
- num_holdouts - the number of tracks that have been omitted from the playlist
- tracks - a (possibly empty) array of tracks that are in the playlist. Each element of this array contains the following fields:
- pos - the position of the track in the playlist (zero offset)
- track_name - the name of the track
- track_uri - the Spotify URI of the track
- artist_name - the name of the primary artist of the track
- artist_uri - the Spotify URI of the primary artist of the track
- album_name - the name of the album that the track is on
- album_uri - the Spotify URI of the album that the track is on
- duration_ms - the duration of the track in milliseconds
- num_samples - the number of tracks included in the playlist
- num_tracks - the total number of tracks in the playlist.
How to build the test set
The dataset contains PIDs from 0 to 999,999 (1 million playlists), with M unique tracks.
To replicate the competition's test set, we remove some playlists from the original dataset such that:
- All tracks in the challenge set appear in the MPD
- All holdout tracks appear in the MPD
The test set contains 10 different challenges; each challenge contains 1,000 playlists sampled from the MPD (a construction sketch follows the list):
- Predict tracks for a playlist given its title and the first 5 tracks
- Predict tracks for a playlist given its title and the first 10 tracks
- Predict tracks for a playlist given its title and the first 25 tracks
- Predict tracks for a playlist given its title and 25 random tracks
- Predict tracks for a playlist given its title and the first 50 tracks
- Predict tracks for a playlist given its title and 50 random tracks
- Predict tracks for a playlist given its title and the first 100 tracks
- Predict tracks for a playlist given its title and 100 random tracks
- Predict tracks for a playlist given its title and the first 200 tracks
- Predict tracks for a playlist given its title and 200 random tracks
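A minimal sketch of how one such challenge playlist could be carved out of an MPD playlist (the function and its name are illustrative, not the exact code used in this repo):

```python
import random

def make_challenge_playlist(playlist, n_seed, shuffle=False):
    """Keep n_seed tracks as seeds and hold out the rest for evaluation."""
    tracks = list(playlist["tracks"])
    if shuffle:                        # the "random tracks" challenge variants
        random.shuffle(tracks)
    seeds, holdouts = tracks[:n_seed], tracks[n_seed:]
    challenge = {
        "pid": playlist["pid"],
        "name": playlist.get("name"),  # omitted in some challenge variants
        "num_tracks": playlist["num_tracks"],
        "num_samples": len(seeds),
        "num_holdouts": len(holdouts),
        "tracks": seeds,
    }
    return challenge, holdouts
```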
How to build the train set
The train set is the original dataset minus the tracks held out in the test set.
Submissions will be evaluated using the following metrics. All metrics will be evaluated at both the track level (exact track must match) and the artist level (any track by that artist is a match). In the following, we denote the ground truth set of tracks by G, and the ordered list of recommended tracks by R. The size of a set or list is denoted by | ⋅ |, and we use from:to-subscripts to index a list. In the case of ties on individual metrics, earlier submissions are ranked higher.
R-precision
R-precision is the number of retrieved relevant tracks divided by the number of known relevant tracks (i.e., the number of withheld tracks):
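$$\text{R-precision} = \frac{\left| G \cap R_{1:|G|} \right|}{|G|}$$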
The metric is averaged across all playlists in the challenge set. This metric rewards the total number of retrieved relevant tracks (regardless of order).
Normalized discounted cumulative gain (NDCG)
Discounted cumulative gain (DCG) measures the ranking quality of the recommended tracks, increasing when relevant tracks are placed higher in the list. Normalized DCG (NDCG) is determined by calculating the DCG and dividing it by the ideal DCG in which the recommended tracks are perfectly ranked:
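$$\text{DCG} = rel_1 + \sum_{i=2}^{|R|} \frac{rel_i}{\log_2 i}$$

where $rel_i = 1$ if the track at position $i$ of $R$ is relevant (i.e., appears in $G$) and $0$ otherwise.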
The ideal DCG (IDCG) is, in our case, equal to:
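$$\text{IDCG} = 1 + \sum_{i=2}^{|G \cap R|} \frac{1}{\log_2 i}$$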
If the set intersection of G and R is empty, then the DCG is equal to 0. The NDCG metric is then calculated as:
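$$\text{NDCG} = \frac{\text{DCG}}{\text{IDCG}}$$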
Recommended Songs clicks
Recommended Songs is a Spotify feature that, given a set of tracks in a playlist, recommends 10 tracks to add to the playlist. The list can be refreshed to produce 10 more tracks. Recommended Songs clicks is the number of refreshes needed before a relevant track is encountered. It is calculated as follows:
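$$\text{clicks} = \left\lfloor \frac{\min \{ i : R_i \in G \} - 1}{10} \right\rfloor$$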
If the metric does not exist (i.e. if there is no relevant track in R), a value of 51 is picked (which is 1 + the maximum number of clicks possible).
Rank aggregation
Final rankings will be computed by using the Borda Count election strategy. For each of the rankings of p participants according to R-precision, NDCG, and Recommended Songs clicks, the top ranked system receives p points, the second system receives p-1 points, and so on. The participant with the most total points wins. In the case of ties, we use top-down comparison: compare the number of 1st place positions between the systems, then 2nd place positions, and so on.
A recommender system suggests relevant products such as books, movies, songs, or friends to a customer. There are two basic types of recommenders: content-based filtering and collaborative filtering. Generally speaking, content-based systems are simpler but produce less interesting recommendations. Collaborative systems can get very complicated and unwieldy and require a lot of user-generated data, but they are the state of the art.
Content-based filtering: experts or customers manually inspect products and label them with various categories / attributes. Once we have these features, we can calculate the similarity between items and retrieve relevant ones.
Collaborative filtering (CF): in contrast to content-based filtering, collaborative filtering does not require manual labels. The system looks at users who have similar tastes / behavior to make product suggestions. Collaborative filtering itself comes in two types: memory-based and model-based. While memory-based CF constructs a large user-item matrix to calculate similarity, model-based CF uses machine learning algorithms such as clustering, matrix factorization, and deep learning.
In memory-based filtering, there are again two types: User-Item CF and Item-Item CF.
- User-Item CF: take a particular person, find people who are similar to that person based on similar ratings, and recommend items those similar people liked.
- Item-Item CF: take a particular item, find people who liked that item, and find other items that those people (or people similar to them) also liked. It takes items as input and outputs other items as recommendations.
- In short, User-Item CF is "Customers who are similar to you also liked ..." and Item-Item CF is "Customers who liked this item also liked ...".
Memory-based algorithms are easy to implement and produce reasonable prediction quality. The drawbacks of memory-based CF are that it doesn't scale to real-world scenarios and doesn't address the well-known cold-start problem, i.e., when a new user or a new item enters the system. Model-based CF methods are scalable and can deal with higher sparsity levels than memory-based models, but they also suffer when new users or items without any ratings enter the system.
In model-based CF, we start with matrix factorization. Matrix factorization is widely used for recommender systems because it deals better with scalability and sparsity than memory-based CF.
The goal of matrix factorization is to learn the latent preferences of users and the latent attributes of items from the known ratings, and then predict the unknown ratings through the dot product of the latent features of users and items. Given a very sparse, high-dimensional user-item matrix, matrix factorization restructures it into a low-rank form: the matrix is represented as the product of two low-rank matrices whose rows contain the latent vectors. Multiplying the low-rank factors together approximates the original matrix as closely as possible and fills in its missing entries.
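Concretely, an $n \times m$ playlist-song matrix $X$ is approximated by two rank-$k$ factors with $k \ll n, m$:

$$X \approx U V^T, \qquad \hat{x}_{p,s} = u_p \cdot v_s,$$

where row $u_p$ of $U$ is the latent vector of playlist $p$ and row $v_s$ of $V$ is the latent vector of song $s$.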
For our problem, we treat a "user" as a playlist and an "item" as a song. Because the dataset does not provide much content information about each song, we won't use content-based filtering. Therefore we focus on:
- KNN
- Collaborative Filtering
- Frequent Pattern Growth
- Word2Vec
- Playlist-based CF: from the similarity between playlists and how other playlists "rate" a track (include it or not), I can infer the current "rating".
- Song-based CF: from the similarity between songs and how the current playlist "rates" other songs, I can infer the current "rating".
In collaborative filtering, we follow this procedure:
- Construct the playlist-song matrix
"1" means that the song is included in the playlist and "0" otherwise. For example, playlist 1 contains song 2 and song 3, and song 2 is also included in playlist 2.
- Split the data into training and testing sets. Refer to the test set section above for how the test set is built.
- Calculate the song-song or playlist-playlist similarity. For playlist-playlist similarity we take each row as a vector, while for song-song similarity we take each column as a vector. Similarity metrics:
  - Cosine: $\mathrm{sim}(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}$
  - Euclidean: distance $d(u, v) = \sqrt{\sum_i (u_i - v_i)^2}$, converted to a similarity, e.g. $\frac{1}{1 + d(u, v)}$
  - Pearson correlation: $r(u, v) = \frac{\sum_i (u_i - \bar{u})(v_i - \bar{v})}{\sqrt{\sum_i (u_i - \bar{u})^2} \sqrt{\sum_i (v_i - \bar{v})^2}}$
- Based on the similarity matrix, make predictions on the testing set.
For playlist-playlist, the predicted score that playlist p contains song s is given by the weighted sum of all other playlists' inclusion of song s, where the weight is the cosine similarity between each playlist and the input playlist p; the result is then normalized. With song-song, we simply replace the playlist similarity matrix with the song similarity matrix.
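In formula form (the standard memory-based CF prediction, with $x_{p',s} \in \{0, 1\}$ the inclusion indicator):

$$\hat{x}_{p,s} = \frac{\sum_{p' \neq p} \mathrm{sim}(p, p') \, x_{p',s}}{\sum_{p' \neq p} \left| \mathrm{sim}(p, p') \right|}$$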
- Playlist-based (like user-based)
- Build the similarity matrix between playlists (cosine, Euclidean, Pearson correlation)
- For each playlist Px:
  n = 1
  while fewer than 500 candidate tracks have been collected:
      find the n-th most similar playlist to Px, called Pr
      add the top K (or all) songs in Pr to the candidates
      n += 1
- Song-based (like item-based)
(songs are represented in similarity space)
- Build the similarity matrix between songs (cosine, Euclidean, Pearson correlation)
- For each playlist Px:
  compute a "cluster center" by averaging the similarity vectors of the songs in Px
  get the K = 500 nearest neighbors of that center and add them to the existing songs
- Evaluate based on 4 metrics
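A minimal sketch of the playlist-based prediction on a toy matrix (illustrative only; the real 1M x 2.26M matrix must stay sparse):

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import cosine_similarity

# Toy playlist-song matrix: rows = playlists, columns = songs,
# 1.0 = the song is included in the playlist
X = csr_matrix([
    [0., 1., 1., 0.],   # playlist 0
    [0., 1., 0., 1.],   # playlist 1
    [1., 0., 0., 1.],   # playlist 2
])

sim = cosine_similarity(X)          # playlist-playlist similarity matrix
np.fill_diagonal(sim, 0.0)          # a playlist should not vote for itself

# Score every song for playlist 0: similarity-weighted votes, normalized
w = sim[0]
scores = X.T.dot(w) / (np.abs(w).sum() + 1e-9)
scores[X[0].toarray().ravel() > 0] = -np.inf   # never re-recommend seed songs
print(np.argsort(-scores)[:2])      # indices of the top-2 recommended songs
```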
Matrix factorization can be done with orthogonal factorization (SVD), probabilistic matrix factorization (PMF), or non-negative matrix factorization (NMF).
Singular Value Decomposition (SVD): collaborative filtering can be formulated as approximating a matrix X using singular value decomposition. The winning team at the Netflix Prize competition used SVD matrix factorization models to produce product recommendations; for more information, I recommend the articles "Netflix Recommendations: Beyond the 5 stars" and "Netflix Prize and SVD". The general equation is $X = U S V^T$.
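A minimal sketch using scipy's truncated SVD on a sparse matrix (the matrix here is random and k = 50 is an illustrative choice, not a tuned value):

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import svds

# Stand-in for the (much larger) sparse playlist-song matrix
X = sp.random(1000, 5000, density=0.01, format="csr", dtype=np.float64)

U, s, Vt = svds(X, k=50)            # truncated SVD: X ~ U @ diag(s) @ Vt

# Predicted scores for playlist 0 against every song, best first
scores = (U[0] * s) @ Vt
top500 = np.argsort(-scores)[:500]  # the 500 recommended candidate tracks
```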
Deep Learning: once we have the latent feature vectors of playlists and songs, we can feed them to a deep learning model (or any machine learning model) to predict the results.
- Combine memory-based and model-based approaches
- A single PC may not be able to handle this dataset
Number of:
- playlists: 1,000,000
- tracks: 66,346,428
- unique tracks: 2,262,292
- unique albums: 734,684
- unique titles: 92,944
Distribution of: playlist duration, number of albums per playlist, number of artists per playlist, number of edits per playlist, number of followers per playlist, number of tracks per playlist
As we can see, all distributions are right-skewed (long right tail), so for a typical value we should use the median rather than the mean:
- Median playlist duration: 11,422,438 ms (about 3.2 hours)
- Median number of albums per playlist: 37
- Median number of artists per playlist: 29
- Median number of edits per playlist: 29
- Median number of followers per playlist: 1
- Median number of tracks per playlist: 49
Top 20 songs in Spotify playlists
Top 20 artists in Spotify playlists
- Description of the small dataset:
- df_train: columns = [pid, tid, pos]; 1 million playlists, of which 10,000 are incomplete
- df_test: columns = [pid, tid, pos]; the 10,000 complete playlists
- Files:
- buildChallengeSet: replicates the challenge set
- buildPLaylistSongMatrix: exports the playlist-song matrix in the format [pid, list(tid), list(pos)]
- helper: helper functions
- evalution: evaluation metrics
- baseline
- Results of item-item / user-item based models

| Method | Parameter | R-precision | NDCG | Song Click | Time Taken (s) |
|---|---|---|---|---|---|
| Playlist-based baseline | | 0.7766 | 1.6010 | 0.0 | 41.42 |
| Song-based baseline | | 0.7847 | 0.7975 | 0.0 | 4183 |
| Word2Vec + Song-based | 100-200-300 dimensions | 0.0030 | 0.004 | 10.35 | |
| Word2Vec + Playlist-based | min_fre = 3, dimension 50 | 0.0171 | 0.015 | 8.086 | |
| Word2Vec + Playlist-based | min_fre = 3, dimension 100 | 0.0190 | 0.0172 | 7.805 | |
| Playlist-based CF (get top K rating songs) | | 0.7844 | 0.8011 | 0.0 | approx 12000 |
| FP Growth | | | | | |
| Method | Parameter | RMS | R-precision |
|---|---|---|---|
| Playlist-based CF (get top K rating songs) | | 0.00716 | 0.621 |
| Song-based CF | | | |
| Matrix Factorization | | | |
- Conclusion
- Use the Python library Surprise to do matrix factorization
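A minimal sketch with Surprise (the binary rating encoding and the tiny DataFrame are my assumptions; with only positive ratings this is degenerate, so negative sampling of absent songs may be needed in practice):

```python
import pandas as pd
from surprise import SVD, Dataset, Reader

# Toy (playlist, track, rating) triples; 1.0 = track is in the playlist
df = pd.DataFrame({
    "pid":    [0, 0, 1, 2],
    "tid":    [10, 11, 11, 12],
    "rating": [1.0, 1.0, 1.0, 1.0],
})

reader = Reader(rating_scale=(0, 1))
data = Dataset.load_from_df(df[["pid", "tid", "rating"]], reader)
trainset = data.build_full_trainset()

algo = SVD(n_factors=50)            # matrix factorization trained with SGD
algo.fit(trainset)

# Estimated score that playlist 0 would include track 12
print(algo.predict(0, 12).est)
```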
Week 0
- Project Proposal
Week 1
- Work on some CF tutorials using movie data.
- Complete Section 16: Data Science at Scale, focusing on Spark and PySpark
Week 2
- Parse the JSON into a DataFrame and save it to disk for later use.
- Take a small subset of the data and do the same. Start with memory-based models (item-item and user-item) -> not enough memory, even for the small subset
- Apply what you've learned to do some basic EDA on the data.
- Set up Spark on the local machine / server.
- Read the RecSys winners' solutions: https://github.com/VasiliyRubtsov/recsys2018/blob/master/json_to_dataframe.ipynb
Week 3
- Build the test set
- Configure Spyder to connect to the server
- Implement functions to compute the metrics
- Build a giant user-item table -> DON'T DO THIS -> actually, I did it
Week 4
- Word2vec model for the song-based model in Spark
- Word2vec model for the playlist-based model in Spark
- Implement playlist-based CF -> needs testing
- Implement song-based CF -> needs testing
- NMF with scikit-learn using scipy.sparse vectors
- ALS with Vectors.sparse in Spark
Week 5
Week 6
- Compare the models with the various metrics
- Tune the models and finalize the results
- Finish the report
- Many Types of Recommender Systems
- Playlist Recommender
- Million Song Dataset
- Amazon Recommendations Item-to-Item Collaborative Filtering
- Collaborative Filtering at Spotify
- Implementing CF in Python
- Implementing CF in Python with Last.fm dataset
- Theory and implementation of CF (must read)
- Spark API collaborative filtering
- Large-Scale Parallel Collaborative Filtering for the Netflix Prize
- Advantage of item-based CF over user-based CF
- Good Paper about their work
- FPGrowth in Spark -> You may use it after CF
- Matrix Factorization at scale in Spark
- ALS without Spark
- Install Spark, or just execute:
conda install pyspark
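To sanity-check the install, here is a minimal ALS example (the implicit-feedback settings are a suggestion, not tuned values):

```python
from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS

spark = SparkSession.builder.appName("mpd-als").getOrCreate()

# Toy (playlist, track, rating) triples; 1.0 = track is in the playlist
ratings = spark.createDataFrame(
    [(0, 10, 1.0), (0, 11, 1.0), (1, 11, 1.0), (2, 12, 1.0)],
    ["pid", "tid", "rating"],
)

als = ALS(userCol="pid", itemCol="tid", ratingCol="rating",
          rank=50, implicitPrefs=True)   # implicit feedback suits 0/1 data
model = als.fit(ratings)
model.recommendForAllUsers(5).show()     # top-5 track candidates per playlist
```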