Collaborative Filtering with Temporal Dynamics, (Koren 2009)
Deep Learning Recommender Model for Personalization and Recommendation Systems, (Naumov et al. 2019)
What are Recommender Systems?
The concept of recommender systems emerged in the late 1990s / early 2000s as social life moved online:
online purchasing and commerce
online discussions and ratings
social information sharing
In these systems the amount of content was exploding and users were having a hard time finding things they were interested in.
Users wanted recommendations.
Over time, the problem has only gotten worse:
An enormous need has emerged for systems to help sort through products, services, and content items.
This often goes by the term personalization.
Some examples:
Movie recommendation (Netflix ~6.5K movies and shows, YouTube ~14B videos)
Related product recommendation (Amazon ~600M products)
Web page ranking (Google >>100B pages)
Social content filtering (Facebook, Twitter)
Services (Airbnb, Uber, TripAdvisor)
News content recommendation (Apple News)
Priority inbox & spam filtering (Gmail)
Online dating (Match.com)
A more formal view:
User - requests content
Objects - instances of content
Context - device, location, time, history
Interface - browser, mobile
Inferring Preferences
Unfortunately, users generally have a hard time explaining what types of content they prefer.
Some early systems worked by interviewing users to ask what they liked. Those systems did not work very well.
A very interesting article about the earliest personalization systems is User Modeling via Stereotypes by Elaine Rich, dating from 1979.
Instead, modern systems work by capturing user’s opinions about specific items.
This can be done actively:
When a user is asked to rate a movie, product, or experience,
or it can be done passively:
By noting which items a user chooses to purchase (for example).
Challenges
The biggest issue is scalability: typical data for this problem is huge.
Millions of objects
100s of millions of users
Changing user base
Changing inventory (movies, stories, goods)
Available features
Imbalanced dataset
User activity / item reviews are power law distributed
This data is a subset of the data presented in: “From amateurs to connoisseurs: modeling the evolution of user expertise through online reviews,” by J. McAuley and J. Leskovec. WWW, 2013
Example
Let’s look at a dataset for testing recommender systems consisting of Amazon movie reviews:
We’ll download a compressed pickle file containing the data if it is not already present.
Code
# This is a 647 MB file, delete it after useimport gdownurl ="https://drive.google.com/uc?id=14GakA7oOjbQp7nxcGApI86WlP3GrYTZI"pickle_output ="train.pkl.gz"import os.pathifnot os.path.exists(pickle_output): gdown.download(url, pickle_output)
from IPython.display import display, Markdownn_users = df["UserId"].unique().shape[0]n_movies = df["ProductId"].unique().shape[0]n_reviews =len(df)display(Markdown(f'There are:\n'))display(Markdown(f'* {n_reviews:,} reviews\n* {n_movies:,} movies\n* {n_users:,} users'))display(Markdown(f'There are {n_users * n_movies:,} potential reviews, meaning sparsity of {(n_reviews/(n_users * n_movies)):0.4%}'))
There are:
1,697,533 reviews
50,052 movies
123,960 users
There are 6,204,445,920 potential reviews, meaning sparsity of 0.0274%
where
\[
\text{sparsity}
= \frac{\text{\# of reviews}}{\text{\# of users} \times \text{\# of movies}}
= \frac{\text{\# of reviews}}{\text{\# of potential reviews}}
\]
Reviews are Sparse
Only 0.02% of the reviews are available – 99.98% of the reviews are missing.
Code
display(Markdown(f'There are on average {n_reviews/n_movies:0.1f} reviews per movie'+f' and {n_reviews/n_users:0.1f} reviews per user'))
There are on average 33.9 reviews per movie and 13.7 reviews per user
Sparseness is Skewed
Although on average a movie receives 34 reviews, almost all movies have even fewer reviews.
Code
reviews_per_movie = df.groupby('ProductId').count()['Id'].valuesfrac_below_mean = np.sum(reviews_per_movie < (n_reviews/n_movies))/len(reviews_per_movie)plt.plot(sorted(reviews_per_movie, reverse=True), '.-')xmin, xmax, ymin, ymax = plt.axis()plt.hlines(n_reviews/n_movies, xmin, xmax, 'r', lw =3)plt.ylabel('Number of Ratings', fontsize =14)plt.xlabel('Movie', fontsize =14)plt.legend(['Number of Ratings', 'Average Number of Ratings'], fontsize =14)plt.title(f'Amazon Movie Reviews\nNumber of Ratings Per Movie\n'+f'{frac_below_mean:0.0%} of Movies Below Average', fontsize =16);
Likewise, although the average user writes 14 reviews, almost all users write even fewer reviews.
Code
reviews_per_user = df.groupby('UserId').count()['Id'].valuesfrac_below_mean = np.sum(reviews_per_user < (n_reviews/n_users))/len(reviews_per_user)plt.plot(sorted(reviews_per_user, reverse=True), '.-')xmin, xmax, ymin, ymax = plt.axis()plt.hlines(n_reviews/n_users, xmin, xmax, 'r', lw =3)plt.ylabel('Number of Ratings', fontsize =14)plt.xlabel('User', fontsize =14)plt.legend(['Number of Ratings', 'Average Number of Ratings'], fontsize =14)plt.title(f'Amazon Movie Reviews\nNumber of Ratings Per User\n'+f'{frac_below_mean:0.0%} of Users Below Average', fontsize =16);
Objective Function
Ultimately, our goal is to predict the rating that a user would give to an item.
For that, we need to define a loss or objective function.
A typical objective function is root mean square error (RMSE)
# Create new lists where only numbers are kept that are not np.nan in both listsfiltered_ratings_item_i = [rating_i for rating_i, rating_j inzip(ratings_item_i, ratings_item_j) ifnot np.isnan(rating_i) andnot np.isnan(rating_j)]filtered_ratings_item_j = [rating_j for rating_i, rating_j inzip(ratings_item_i, ratings_item_j) ifnot np.isnan(rating_i) andnot np.isnan(rating_j)]display(Markdown(f'Common ratings for item $i$: {filtered_ratings_item_i}'))display(Markdown(f'Common ratings for item $j$: {filtered_ratings_item_j}'))
Common ratings for item \(i\): [5, 5, 4, 4, 1]
Common ratings for item \(j\): [2, 5, 5, 3, 5]
Now we can compute the Pearson correlation coefficient:
The Pearson correlation coefficient, often denoted as \(r\), is a measure of the linear correlation between two variables \(X\) and \(Y\). It quantifies the degree to which a linear relationship exists between the variables. The value of \(r\) ranges from -1 to 1, where:
\(r = 1\) indicates a perfect positive linear relationship,
\(r = -1\) indicates a perfect negative linear relationship,
\(r = 0\) indicates no linear relationship.
The formula for the Pearson correlation coefficient is:
Where: - \(X_i\) and \(Y_i\) are the individual sample points, - \(\bar{X}\) and \(\bar{Y}\) are the means of the \(X\) and \(Y\) samples, respectively.
The Pearson correlation coefficient is sensitive to outliers and assumes that the relationship between the variables is linear and that the data is normally distributed.
Similarity for Binary Data
In some cases we will need to work with binary \(r_{ui}\).
For example, purchase histories on an e-commerce site, or clicks on an ad.
In this case, an appropriate replacement for Pearson \(r\) is the Jaccard similarity coefficient or Intersection over Union.
Here \(i \in R(u)\) means the set of items rated by user \(u\) and \(u \in R(i)\) means the set of users who have rated item \(i\) and \(|R(u)|\) is the number of ratings.
Now that we have learned the biases, we can do a better job of estimating correlation:
\(k\) is the rank of the factorization and dimensionality of the latent space.
The \((\cdot)_S\) notation means that we are only considering the subset of matrix entries that correspond to known reviews (the set \(S\)).
Note that as usual, we add \(\ell_2\) penalization to avoid overfitting (Ridge regression).
Once again, this problem is jointly convex in that it is convex in each of the variables \(U\) and \(V\).
In particular, if we hold either \(U\) or \(V\) constant, then the result is a simple ridge regression.
So one commonly used algorithm for this problem is called alternating least squares (ALS):
Hold \(U\) constant, and solve for \(V\)
Hold \(V\) constant, and solve for \(U\)
If not converged, go to Step 1.
The only thing we’ve left out at this point is how to deal with the missing entries of \(R\).
It’s not hard, but the details aren’t that interesting, so we’ll give you code instead!
ALS in Practice
The entire Amazon reviews dataset is too large to work with easily, and it is too sparse.
Hence, we will take the densest rows and columns of the matrix.
Code
# The densest columns: products with more than 50 reviewspids = df.groupby('ProductId').count()['Id']hi_pids = pids[pids >50].index# reviews that are for these productshi_pid_rec = [r in hi_pids for r in df['ProductId']]# the densest rows: users with more than 50 reviewsuids = df.groupby('UserId').count()['Id']hi_uids = uids[uids >50].index# reviews that are from these usershi_uid_rec = [r in hi_uids for r in df['UserId']]# The result is a list of booleans equal to the number of rewviews# that are from those dense users and moviesgoodrec = [a and b for a, b inzip(hi_uid_rec, hi_pid_rec)]
Now we create a \(\textnormal{UserID} \times \textnormal{ProductID}\) matrix from these reviews.
# Import local python package MF.pyimport recommender_MF as MF# Instantiate the model# We are pulling these hyperparameters out of the air -- that's not the right way to do it!RS = MF.als_MF(rank =20, lambda_ =1)
Code
%time pred, error = RS.fit_model(R)
CPU times: user 1min 7s, sys: 4.28 s, total: 1min 11s
Wall time: 46.2 s
Code
print(f'RMSE on visible entries (training data): {np.sqrt(error/R.count().sum()):0.3f}')
RMSE on visible entries (training data): 0.343
And we can look at the predicted ratings matrix and see that it is a dense matrix:
Code
pred
ProductId
0005019281
0005119367
0307142485
0307142493
0307514161
0310263662
0310274281
0718000315
0764001035
0764003828
...
B00IKM5OCO
B00IWULQQ2
B00J4LMHMK
B00J5JSV1W
B00JA3RPAG
B00JAQJMJ0
B00JBBJJ24
B00JKPHUE0
B00K2CHVJ4
B00L4IDS4W
UserId
A02755422E9NI29TCQ5W3
5.160166
5.485254
5.216541
4.882483
5.554725
5.410966
4.859719
4.288049
4.665462
4.909857
...
3.767078
5.087916
5.366999
2.512491
4.273543
5.009023
4.753608
5.142484
4.014472
4.968972
A100JCBNALJFAW
3.412105
3.744585
3.071588
2.812383
3.977737
4.068442
4.728968
3.034348
3.806690
2.558996
...
1.989275
3.774524
2.967807
1.506393
3.071853
2.692521
3.645560
0.405617
2.373988
2.260089
A10175AMUHOQC4
4.795672
5.008185
4.738192
4.490579
5.487747
4.674548
4.558869
4.391259
4.936988
3.871651
...
3.646807
4.867140
5.994464
1.694122
3.866068
4.729658
5.032468
3.978302
4.371713
5.093678
A103KNDW8GN92L
4.747010
5.052763
3.904867
4.863294
4.291550
4.345711
4.585568
4.012744
4.769008
5.532848
...
4.209616
4.569735
4.669492
2.669940
3.862243
3.676272
4.060489
3.466910
4.092287
3.587281
A106016KSI0YQ
4.387333
4.805163
3.812473
4.969698
4.862461
4.018651
4.095438
3.850731
3.346309
4.515175
...
3.847274
4.333298
3.809869
2.479568
3.771813
4.994529
5.102042
4.307609
3.593250
3.565460
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
AZUBX0AYYNTFF
4.038671
4.390536
3.875080
4.162316
4.323401
3.942158
4.114244
3.217396
3.116887
3.329173
...
2.532802
3.726081
2.662176
2.108711
2.764652
3.977901
4.191627
3.184184
2.412499
2.828415
AZXGPM8EKSHE9
2.433849
3.424032
3.272877
3.538645
2.119925
3.445304
2.732960
2.339688
3.750510
2.224514
...
0.882552
3.252535
2.276738
3.308828
3.159818
3.429537
3.044300
3.026746
2.043627
3.710099
AZXHK8IO25FL6
3.927449
4.271830
3.756855
3.237490
3.709682
2.173375
3.983762
3.087339
3.483337
3.409457
...
3.784209
3.612386
3.676185
1.538733
2.952155
4.562115
3.293394
2.424861
1.135759
2.891152
AZXR5HB99P936
4.464123
4.704419
3.649212
3.934679
5.211955
4.227345
4.553247
3.762468
3.163137
3.565466
...
3.115093
3.980199
4.028849
2.118895
3.105410
4.303757
4.241257
3.670543
3.001280
3.352112
AZZ4GD20C58ND
3.990167
4.346322
4.375355
3.453328
4.776714
4.549703
4.633570
3.620664
4.371348
2.872648
...
3.147763
4.447010
3.998075
2.779356
2.952121
4.786318
3.901312
2.024843
3.625450
3.043700
3677 rows × 7244 columns
Code
## todo: hold out test data, compute oos error# We create a mask of the known entries, then calculate the indices of the known# entries, then split that data into training and test sets.# Create a mask for the known entriesRN =~R.isnull()# Get the indices of the known entriesvisible = np.where(RN)# Split the data into training and test setsimport sklearn.model_selection as model_selectionX_train, X_test, Y_train, Y_test = model_selection.train_test_split(visible[0], visible[1], test_size =0.1)
Just for comparison’s sake, let’s check the performance of \(k\)-NN on this dataset.
Again, this is only on the training data – so overly optimistic for sure.
And note that this is a subset of the full dataset – the subset that is “easiest” to predict due to density.
Code
from sklearn.neighbors import KNeighborsClassifierfrom sklearn.metrics import mean_squared_error# Drop the columns that are not featuresX_train = good_df.drop(columns=['Id', 'ProductId', 'UserId', 'Text', 'Summary'])# The target is the scorey_train = good_df['Score']# Using k-NN on features HelpfulnessNumerator, HelpfulnessDenominator, Score, Timemodel = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)%time y_hat = model.predict(X_train)
CPU times: user 9.41 s, sys: 228 ms, total: 9.64 s
Wall time: 9.42 s
Code
print(f'RMSE on visible entries (test set): {mean_squared_error(y_train, y_hat, squared =False):0.3f}')
RMSE on visible entries (test set): 0.648
/opt/hostedtoolcache/Python/3.12.7/x64/lib/python3.12/site-packages/sklearn/metrics/_regression.py:492: FutureWarning: 'squared' is deprecated in version 1.4 and will be removed in 1.6. To calculate the root mean squared error, use the function'root_mean_squared_error'.
warnings.warn(
Assessing Matrix Factorization
Matrix Factorization per se is a good idea.
However, many of the improvements we’ve discussed for CF apply to MF as well.
To illustrate, we’ll look at some of the successive improvements used by the team that won the Netflix prize (“BellKor’s Pragmatic Chaos”).
When the prize was announced, the Netflix supplied solution achieved an RMSE of 0.951.
By the end of the competition (about 3 years), the winning team’s solution achieved RMSE of 0.856.
Let’s restate our MF objective in a way that will make things clearer:
Embeddings: Dense representations for categorical data.
Bottom MLP: Transforms dense continuous features.
Feature Interaction: Dot-product of embeddings and dense features.
Top MLP: Processes interactions and outputs probabilities.
Let’s look at each of these components in turn.
Embeddings
Embeddings: Map categorical inputs to latent factor space.
A learned embedding matrix \(W \in \mathbb{R}^{m \times d}\) for each category of input
One-hot vector \(e_i\) with \(i\text{-th}\) entry 1 and rest are 0s
Embedding of \(e_i\) is \(i\text{-th}\) row of \(W\), i.e., \(w_i^T = e_i^T W\)
We can also use weighted combination of multiple items with a multi-hot vector of weights \(a^T = [0, ..., a_{i_1}, ..., a_{i_k}, ..., 0]\).
The embedding of this multi-hot vector is then \(a^T W\).
PyTorch has a convenient way to do this using EmbeddingBag, which besides summing can combine embeddings via mean or max pooling.
Here’s an example with 5 embeddings of dimension 3:
Code
import torchimport torch.nn as nn# Example embedding matrix: 5 embeddings, each of dimension 3embedding_matrix = nn.EmbeddingBag(num_embeddings=5, embedding_dim=3, mode='mean')# Input: Indices into the embedding matrixinput_indices = torch.tensor([1, 2, 3, 4]) # Flat list of indicesoffsets = torch.tensor([0, 2]) # Start new bag at position 0 and 2 in input_indices# Forward passoutput = embedding_matrix(input_indices, offsets)print("Embedding Matrix:\n", embedding_matrix.weight)print("Output:\n", output)
The advantage of the DLRM architecture is that it can take continuous features as input such as the user’s age, time of day, etc.
There is a bottom MLP that transforms these dense features into a latent space of the same dimension \(d\).
DLRM Architecture
Optional Sparse Feature MLPs
Optionally, one can add MLPs to transform the sparse features as well.
DLRM Architecture
Feature Interactions
The 2nd order interactions are modeled via dot-products of all pairs from the collections of embedding vectors and processed dense features.
The results of the dot-product interactions are concatenated with the processed dense vectors.
DLRM Architecture
Top MLP
The concatenated vector is then passed to a final MLP and then to a sigmoid function to produce the final prediction (e.g., probability score of recommendation)
This entire model is trained end-to-end using standard deep learning techniques.
There are a number of concerns with the widespread use of recommender systems and personalization in society.
First, recommender systems are accused of creating filter bubbles.
A filter bubble is the tendency for recommender systems to limit the variety of information presented to the user.
The concern is that a user’s past expression of interests will guide the algorithm in continuing to provide “more of the same.”
This is believed to increase polarization in society, and to reinforce confirmation bias.
Second, recommender systems in modern usage are often tuned to maximize engagement.
In other words, the objective function of the system is not to present the user’s most favored content, but rather the content that will be most likely to keep the user on the site.
The incentive to maximize engagement arises on sites that are supported by advertising revenue.
More engagement time means more revenue for the site.
However, many studies have shown that sites that strive to maximize engagement do so in large part by guiding users toward extreme content:
content that is shocking,
or feeds conspiracy theories,
or presents extreme views on popular topics.
Given this tendency of modern recommender systems, for a third party to create “clickbait” content such as this, one of the easiest ways is to present false claims.
Methods for addressing these issues are being very actively studied at present.
Ways of addressing these issues can be:
via technology
via public policy
Recap and References
BU CS/CDS Research
You can read about some of the work done in Professor Mark Crovella’s group on this topic:
Introduction to recommender systems and their importance in modern society.
Explanation of collaborative filtering (CF) and its two main approaches: user-user similarity and item-item similarity.
Discussion on the challenges of recommender systems, including scalability and data sparsity.
Introduction to matrix factorization (MF) as an improvement over CF, using latent vectors and alternating least squares (ALS) for optimization.
Practical implementation of ALS for matrix factorization on a subset of Amazon movie reviews.
Review of Deep Learning Recommender Model (DLRM) architecture and its components.
Discussion on the societal impact of recommender systems, including filter bubbles and engagement maximization.
References
Koren, Yehuda. 2009. “Collaborative Filtering with Temporal Dynamics.” In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 42:447–56. 8. https://dl.acm.org/doi/10.1145/1557019.1557072.
Koren, Yehuda, Robert Bell, and Chris Volinsky. 2009. “Matrix Factorization Techniques for Recommender Systems.”Computer 42 (8): 30–37. https://ieeexplore.ieee.org/document/5197422.
Naumov, Maxim et al. 2019. “Deep Learning Recommendation Model for Personalization and Recommendation Systems.”arXiv Preprint arXiv:1906.00091, May. http://arxiv.org/abs/1906.00091.