Recommender Systems

Introduction

Today, we look at a topic that has become enormously important in society: recommender systems.

We will

  • Define recommender systems
  • Review the challenges they pose
  • Discuss two classic methods:
    • Collaborative Filtering
    • Matrix Factorization
  • And one modern method:
    • Deep Learning Recommender Model (DLRM)

This section draws heavily on

What are Recommender Systems?

The concept of recommender systems emerged in the late 1990s / early 2000s as social life moved online:

  • online purchasing and commerce
  • online discussions and ratings
  • social information sharing

In these systems the amount of content was exploding and users were having a hard time finding things they were interested in.

Users wanted recommendations.


Over time, the problem has only gotten worse.



An enormous need has emerged for systems to help sort through products, services, and content items.

This often goes by the term personalization.

Some examples:

  • Movie recommendation (Netflix ~6.5K movies and shows, YouTube ~14B videos)
  • Related product recommendation (Amazon ~600M products)
  • Web page ranking (Google >>100B pages)
  • Social content filtering (Facebook, Twitter)
  • Services (Airbnb, Uber, TripAdvisor)
  • News content recommendation (Apple News)
  • Priority inbox & spam filtering (Gmail)
  • Online dating (Match.com)

A more formal view:

  • User - requests content
  • Objects - instances of content
  • Context - device, location, time, history
  • Interface - browser, mobile

Inferring Preferences

Unfortunately, users generally have a hard time explaining what types of content they prefer.

Some early systems worked by interviewing users to ask what they liked. Those systems did not work very well.

A very interesting article about the earliest personalization systems is User Modeling via Stereotypes by Elaine Rich, dating from 1979.


Instead, modern systems work by capturing users’ opinions about specific items.

This can be done actively:

  • by asking the user to rate a movie, product, or experience,

or it can be done passively:

  • by noting which items a user chooses to purchase (for example).

Challenges

  • The biggest issue is scalability: typical data for this problem is huge.
    • Millions of objects
    • 100s of millions of users
  • Changing user base
  • Changing inventory (movies, stories, goods)
  • Available features
  • Imbalanced dataset
    • User activity / item reviews are power law distributed

This data is a subset of the data presented in: “From amateurs to connoisseurs: modeling the evolution of user expertise through online reviews,” by J. McAuley and J. Leskovec. WWW, 2013

Example

Let’s look at a dataset for testing recommender systems consisting of Amazon movie reviews:

We’ll download a compressed pickle file containing the data if it is not already present.

Code
# This is a 647 MB file, delete it after use
import gdown
url = "https://drive.google.com/uc?id=14GakA7oOjbQp7nxcGApI86WlP3GrYTZI"
pickle_output = "train.pkl.gz"

import os.path
if not os.path.exists(pickle_output):
    gdown.download(url, pickle_output)
Downloading...
From (original): https://drive.google.com/uc?id=14GakA7oOjbQp7nxcGApI86WlP3GrYTZI
From (redirected): https://drive.google.com/uc?id=14GakA7oOjbQp7nxcGApI86WlP3GrYTZI&confirm=t&uuid=3b451c12-93a3-4844-b5cd-18c09d6da4c1
To: /home/runner/work/DS701-Course-Notes/DS701-Course-Notes/ds701_book/train.pkl.gz
100%|██████████| 647M/647M [00:13<00:00, 47.2MB/s]

We’ll load the data into a pandas DataFrame.

Code
import gzip
import time
import pandas as pd

start_time = time.time()
with gzip.open(pickle_output, 'rb') as f:
    df = pd.read_pickle(f)
elapsed_time = time.time() - start_time
print(f"Elapsed read time: {elapsed_time:.2f} seconds")
Elapsed read time: 9.27 seconds

Run df.info() to see the column names and data types.

Code
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1697533 entries, 0 to 1697532
Data columns (total 9 columns):
 #   Column                  Dtype  
---  ------                  -----  
 0   Id                      int64  
 1   ProductId               object 
 2   UserId                  object 
 3   HelpfulnessNumerator    int64  
 4   HelpfulnessDenominator  int64  
 5   Score                   float64
 6   Time                    int64  
 7   Summary                 object 
 8   Text                    object 
dtypes: float64(1), int64(4), object(4)
memory usage: 116.6+ MB

Now we can count the number of users and movies:

Code
from IPython.display import display, Markdown

n_users = df["UserId"].unique().shape[0]
n_movies = df["ProductId"].unique().shape[0]
n_reviews = len(df)
display(Markdown(f'There are:\n'))
display(Markdown(f'* {n_reviews:,} reviews\n* {n_movies:,} movies\n* {n_users:,} users'))

display(Markdown(f'There are {n_users * n_movies:,} potential reviews, meaning sparsity of {(n_reviews/(n_users * n_movies)):0.4%}'))

There are:

  • 1,697,533 reviews
  • 50,052 movies
  • 123,960 users

There are 6,204,445,920 potential reviews, meaning sparsity of 0.0274%

where

\[ \text{sparsity} = \frac{\text{\# of reviews}}{\text{\# of users} \times \text{\# of movies}} = \frac{\text{\# of reviews}}{\text{\# of potential reviews}} \]

Reviews are Sparse

Only about 0.03% of the potential ratings are present – 99.97% of the ratings are missing.

Code
display(Markdown(f'There are on average {n_reviews/n_movies:0.1f} reviews per movie' +
     f' and {n_reviews/n_users:0.1f} reviews per user'))

There are on average 33.9 reviews per movie and 13.7 reviews per user

Sparseness is Skewed

Although on average a movie receives 34 reviews, almost all movies have even fewer reviews.

Code
import numpy as np
import matplotlib.pyplot as plt

reviews_per_movie = df.groupby('ProductId').count()['Id'].values
frac_below_mean = np.sum(reviews_per_movie < (n_reviews/n_movies))/len(reviews_per_movie)
plt.plot(sorted(reviews_per_movie, reverse=True), '.-')
xmin, xmax, ymin, ymax = plt.axis()
plt.hlines(n_reviews/n_movies, xmin, xmax, 'r', lw = 3)
plt.ylabel('Number of Ratings', fontsize = 14)
plt.xlabel('Movie', fontsize = 14)
plt.legend(['Number of Ratings', 'Average Number of Ratings'], fontsize = 14)
plt.title(f'Amazon Movie Reviews\nNumber of Ratings Per Movie\n' +
          f'{frac_below_mean:0.0%} of Movies Below Average', fontsize = 16);


Likewise, although the average user writes 14 reviews, almost all users write even fewer reviews.

Code
reviews_per_user = df.groupby('UserId').count()['Id'].values
frac_below_mean = np.sum(reviews_per_user < (n_reviews/n_users))/len(reviews_per_user)
plt.plot(sorted(reviews_per_user, reverse=True), '.-')
xmin, xmax, ymin, ymax = plt.axis()
plt.hlines(n_reviews/n_users, xmin, xmax, 'r', lw = 3)
plt.ylabel('Number of Ratings', fontsize = 14)
plt.xlabel('User', fontsize = 14)
plt.legend(['Number of Ratings', 'Average Number of Ratings'], fontsize = 14)
plt.title(f'Amazon Movie Reviews\nNumber of Ratings Per User\n' +
          f'{frac_below_mean:0.0%} of Users Below Average', fontsize = 16);

Objective Function

Ultimately, our goal is to predict the rating that a user would give to an item.

For that, we need to define a loss or objective function.

A typical objective function is root mean square error (RMSE)

\[ \text{RMSE} = \sqrt{\frac{1}{|S|} \sum_{(i,u)\in S} (\hat{r}_{ui} - r_{ui})^2}, \]

where

  • \(r_{ui}\) is the rating that user \(u\) gives to item \(i\), and
  • \(S\) is the set of \((i, u)\) pairs for which a rating is known.
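To make the metric concrete, here is a minimal sketch (the helper `rmse` is ours, not a library function) that computes RMSE over aligned arrays of predicted and actual ratings:

Code
import numpy as np

def rmse(predicted, actual):
    # RMSE over the observed (user, item) pairs in S,
    # passed here as two aligned 1-D arrays
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return np.sqrt(np.mean((predicted - actual) ** 2))

# Toy example: three observed ratings and their predictions
print(f'{rmse([3.5, 4.0, 2.0], [4, 4, 1]):0.3f}')  # 0.646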

OK, now we know the problem and the data available. How can we address the problem?

The earliest method developed is called collaborative filtering.

Collaborative Filtering

The central idea of collaborative filtering is that the set of known recommendations can be considered to be a bipartite graph.

The nodes of the bipartite graph are users (\(U\)) and items (\(V\)).

Each edge corresponds to a known rating \(r_{ui}.\)


Then recommendations are formed by traversing or processing the bipartite graph.

There are at least two ways this graph can be used.


Two ways to form a rating for item \((u, i)\):

  1. Using user-user similarity:
    • look at users that have similar item preferences to user \(u\)
    • look at how those users rated item \(i\)

Good for many users, fewer items.
(e.g., Netflix had ~280M subscribers, ~6.5K movies/shows)

  2. Using item-item similarity:
    • look at other items that have been liked by similar users as item \(i\)
    • look at how user \(u\) rated those items

Good for many items, fewer users
(e.g. Amazon had ~300M accounts, ~600M products)

Item-Item CF

For item-item similarity, we’ll look at item-item Collaborative Filtering (CF).

The questions are:

  • How do we judge “similarity” of items?
  • How do we form a predicted rating?

Here is another view of the ratings graph, this time as a matrix that includes missing entries:


Let’s say we want to predict the value of this unknown rating:


We’ll consider two other items, namely items 3 and 6 (for example).

Note that we are only interested in items that this user has rated.


We will discuss strategies for assessing similarity shortly.

How did we choose these two items?

We used \(k\)-nearest neighbors. Here \(k\) = 2.

For now, let’s just say we determine the similarities as:

\[ s_{13} = 0.2 \]

\[ s_{16} = 0.3 \]


These similarity scores tell us how much weight to put on the rating of the other items.

So we can form a prediction of \(\hat{r}_{15}\) as:

\[ \hat{r}_{15} = \frac{s_{13} \cdot r_{35} + s_{16} \cdot r_{65}}{s_{13} + s_{16}} = \frac{0.2 \cdot 2 + 0.3 \cdot 3}{0.2 + 0.3} = 2.6 \]
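To make the arithmetic concrete, here is the same similarity-weighted average in a few lines of Python (the values are just those from the example above):

Code
import numpy as np

# Similarities to the two neighboring items (s_13 and s_16)
similarities = np.array([0.2, 0.3])
# This user's ratings of those neighboring items (r_35 and r_65)
neighbor_ratings = np.array([2, 3])

# Similarity-weighted average of the neighbors' ratings
r_hat_15 = np.dot(similarities, neighbor_ratings) / similarities.sum()
print(f'{r_hat_15:0.1f}')  # 2.6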

Similarity

How should we assess similarity of items?

A reasonable approach is to consider items similar if their ratings are correlated, for example using the Pearson correlation coefficient \(r\).¹

However, note that two items will not have ratings in the same positions.

So we want to compute the correlation only over the users who rated both items.

Example

Let’s compute the Pearson correlation coefficient for the above two items:

Code
import numpy as np
from IPython.display import display, Markdown

ratings_item_i = [1, np.nan, np.nan, 5, 5, 3, np.nan, np.nan, np.nan, 4, 2, np.nan, np.nan, np.nan, np.nan, 4, np.nan, 5, 4, 1, np.nan]
ratings_item_j = [np.nan, np.nan, 4, 2, 5, np.nan, np.nan, 1, 2, 5, np.nan, np.nan, 2, np.nan, np.nan, 3, np.nan, np.nan, np.nan, 5, 4]

display(Markdown(f'Ratings for item $i$:\n\n{ratings_item_i}'))
display(Markdown(f'Ratings for item $j$:\n\n{ratings_item_j}'))

Ratings for item \(i\):

[1, nan, nan, 5, 5, 3, nan, nan, nan, 4, 2, nan, nan, nan, nan, 4, nan, 5, 4, 1, nan]

Ratings for item \(j\):

[nan, nan, 4, 2, 5, nan, nan, 1, 2, 5, nan, nan, 2, nan, nan, 3, nan, nan, nan, 5, 4]


Let’s drop the non-common ratings:

Code
# Keep only the positions where both items have a (non-NaN) rating
filtered_ratings_item_i = [rating_i for rating_i, rating_j in zip(ratings_item_i, ratings_item_j) if not np.isnan(rating_i) and not np.isnan(rating_j)]
filtered_ratings_item_j = [rating_j for rating_i, rating_j in zip(ratings_item_i, ratings_item_j) if not np.isnan(rating_i) and not np.isnan(rating_j)]

display(Markdown(f'Common ratings for item $i$: {filtered_ratings_item_i}'))
display(Markdown(f'Common ratings for item $j$: {filtered_ratings_item_j}'))

Common ratings for item \(i\): [5, 5, 4, 4, 1]

Common ratings for item \(j\): [2, 5, 5, 3, 5]

Now we can compute the Pearson correlation coefficient:

Code
display(Markdown(f'Pearson correlation coefficient: {np.corrcoef(filtered_ratings_item_i, filtered_ratings_item_j)[0, 1]:0.2f}'))

Pearson correlation coefficient: -0.43

Pearson Correlation Coefficient

The Pearson correlation coefficient, often denoted as \(r\), is a measure of the linear correlation between two variables \(X\) and \(Y\). It quantifies the degree to which a linear relationship exists between the variables. The value of \(r\) ranges from -1 to 1, where:

  • \(r = 1\) indicates a perfect positive linear relationship,
  • \(r = -1\) indicates a perfect negative linear relationship,
  • \(r = 0\) indicates no linear relationship.

The formula for the Pearson correlation coefficient is:

\[ r = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum (X_i - \bar{X})^2 \sum (Y_i - \bar{Y})^2}} \]

where:

  • \(X_i\) and \(Y_i\) are the individual sample points, and
  • \(\bar{X}\) and \(\bar{Y}\) are the means of the \(X\) and \(Y\) samples, respectively.

The Pearson correlation coefficient is sensitive to outliers and assumes that the relationship between the variables is linear and that the data is normally distributed.

Similarity for Binary Data

In some cases we will need to work with binary \(r_{ui}\).

For example, purchase histories on an e-commerce site, or clicks on an ad.

In this case, an appropriate replacement for Pearson \(r\) is the Jaccard similarity coefficient or Intersection over Union.

\[ J_{Sim}(\mathbf{x}, \mathbf{y}) = \frac{|\mathbf{x} \cap \mathbf{y}|}{|\mathbf{x} \cup \mathbf{y}|}. \]

See the lecture on similarity measures.
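As a minimal sketch (the helper `jaccard_sim` is ours), here is Jaccard similarity computed for two binary interaction vectors over the same set of users:

Code
import numpy as np

def jaccard_sim(x, y):
    # Jaccard similarity of two binary (0/1) interaction vectors
    x, y = np.asarray(x, dtype=bool), np.asarray(y, dtype=bool)
    union = np.logical_or(x, y).sum()
    return np.logical_and(x, y).sum() / union if union > 0 else 0.0

# Two items' purchase indicators over the same six users
print(jaccard_sim([1, 0, 1, 1, 0, 0], [1, 1, 1, 0, 0, 0]))  # 2/4 = 0.5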

Improving CF in the presence of bias

One problem with the story so far arises due to bias.

  • Some items are significantly higher or lower rated
  • Some users rate substantially higher or lower in general

These properties interfere with similarity assessment.

Bias correction is crucial for CF recommender systems.

We need to include

  • Per-user offset
  • Per-item offset
  • Global offset

Representing biases

Hence we form a baseline estimate for each rating:

\[ b_{ui} = \mu + \alpha_u + \beta_i \]

where

  • \(\mu\) is the global average rating across all items and users.
  • \(\alpha_u\) is the offset of user \(u\) and
  • \(\beta_i\) is the offset of item \(i\).

If we gather all these elements together we can form:

  • \(\boldsymbol{\alpha}\) an \(n\times 1\) vector of per-user offsets, and
  • \(\boldsymbol{\beta}\) an \(m\times 1\) vector of per-item offsets.

Estimating biases

How can we estimate the parameters \(\boldsymbol{\alpha}\), \(\boldsymbol{\beta}\), and \(\mu\)?

Let’s assume for a minute that we had a fully-dense matrix of ratings \(R\).

Each of the \(m\) rows of \(R\) represents an item and each of the \(n\) columns a user.

Then what we want to estimate is

\[ \min_{\alpha,\beta,\mu} \Vert R - (\mathbf{1}\alpha^T + \beta\mathbf{1}^T + \mu 1)\Vert^2 + \lambda(\Vert\alpha\Vert^2 + \Vert\beta\Vert^2). \]

Here, bold-faced \(\mathbf{1}\) represents appropriately sized vectors of ones, and non-boldfaced \(1\) is an \(m\times n\) matrix of ones.


While this is not a simple ordinary least squares problem, there is a strategy for solving it.

Assume we hold \(\beta\mathbf{1}^T + \mu 1\) constant, absorbing it into \(\tilde{R} = R - \beta\mathbf{1}^T - \mu 1\).

Then the remaining problem is

\[ \min_{\alpha} \Vert \tilde{R} - \mathbf{1}\alpha^T \Vert^2 + \lambda \Vert\alpha\Vert^2, \]

which (for each column of \(\tilde{R}\)) is a standard regularized least squares problem, solved via Ridge regression.


This problem is convex in each of the variables \(\alpha\), \(\beta\), and \(\mu\) when the others are held fixed, so we can optimize one block of variables at a time.

The strategy for solving is:

  1. Hold \(\alpha\) and \(\beta\) constant, solve for \(\mu\).
  2. Hold \(\alpha\) and \(\mu\) constant, solve for \(\beta\).
  3. Hold \(\beta\) and \(\mu\) constant, solve for \(\alpha\).

Each of the three steps will reduce the overall error. As a result, we iterate over them until convergence.


The last issue is that the matrix \(R\) is not dense - in reality we only have a small subset of its entries.

We simply need to adapt the least-squares solution to only consider the entries in \(R\) that we know.


As a result, the actual calculation is:

Step 1:

\[ \mu = \frac{\sum_{(u, i) \in R} (r_{ui} - \alpha_u - \beta_i)}{|R|} \]

Step 2:

\[ \alpha_u = \frac{\sum_{i \in R(u)}(r_{ui} - \mu - \beta_i)}{\lambda + |R(u)|} \]

Step 3:

\[ \beta_i = \frac{\sum_{u \in R(i)}(r_{ui} - \mu - \alpha_u)}{\lambda + |R(i)|} \]

Step 4: If not converged, go to Step 1.

Here \(R(u)\) denotes the set of items rated by user \(u\), \(R(i)\) the set of users who have rated item \(i\), and \(|R(u)|\), \(|R(i)|\) the corresponding numbers of ratings.
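Here is a minimal sketch of these alternating updates, assuming the observed ratings are stored in a dict mapping (user, item) pairs to ratings (the function `fit_biases` and its input format are ours, not from a library):

Code
import numpy as np

def fit_biases(ratings, lam=5.0, n_iters=20):
    # ratings: dict mapping (user, item) -> observed rating
    users = {u for u, _ in ratings}
    items = {i for _, i in ratings}
    alpha = {u: 0.0 for u in users}
    beta = {i: 0.0 for i in items}
    mu = 0.0
    for _ in range(n_iters):
        # Step 1: global offset, given current alpha and beta
        mu = np.mean([r - alpha[u] - beta[i] for (u, i), r in ratings.items()])
        # Step 2: per-user offsets, regularized by lam
        for u in users:
            resid = [r - mu - beta[i] for (v, i), r in ratings.items() if v == u]
            alpha[u] = sum(resid) / (lam + len(resid))
        # Step 3: per-item offsets, regularized by lam
        for i in items:
            resid = [r - mu - alpha[u] for (u, j), r in ratings.items() if j == i]
            beta[i] = sum(resid) / (lam + len(resid))
    return mu, alpha, beta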


Now that we have learned the biases, we can do a better job of estimating correlation:

\[ \hat{\rho}_{ij} = \frac{\sum_{u\in U(i,j)}(r_{ui} - b_{ui})(r_{uj}-b_{uj})} {\sqrt{\sum_{u\in U(i,j)}(r_{ui} - b_{ui})^2\sum_{u\in U(i,j)}(r_{uj}-b_{uj})^2}}, \]

where

  • \(b_{ui} = \mu + \alpha_u + \beta_i\), and
  • \(U(i,j)\) are the users who have rated both \(i\) and \(j\).

And using biases we can also do a better job of estimating ratings:

\[ \hat{r}_{ui} = b_{ui} + \frac{\sum_{j \in n_k(i, u)} s_{ij}(r_{uj} - b_{uj})}{\sum_{j \in n_k(i, u)} s_{ij}}, \]

where

  • \(b_{ui} = \mu + \alpha_u + \beta_i\),
  • \(n_k(i, u)\) are the \(k\) nearest neighbors to \(i\) that were rated by user \(u\) and
  • \(s_{ij}\) is the similarity between items \(i\) and \(j\), estimated as above.
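A minimal sketch of this bias-corrected prediction, reusing the biases from the `fit_biases` sketch above (the `neighbors` argument, a list of (j, s_ij, r_uj) triples for the k nearest rated neighbors, is a hypothetical input format, not a library API):

Code
def predict_rating(u, i, mu, alpha, beta, neighbors):
    # neighbors: list of (j, s_ij, r_uj) triples for the k nearest
    # neighbors of item i that user u has rated
    b_ui = mu + alpha[u] + beta[i]
    num = sum(s_ij * (r_uj - (mu + alpha[u] + beta[j]))
              for j, s_ij, r_uj in neighbors)
    den = sum(s_ij for _, s_ij, _ in neighbors)
    return b_ui + num / den if den > 0 else b_ui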

Assessing CF

This completes the high level view of CF.

Working with user-user similarities is analogous.

Strengths:

  • Essentially no training.
    • The reliance on \(k\)-nearest neighbors helps in this respect.
  • Easy to update with new users, items, and ratings.
  • Explainable:
    • “We recommend Minority Report because you liked Blade Runner and Total Recall.”

Weaknesses:

  • Accuracy can be a problem – resulting in poor recommendations
  • Scalability can be a problem – the computation grows with the number of users and items (think \(k\)-NN search)

Matrix Factorization

Note that standard CF forces us to consider similarity among items, or among users, but does not take into account both.

Can we use both kinds of similarity simultaneously?

We can’t use both the rows and columns of the ratings matrix \(R\) at the same time – the user and item vectors live in different vector spaces.


What we could try to do is find a single vector space in which we represent both users and items, along with a similarity function, such that:

  • users who have similar item ratings are similar in the vector space
  • items that have similar user ratings are similar in the vector space
  • when a given user highly rates a given item, that user and item are similar in the vector space.

We saw this idea previously, in an SVD lecture.

This new vector space is called a latent space, and the user and item representations are called latent vectors.

This notion of a shared latent space is also central to deep learning recommender approaches (Naumov et al. 2019) we will look at later.


Now, however, we are working with a matrix which is only partially observed. That is, we only know some of the entries in the ratings matrix.

Nonetheless, we can imagine a situation like this:

where we decompose the ratings matrix \(R\) into two matrices.

We want the product of the two matrices to be as close as possible to the known values of the ratings matrix.


What this setup implies is that our similarity function is the inner product.

Which means that to predict an unknown rating, we take the inner product of latent vectors:

Taking, for example, the 2nd row of “items” and the 5th row of “users”…


We have

\[ (-0.5 \cdot -2)+(0.6 \cdot 0.3)+(0.5 \cdot 2.4) = 2.38, \]

so:

Solving Matrix Factorization

Notice that in this case we’ve decided that the factorization should be rank 3, i.e., low-rank.

So we want something like an SVD.

(Recall that SVD gives us the most-accurate-possible low-rank factorization of a matrix).

However, we can’t use the SVD algorithm directly, because we don’t know all the entries in \(R\).

Indeed, the unseen entries in \(R\) are exactly what we want to predict.


Here is what we want to solve:

\[ \min_{U,V} \Vert (R - UV^T)_S\Vert^2 + \lambda(\Vert U\Vert^2 + \Vert V\Vert^2) \]

where:

  • \(R\) is \(m\times n\),
  • \(U\) is the \(m\times k\) items matrix,
  • \(V\) is the \(n\times k\) users matrix and
  • \(k\) is the rank of the factorization and dimensionality of the latent space.

The \((\cdot)_S\) notation means that we are only considering the subset of matrix entries that correspond to known reviews (the set \(S\)).

Note that as usual, we add \(\ell_2\) penalization to avoid overfitting (Ridge regression).


Unlike the bias-estimation problem, this problem is not jointly convex in \(U\) and \(V\); it is only convex in each variable separately.

In particular, if we hold either \(U\) or \(V\) constant, then the result is a simple ridge regression.


So one commonly used algorithm for this problem is called alternating least squares (ALS):

  1. Hold \(U\) constant, and solve for \(V\)
  2. Hold \(V\) constant, and solve for \(U\)
  3. If not converged, go to Step 1.

The only thing we’ve left out at this point is how to deal with the missing entries of \(R\).

It’s not hard, but the details aren’t that interesting, so we’ll give you code instead!
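Still, for intuition, here is a minimal sketch of masked ALS on a small dense array (ours, not the library code used below), where `mask` is True exactly at the known entries:

Code
import numpy as np

def als(R, mask, k=3, lam=0.1, n_iters=20, seed=0):
    # R: m x n ratings array; mask: True where r_ui is known
    rng = np.random.default_rng(seed)
    m, n = R.shape
    U = rng.normal(scale=0.1, size=(m, k))   # item factors
    V = rng.normal(scale=0.1, size=(n, k))   # user factors
    I = np.eye(k)
    for _ in range(n_iters):
        # Each row of U solves a small ridge regression over
        # only the users who rated that item...
        for i in range(m):
            cols = mask[i, :]
            if cols.any():
                Vi = V[cols]
                U[i] = np.linalg.solve(Vi.T @ Vi + lam * I, Vi.T @ R[i, cols])
        # ...and each row of V over only the items that user rated
        for u in range(n):
            rows = mask[:, u]
            if rows.any():
                Uu = U[rows]
                V[u] = np.linalg.solve(Uu.T @ Uu + lam * I, Uu.T @ R[rows, u])
    return U, V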

ALS in Practice

The entire Amazon reviews dataset is too large to work with easily, and it is too sparse.

Hence, we will take the densest rows and columns of the matrix.

Code
# The densest columns: products with more than 50 reviews
pids = df.groupby('ProductId').count()['Id']
hi_pids = pids[pids > 50].index

# reviews that are for these products
hi_pid_rec = df['ProductId'].isin(hi_pids)

# the densest rows: users with more than 50 reviews
uids = df.groupby('UserId').count()['Id']
hi_uids = uids[uids > 50].index

# reviews that are from these users
hi_uid_rec = df['UserId'].isin(hi_uids)

# The result is a boolean mask, one entry per review, selecting reviews
# that are from those dense users and for those dense movies
goodrec = hi_uid_rec & hi_pid_rec

Now we create a \(\textnormal{UserID} \times \textnormal{ProductID}\) matrix from these reviews.

Missing entries will be filled with NaNs.

Code
dense_df = df.loc[goodrec]
good_df = dense_df.loc[~dense_df['Score'].isnull()]
R = good_df.pivot_table(columns = 'ProductId', index = 'UserId', values = 'Score')

And we can look at a small part of the matrix:

Code
R.iloc[900:910, 1000:1010]
ProductId 1417024917 1417030321 1417030976 1417054069 1417065818 1417068329 1417070471 1419802356 1419807587 1419815644
UserId
A1WLZYEOIL1HLT NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
A1WNJVA59HLMO5 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
A1WR12AC35R3K6 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
A1WSFHRBY2ZD1R NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
A1WUMTJOASEL5F NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
A1WVA7V02PQOY6 NaN NaN NaN NaN 5.0 NaN NaN NaN NaN NaN
A1WX42M589VAMQ NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
A1X054KUYG5V NaN 3.0 NaN NaN NaN 4.0 NaN NaN NaN NaN
A1X12F0KC8GY6M 4.0 5.0 NaN NaN NaN NaN NaN NaN NaN NaN
A1X15AQVSCKKRG NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

We’ll use code from the Antidote Data Framework to do the matrix factorization and ALS. We have local copies recommender_MF.py, recommender_als.py and recommender_lmafit.py in our repository.

Code
# Import local python package MF.py
import recommender_MF as MF

# Instantiate the model
# We are pulling these hyperparameters out of the air -- that's not the right way to do it!
RS = MF.als_MF(rank = 20, lambda_ = 1)
Code
%time pred, error = RS.fit_model(R)
CPU times: user 1min 7s, sys: 4.28 s, total: 1min 11s
Wall time: 46.2 s
Code
print(f'RMSE on visible entries (training data): {np.sqrt(error/R.count().sum()):0.3f}')
RMSE on visible entries (training data): 0.343

And we can look at the predicted ratings matrix and see that it is a dense matrix:

Code
pred
ProductId 0005019281 0005119367 0307142485 0307142493 0307514161 0310263662 0310274281 0718000315 0764001035 0764003828 ... B00IKM5OCO B00IWULQQ2 B00J4LMHMK B00J5JSV1W B00JA3RPAG B00JAQJMJ0 B00JBBJJ24 B00JKPHUE0 B00K2CHVJ4 B00L4IDS4W
UserId
A02755422E9NI29TCQ5W3 5.160166 5.485254 5.216541 4.882483 5.554725 5.410966 4.859719 4.288049 4.665462 4.909857 ... 3.767078 5.087916 5.366999 2.512491 4.273543 5.009023 4.753608 5.142484 4.014472 4.968972
A100JCBNALJFAW 3.412105 3.744585 3.071588 2.812383 3.977737 4.068442 4.728968 3.034348 3.806690 2.558996 ... 1.989275 3.774524 2.967807 1.506393 3.071853 2.692521 3.645560 0.405617 2.373988 2.260089
A10175AMUHOQC4 4.795672 5.008185 4.738192 4.490579 5.487747 4.674548 4.558869 4.391259 4.936988 3.871651 ... 3.646807 4.867140 5.994464 1.694122 3.866068 4.729658 5.032468 3.978302 4.371713 5.093678
A103KNDW8GN92L 4.747010 5.052763 3.904867 4.863294 4.291550 4.345711 4.585568 4.012744 4.769008 5.532848 ... 4.209616 4.569735 4.669492 2.669940 3.862243 3.676272 4.060489 3.466910 4.092287 3.587281
A106016KSI0YQ 4.387333 4.805163 3.812473 4.969698 4.862461 4.018651 4.095438 3.850731 3.346309 4.515175 ... 3.847274 4.333298 3.809869 2.479568 3.771813 4.994529 5.102042 4.307609 3.593250 3.565460
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
AZUBX0AYYNTFF 4.038671 4.390536 3.875080 4.162316 4.323401 3.942158 4.114244 3.217396 3.116887 3.329173 ... 2.532802 3.726081 2.662176 2.108711 2.764652 3.977901 4.191627 3.184184 2.412499 2.828415
AZXGPM8EKSHE9 2.433849 3.424032 3.272877 3.538645 2.119925 3.445304 2.732960 2.339688 3.750510 2.224514 ... 0.882552 3.252535 2.276738 3.308828 3.159818 3.429537 3.044300 3.026746 2.043627 3.710099
AZXHK8IO25FL6 3.927449 4.271830 3.756855 3.237490 3.709682 2.173375 3.983762 3.087339 3.483337 3.409457 ... 3.784209 3.612386 3.676185 1.538733 2.952155 4.562115 3.293394 2.424861 1.135759 2.891152
AZXR5HB99P936 4.464123 4.704419 3.649212 3.934679 5.211955 4.227345 4.553247 3.762468 3.163137 3.565466 ... 3.115093 3.980199 4.028849 2.118895 3.105410 4.303757 4.241257 3.670543 3.001280 3.352112
AZZ4GD20C58ND 3.990167 4.346322 4.375355 3.453328 4.776714 4.549703 4.633570 3.620664 4.371348 2.872648 ... 3.147763 4.447010 3.998075 2.779356 2.952121 4.786318 3.901312 2.024843 3.625450 3.043700

3677 rows × 7244 columns


Code
## todo: hold out test data, compute oos error

# Create a mask of the known entries, get the (row, column) indices of
# the known entries, then split those index pairs into training and
# test sets.

# Create a mask for the known entries
RN = ~R.isnull()

# Get the indices of the known entries
visible = np.where(RN)

# Split the known-entry index pairs into training and test sets
import sklearn.model_selection as model_selection
rows_train, rows_test, cols_train, cols_test = model_selection.train_test_split(
    visible[0], visible[1], test_size = 0.1)

Just for comparison’s sake, let’s check the performance of \(k\)-NN on this dataset.

Again, this is only on the training data – so overly optimistic for sure.

And note that this is a subset of the full dataset – the subset that is “easiest” to predict due to density.

Code
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import root_mean_squared_error

# Drop the columns that are not features -- including the target 'Score',
# which must not leak into the feature matrix
X_train = good_df.drop(columns=['Id', 'ProductId', 'UserId', 'Text', 'Summary', 'Score'])

# The target is the score
y_train = good_df['Score']

# Using k-NN on features HelpfulnessNumerator, HelpfulnessDenominator, Time
model = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
%time y_hat = model.predict(X_train)
%time y_hat = model.predict(X_train)
CPU times: user 9.41 s, sys: 228 ms, total: 9.64 s
Wall time: 9.42 s
Code
print(f'RMSE on visible entries (training data): {root_mean_squared_error(y_train, y_hat):0.3f}')
RMSE on visible entries (training data): 0.648

Assessing Matrix Factorization

Matrix Factorization per se is a good idea.
Moreover, many of the improvements we’ve discussed for CF apply to MF as well.

To illustrate, we’ll look at some of the successive improvements used by the team that won the Netflix prize (“BellKor’s Pragmatic Chaos”).

When the prize was announced, Netflix’s own system (Cinematch) achieved an RMSE of 0.951.

By the end of the competition (about 3 years), the winning team’s solution achieved RMSE of 0.856.

Let’s restate our MF objective in a way that will make things clearer:

\[ \min_{U, V} \sum_{(u, i)\in S}(r_{ui} - u_u^Tv_i)^2 + \lambda(\Vert U\Vert^2 + \Vert V\Vert^2) \]

where we have replaced the masked matrix norm with an explicit sum over the observed pairs in \(S\).

1. Adding Biases

If we add biases: \[ \min_{U, V} \sum_{(u, i)\in S}(r_{ui} - (\mu + \alpha_u + \beta_i + u_u^Tv_i))^2 + \lambda(\Vert U\Vert^2 + \Vert V\Vert^2 + \Vert \alpha\Vert^2 + \Vert \beta \Vert^2) \]

we see improvements in accuracy:

Figure 2: Matrix factorization models’ accuracy. The plots show the root-mean-square error of each of four individual factor models (lower is better). Accuracy improves when the factor model’s dimensionality (denoted by numbers on the charts) increases. In addition, the more refined factor models, whose descriptions involve more distinct sets of parameters, are more accurate.

2. Who Rated What?

In reality, ratings are not provided at random.

Take note of which users rated the same movies (ala CF) and use this information.

3. Ratings Change Over Time

Older movies tend to get higher ratings!


If we add time-varying biases:

\[ \min_{U, V} \sum_{(u, i)\in S}(r_{ui} - (\mu + \alpha_u(t) + \beta_i(t) + u_u^Tv_i(t)))^2 + \lambda(\Vert U\Vert^2 + \Vert V\Vert^2 + \Vert \alpha\Vert^2 + \Vert \beta \Vert^2) \]

we see further improvements in accuracy:

To estimate these billions of parameters, we cannot use alternating least squares or any linear algebraic method.

We need to use gradient descent (which we covered previously).

Deep Learning for Recommender Systems

Besides the Collaborative Filtering and Matrix Factorization models, other popular approaches to building recommender systems use Deep Learning.

We’ll look at the Deep Learning Recommender Model (DLRM) proposed by Facebook in 2019 (Naumov et al. 2019), which has an accompanying GitHub repository.

DLRM Architecture

  • Components (Figure 3):
    1. Embeddings: Dense representations for categorical data.
    2. Bottom MLP: Transforms dense continuous features.
    3. Feature Interaction: Dot-product of embeddings and dense features.
    4. Top MLP: Processes interactions and outputs probabilities.

Let’s look at each of these components in turn.

Embeddings

Embeddings: Map categorical inputs to latent factor space.

  • A learned embedding matrix \(W \in \mathbb{R}^{m \times d}\) for each category of input
  • One-hot vector \(e_i\) with \(i\text{-th}\) entry 1 and rest are 0s
  • Embedding of \(e_i\) is \(i\text{-th}\) row of \(W\), i.e., \(w_i^T = e_i^T W\)

We can also use a weighted combination of multiple items with a multi-hot vector of weights \(a^T = [0, ..., a_{i_1}, ..., a_{i_k}, ..., 0]\).

The embedding of this multi-hot vector is then \(a^T W\).
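As a quick sketch of these lookups (the tensor names are ours), note that multiplying a one-hot or multi-hot vector by \(W\) is just row selection or a weighted sum of rows:

Code
import torch

torch.manual_seed(0)
W = torch.randn(5, 3)                 # embedding table: 5 categories, d = 3

# One-hot lookup: e_i^T W selects row i of W
e2 = torch.zeros(5)
e2[2] = 1.0
print(torch.allclose(e2 @ W, W[2]))   # True

# Multi-hot weighted combination: a^T W is a weighted sum of rows
a = torch.tensor([0.0, 0.5, 0.0, 0.5, 0.0])
print(torch.allclose(a @ W, 0.5 * W[1] + 0.5 * W[3]))  # True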

DLRM Architecture

PyTorch has a convenient way to do this using EmbeddingBag, which besides summing can combine embeddings via mean or max pooling.

Here’s an example with 5 embeddings of dimension 3:

Code
import torch
import torch.nn as nn

# Example embedding matrix: 5 embeddings, each of dimension 3
embedding_matrix = nn.EmbeddingBag(num_embeddings=5, embedding_dim=3, mode='mean')

# Input: Indices into the embedding matrix
input_indices = torch.tensor([1, 2, 3, 4])  # Flat list of indices
offsets = torch.tensor([0, 2])  # Start new bag at position 0 and 2 in input_indices

# Forward pass
output = embedding_matrix(input_indices, offsets)

print("Embedding Matrix:\n", embedding_matrix.weight)
print("Output:\n", output)
Embedding Matrix:
 Parameter containing:
tensor([[-1.0676, -0.5768,  0.1240],
        [-0.5036,  0.7992,  1.5648],
        [ 0.0703, -0.7911, -1.2000],
        [ 0.9630,  0.4617,  0.2724],
        [ 0.2938,  1.2356, -0.8772]], requires_grad=True)
Output:
 tensor([[-0.2166,  0.0040,  0.1824],
        [ 0.6284,  0.8487, -0.3024]], grad_fn=<EmbeddingBagBackward0>)

Dense Features

The advantage of the DLRM architecture is that it can take continuous features as input such as the user’s age, time of day, etc.

There is a bottom MLP that transforms these dense features into a latent space of the same dimension \(d\).

DLRM Architecture

Optional Sparse Feature MLPs

Optionally, one can add MLPs to transform the sparse features as well.

DLRM Architecture

Feature Interactions

The 2nd-order interactions are modeled via dot-products of all pairs drawn from the collection of embedding vectors and the processed dense feature vector.

The results of the dot-product interactions are concatenated with the processed dense vectors.
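Here is a minimal sketch of this interaction step for a single example (the function `interact` is ours, a simplification of DLRM's interaction operation):

Code
import torch

def interact(dense_z, embeddings):
    # dense_z: (d,) output of the bottom MLP
    # embeddings: list of (d,) embedding vectors
    vectors = torch.stack([dense_z] + embeddings)   # (F, d)
    dots = vectors @ vectors.T                      # all pairwise dot products
    iu = torch.triu_indices(len(vectors), len(vectors), offset=1)
    pairwise = dots[iu[0], iu[1]]                   # upper triangle: distinct pairs
    return torch.cat([dense_z, pairwise])           # input to the top MLP

d = 4
z = torch.randn(d)
embs = [torch.randn(d) for _ in range(3)]
print(interact(z, embs).shape)  # d + F*(F-1)/2 = 4 + 6 -> torch.Size([10])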

DLRM Architecture

Top MLP

The concatenated vector is then passed to a final MLP and then to a sigmoid function to produce the final prediction (e.g., a probability score for a recommendation).

This entire model is trained end-to-end using standard deep learning techniques.

DLRM Architecture

Training Results

Figure 4: DLRM Training Results

Figure 4 shows the training (solid) and validation (dashed) accuracies of DLRM on the Criteo Ad Kaggle dataset.

Accuracy is compared with Deep and Cross network (DCN) (Wang et al. 2017).

Other Modern Approaches

There are many other modern approaches to recommender systems, for example:

  1. Graph-Based Recommender Systems:
    • Leverage graph structures to capture relationships between users and items.
    • Use techniques like Graph Neural Networks (GNNs) to enhance recommendation accuracy.
  2. Context-Aware Recommender Systems:
    • Incorporate contextual information such as time, location, and user mood to provide more personalized recommendations.
    • Contextual data can be integrated using various machine learning models.
  3. Hybrid Recommender Systems:
    • Combine multiple recommendation techniques, such as collaborative filtering and content-based filtering, to improve performance.
    • Aim to leverage the strengths of different methods while mitigating their weaknesses.
  4. Reinforcement Learning-Based Recommender Systems:
    • Use reinforcement learning to optimize long-term user engagement and satisfaction.
    • Models learn to make sequential recommendations by interacting with users and receiving feedback.

These approaches often leverage advancements in machine learning and data processing to provide more accurate and personalized recommendations.

See (Ricci, Rokach, and Shapira 2022) for a comprehensive overview of recommender systems.

Assessing Recommender Systems

There are a number of concerns with the widespread use of recommender systems and personalization in society.

First, recommender systems are accused of creating filter bubbles.

A filter bubble is the tendency for recommender systems to limit the variety of information presented to the user.

The concern is that a user’s past expression of interests will guide the algorithm in continuing to provide “more of the same.”

This is believed to increase polarization in society, and to reinforce confirmation bias.


Second, recommender systems in modern usage are often tuned to maximize engagement.

In other words, the objective function of the system is not to present the user’s most favored content, but rather the content that will be most likely to keep the user on the site.

The incentive to maximize engagement arises on sites that are supported by advertising revenue.

More engagement time means more revenue for the site.


However, many studies have shown that sites that strive to maximize engagement do so in large part by guiding users toward extreme content:

  • content that is shocking,
  • or feeds conspiracy theories,
  • or presents extreme views on popular topics.

Given this tendency of modern recommender systems, one of the easiest ways for a third party to create such engaging “clickbait” content is to present false claims.

Methods for addressing these issues are being very actively studied at present.

Approaches to addressing these issues include:

  • via technology
  • via public policy

Recap and References

BU CS/CDS Research

You can read about some of the work done in Professor Mark Crovella’s group on this topic in (Rastegarpanah, Gummadi, and Crovella 2019) and (Spinelli and Crovella 2017, 2020).

Recap

  • Introduction to recommender systems and their importance in modern society.
  • Explanation of collaborative filtering (CF) and its two main approaches: user-user similarity and item-item similarity.
  • Discussion on the challenges of recommender systems, including scalability and data sparsity.
  • Introduction to matrix factorization (MF) as an improvement over CF, using latent vectors and alternating least squares (ALS) for optimization.
  • Practical implementation of ALS for matrix factorization on a subset of Amazon movie reviews.
  • Review of Deep Learning Recommender Model (DLRM) architecture and its components.
  • Discussion on the societal impact of recommender systems, including filter bubbles and engagement maximization.

References

Koren, Yehuda. 2009. “Collaborative Filtering with Temporal Dynamics.” In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 447–56. https://dl.acm.org/doi/10.1145/1557019.1557072.
Koren, Yehuda, Robert Bell, and Chris Volinsky. 2009. “Matrix Factorization Techniques for Recommender Systems.” Computer 42 (8): 30–37. https://ieeexplore.ieee.org/document/5197422.
Naumov, Maxim et al. 2019. “Deep Learning Recommendation Model for Personalization and Recommendation Systems.” arXiv Preprint arXiv:1906.00091, May. http://arxiv.org/abs/1906.00091.
Rastegarpanah, Bashir, Krishna P. Gummadi, and Mark Crovella. 2019. “Fighting Fire with Fire: Using Antidote Data to Improve Polarization and Fairness of Recommender Systems.” In Proceedings of WSDM. http://www.cs.bu.edu/faculty/crovella/paper-archive/wsdm19-antidote-data.pdf.
Ricci, Francesco, Lior Rokach, and Bracha Shapira, eds. 2022. Recommender Systems Handbook. New York, NY: Springer US. https://doi.org/10.1007/978-1-0716-2197-4.
Spinelli, Larissa, and Mark Crovella. 2017. “Closed-Loop Opinion Formation.” In Proceedings of the 9th International ACM Web Science Conference (WebSci). http://www.cs.bu.edu/faculty/crovella/paper-archive/netsci17-filterbubble.pdf.
———. 2020. “How YouTube Leads Privacy-Seeking Users Away from Reliable Information.” In Proceedings of the Workshop on Fairness in User Modeling, Adaptation, and Personalization (FairUMAP). http://www.cs.bu.edu/faculty/crovella/paper-archive/youtube-fairumap20.pdf.
Wang, Ruoxi, Bin Fu, Gang Fu, and Mingliang Wang. 2017. “Deep & Cross Network for Ad Click Predictions.” In Proc. ADKDD, 12.

Footnotes

  1. See the Course Notes for the definition of the Pearson correlation coefficient.