Recommender Systems – II

Introduction

In Part II, we will:

  • Study a modern deep learning approach:
    • Deep Learning Recommendation Model (DLRM)
    • Connection to Matrix Factorization
    • Architecture and components
    • System challenges and optimizations
  • Reflect on the societal impact of recommender systems

This section draws heavily on

  • Deep Learning Recommendation Model for Personalization and Recommendation Systems (Naumov et al. 2019)

Deep Learning for Recommender Systems

Why Deep Learning for Recommendations?

Modern recommender systems face unique challenges:

Scale:

  • Billions of users and items
  • Massive embedding tables (100s of GB to TB)
  • Up to 79% of AI inference cycles in production data centers (Naumov et al. 2019)

Data Types:

  • Dense features: continuous (age, time, etc.)
  • Sparse features: categorical (user ID, item ID)
  • Need to handle both efficiently

The Economic Impact

Recommender systems drive substantial business value:

  • Amazon: Up to 35% of revenue attributed to recommendations
  • Netflix: 75% of movies watched come from recommendations
  • Meta/Facebook: Core infrastructure for content ranking and ads

This economic importance motivates sophisticated deep learning approaches.

DLRM: Unifying Two Traditions

The Deep Learning Recommendation Model (DLRM) (Naumov et al. 2019) synthesizes two historical approaches:

Tradition                        | Key Concept                            | DLRM Component
RecSys / Collaborative Filtering | Latent factors, Matrix Factorization   | Embeddings & Dot Products
Predictive Analytics             | Statistical models → Deep Networks     | MLPs for feature processing

This fusion creates a model that efficiently handles both sparse and dense features.

Connection to Matrix Factorization

Recall from Part I that Matrix Factorization approximates \(R \approx WV^T\):

  • User matrix \(W\) and item matrix \(V\) can be viewed as embedding tables
  • The dot product \(w_i^T v_j\) predicts the rating
  • DLRM generalizes this: embeddings for many categorical features, not just users/items

This is why DLRM uses dot products in the interaction layer—it’s a principled way to model feature interactions based on collaborative filtering theory.
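
To make the connection concrete, here is a minimal sketch (in PyTorch, with illustrative sizes that are not from the lecture) of matrix factorization written as two embedding tables whose rows are combined by a dot product; this is the pattern DLRM reuses for arbitrary categorical features.

    import torch
    import torch.nn as nn

    # Illustrative sizes (not from the lecture): 1,000 users, 500 items, d = 16
    n_users, n_items, d = 1000, 500, 16

    user_emb = nn.Embedding(n_users, d)   # rows of W
    item_emb = nn.Embedding(n_items, d)   # rows of V

    def predict_rating(user_ids, item_ids):
        # w_i^T v_j for each (user, item) pair in the batch
        w = user_emb(user_ids)            # (batch, d)
        v = item_emb(item_ids)            # (batch, d)
        return (w * v).sum(dim=1)         # (batch,)

    scores = predict_rating(torch.tensor([0, 3]), torch.tensor([10, 42]))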

Training and Evaluation Dataset

Criteo Ad Click-Through Rate (CTR) challenge (Kaggle challenge; dataset also available on Hugging Face)

🏗️ Dataset Construction

  • Each row corresponds to a display ad served by Criteo; the data cover 24 days of traffic.
  • The first column indicates whether the ad was clicked (1) or not clicked (0).
  • Both positive (clicked) and negative (non-clicked) examples were subsampled, at different rates, to preserve business confidentiality.

🧱 Features

  • 13 integer features
    Mostly count-based; represent numerical properties of the ad, user, or context.

  • 26 categorical features
    Values are hashed into 32-bit integers for anonymization.
    The semantic meaning of these features is undisclosed.
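
As a rough illustration of this layout, the raw Kaggle training file can be loaded as below. This is a sketch only: the file path is illustrative, and the column names simply follow the common I1–I13 / C1–C26 convention for this dataset.

    import pandas as pd

    # One label column, 13 integer features, 26 hashed categorical features
    cols = ["label"] + [f"I{i}" for i in range(1, 14)] + [f"C{i}" for i in range(1, 27)]

    # Illustrative path; the raw file is tab-separated with no header row
    df = pd.read_csv("train.txt", sep="\t", names=cols, nrows=100_000)

    print(df["label"].mean())    # empirical click-through rate in this sample
    print(df["C1"].nunique())    # number of distinct hashed values observed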

DLRM Architecture Overview

DLRM is a dual-path architecture combining sparse and dense feature processing:

Components (Figure 1):

  1. Embeddings: Map sparse categorical features to dense vectors
  2. Bottom MLP: Transform continuous features
  3. Feature Interaction: Explicit 2nd-order interactions via dot products
  4. Top MLP: Final prediction from combined features

Key Insight: Parallel processing paths that merge at the interaction layer.


Let’s examine each component to build intuition.

Embeddings

Embeddings: Map categorical inputs to latent factor space.

  • A learned embedding matrix \(W \in \mathbb{R}^{m \times d}\) for each categorical feature
  • A one-hot vector \(e_i\) has its \(i\)-th entry equal to 1 and all other entries equal to 0
  • The embedding of \(e_i\) is the \(i\)-th row of \(W\), i.e., \(w_i^T = e_i^T W\)

Criteo Dataset: 26 categorical features, each embedded to dimension \(d=128\).
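
A minimal sketch of a single embedding table in PyTorch (the vocabulary size \(m\) is illustrative): looking up index \(i\) returns the \(i\)-th row of \(W\), which is exactly the one-hot product \(e_i^T W\).

    import torch
    import torch.nn as nn

    m, d = 10_000, 128                 # illustrative vocabulary size; d = 128 as in the lecture
    table = nn.Embedding(m, d)         # learned matrix W in R^{m x d}

    i = torch.tensor([7])
    w_i = table(i)                     # (1, 128): the 7th row of W

    # Equivalent (but wasteful) one-hot formulation: e_i^T W
    e_i = torch.zeros(1, m)
    e_i[0, 7] = 1.0
    assert torch.allclose(w_i, e_i @ table.weight)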

DLRM Architecture

[Figure: DLRM architecture]

Dense Features

An advantage of the DLRM architecture is that it can take continuous features as input, such as the user’s age or the time of day.

There is a bottom MLP that transforms these dense features into a latent space of the same dimension \(d\) as the embeddings.

Criteo Dataset Configuration:

  • Input: 13 continuous (dense) features
  • Bottom MLP layers: 512 → 256 → 128
  • Output: 128-dimensional vector (matches embedding dimension)
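
A sketch of the Bottom MLP with the layer sizes quoted above; the ReLU activations are an assumption (a common default, not a detail stated here).

    import torch
    import torch.nn as nn

    bottom_mlp = nn.Sequential(
        nn.Linear(13, 512), nn.ReLU(),
        nn.Linear(512, 256), nn.ReLU(),
        nn.Linear(256, 128), nn.ReLU(),
    )

    dense = torch.randn(32, 13)        # a batch of 32 samples, 13 continuous features
    x = bottom_mlp(dense)              # (32, 128): same dimension d as the embeddings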

DLRM Architecture

[Figure: DLRM architecture]

Optional Sparse Feature MLPs

Optionally, one can add MLPs to transform the sparse features as well.

DLRM Architecture

[Figure: DLRM architecture]

Feature Interactions

Why explicit interactions?

  • Simply concatenating features lets the MLP learn interactions implicitly
  • But explicitly computing 2nd-order interactions is more efficient and interpretable
  • Inspired by Factorization Machines from predictive analytics

How: Compute dot products of all pairs of embedding vectors and processed dense features.

Then concatenate the dot products with the processed dense feature vector (the output of the Bottom MLP).

DLRM Architecture

[Figure: DLRM architecture]

Feature Interactions, Continued


Dimensionality reduction: Using dot products between \(d\)-dimensional vectors yields scalars, avoiding the explosion of treating each element separately.

Criteo Example: 27 total vectors (1 from Bottom MLP + 26 embeddings)

  • Pairwise interactions: \(\binom{27}{2} = \frac{27 \times 26}{2} = 351\) dot products
  • Each dot product is a scalar (interaction strength)
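
A sketch of the interaction step under these dimensions: stack the 27 vectors (Bottom MLP output plus 26 embeddings) and take all pairwise dot products, keeping each unordered pair once via the strictly lower triangle.

    import torch

    batch, d = 32, 128
    x = torch.randn(batch, d)                            # processed dense features (Bottom MLP output)
    embs = [torch.randn(batch, d) for _ in range(26)]    # 26 embedding lookups

    T = torch.stack([x] + embs, dim=1)                   # (batch, 27, d)
    Z = torch.bmm(T, T.transpose(1, 2))                  # (batch, 27, 27) pairwise dot products

    # Strictly lower-triangular entries: 27 * 26 / 2 = 351 scalars per sample
    li, lj = torch.tril_indices(27, 27, offset=-1)
    interactions = Z[:, li, lj]                          # (batch, 351)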

Top MLP

The concatenated vector is passed to a final MLP and then through a sigmoid function to produce the final prediction (e.g., a click or recommendation probability).

This entire model is trained end-to-end using standard deep learning techniques.

Criteo Configuration:

  • Input: 506-dimensional concatenated vector
    • 128 processed dense features
    • 351 pairwise dot products
    • 27 embedding vectors (27 × 1 = 27 after pooling)
  • Top MLP layers: 512 → 256 → 1
  • Output: Sigmoid activation → click probability
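
A sketch of the Top MLP with the quoted sizes, ending in a sigmoid; as above, the ReLU activations and the binary cross-entropy loss are standard choices assumed here rather than taken from the lecture.

    import torch
    import torch.nn as nn

    top_mlp = nn.Sequential(
        nn.Linear(506, 512), nn.ReLU(),
        nn.Linear(512, 256), nn.ReLU(),
        nn.Linear(256, 1),
    )

    z = torch.randn(32, 506)                     # concatenated dense, interaction, and pooled terms
    p_click = torch.sigmoid(top_mlp(z))          # (32, 1): predicted click probability

    # Trained end-to-end against the observed click labels
    labels = torch.randint(0, 2, (32, 1)).float()
    loss = nn.functional.binary_cross_entropy(p_click, labels)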

DLRM Architecture

[Figure: DLRM architecture]

DLRM Dimensions: Criteo Dataset Summary

Putting it all together for the Criteo Ad Kaggle dataset configuration:

Component         | Details
Input Features    | 13 dense (continuous) + 26 sparse (categorical)
Bottom MLP        | 13 → 512 → 256 → 128
Embeddings        | 26 categorical features × 128d each
Interaction Layer | 27 vectors → \(\binom{27}{2} = 351\) dot products
Concatenation     | 128 (dense) + 351 (interactions) + 27 (pooled embeddings) = 506
Top MLP           | 506 → 512 → 256 → 1
Output            | Sigmoid(1) → Click probability

Key observation: All vectors in interaction layer are 128-dimensional, enabling efficient pairwise dot products.
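
The dimension bookkeeping in the table can be verified in a few lines:

    from math import comb

    n_vectors = 1 + 26                      # Bottom MLP output + 26 embeddings
    pairs = comb(n_vectors, 2)              # 351 pairwise dot products
    concat = 128 + pairs + n_vectors        # 128 + 351 + 27
    assert (pairs, concat) == (351, 506)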

The Memory Challenge

Production-scale DLRM models face unique bottlenecks:

Memory Intensive:

  • Embedding tables can be >99.9% of model memory
  • Real datasets have millions of unique IDs
  • Example: Criteo has 26 categorical features, some with 10M+ unique values
  • Total size: 100s of GB to TB

Irregular Access Patterns:

  • Embedding lookups (SparseLengthsSum) have low compute intensity
  • High cache miss rates (vs. dense operations)
  • Memory bandwidth becomes the bottleneck
  • Different from typical DNN workloads (CNNs, RNNs)

This is why DLRM requires specialized system optimizations.
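
For intuition about the access pattern, here is a sketch of a pooled sparse lookup using PyTorch’s nn.EmbeddingBag, which plays a role analogous to the SparseLengthsSum operator mentioned above: each sample gathers a handful of rows scattered across a large table and sums them, so the cost is dominated by memory traffic rather than arithmetic. The table size is illustrative.

    import torch
    import torch.nn as nn

    # Illustrative table: 1M rows x 128 dims ≈ 0.5 GB in fp32; production tables reach 10M+ rows
    table = nn.EmbeddingBag(1_000_000, 128, mode="sum")

    # Sample 0 uses ids[0:2], sample 1 uses ids[2:5]; offsets mark where each sample's ids start.
    # The rows touched are essentially random, hence the high cache-miss rate.
    ids = torch.tensor([17, 421_398, 903, 999_999, 52])
    offsets = torch.tensor([0, 2])
    pooled = table(ids, offsets)       # (2, 128): one pooled embedding per sample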

Parallelization Strategy

DLRM’s size prevents simple data parallelism (can’t replicate massive embedding tables).

Solution: Hybrid Model + Data Parallelism

Model Parallelism for embeddings:

  • Distribute embedding tables across devices
  • Each device stores subset of embeddings
  • Reduces memory per device

Data Parallelism for MLPs:

  • MLPs have fewer parameters
  • Can replicate across devices
  • Process different samples in parallel

Communication: Use “butterfly shuffle” (personalized all-to-all) to gather embeddings for the interaction layer.
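
Below is a minimal single-process simulation of what the butterfly shuffle accomplishes (a real implementation would use a personalized all-to-all collective across devices, e.g. in torch.distributed): before the shuffle, each device holds its own tables’ embeddings for the full batch; afterwards, each device holds every table’s embedding but only for its own minibatch slice, which is what its copy of the interaction layer needs.

    import torch

    n_devices, d = 4, 128
    tables_per_device = 2
    samples_per_device = 8
    batch = n_devices * samples_per_device

    # Model parallelism: device p owns `tables_per_device` embedding tables and has
    # looked them up for the FULL batch -> owned[device][table] has shape (batch, d)
    owned = [[torch.randn(batch, d) for _ in range(tables_per_device)]
             for _ in range(n_devices)]

    # Butterfly shuffle: device p keeps only its slice of the batch from every table,
    # receiving the slices it is missing from the other devices (personalized all-to-all)
    def shuffle_to(p):
        lo, hi = p * samples_per_device, (p + 1) * samples_per_device
        return [tbl[lo:hi] for dev in owned for tbl in dev]

    local = shuffle_to(0)              # 8 tensors of shape (8, 128) destined for "device 0"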

Training Results

Figure 2: DLRM Training Results

Figure 2 shows the training (solid) and validation (dashed) accuracies of DLRM on the Criteo Ad Kaggle dataset.

DLRM achieves comparable or better accuracy than Deep and Cross Network (DCN) (Wang et al. 2017), while being more efficient for sparse feature interactions.

Other Modern Approaches

There are many other modern approaches to recommender systems, for example:

  1. Graph-Based Recommender Systems:
    • Leverage graph structures to capture relationships between users and items.
    • Use techniques like Graph Neural Networks (GNNs) to enhance recommendation accuracy.
  2. Context-Aware Recommender Systems:
    • Incorporate contextual information such as time, location, and user mood to provide more personalized recommendations.
    • Contextual data can be integrated using various machine learning models.
  3. Hybrid Recommender Systems:
    • Combine multiple recommendation techniques, such as collaborative filtering and content-based filtering, to improve performance.
    • Aim to leverage the strengths of different methods while mitigating their weaknesses.
  4. Reinforcement Learning-Based Recommender Systems:
    • Use reinforcement learning to optimize long-term user engagement and satisfaction.
    • Models learn to make sequential recommendations by interacting with users and receiving feedback.

These approaches often leverage advancements in machine learning and data processing to provide more accurate and personalized recommendations.

See (Ricci, Rokach, and Shapira 2022) for a comprehensive overview of recommender systems.

Impact of Recommender Systems

Filter Bubbles

There are a number of concerns with the widespread use of recommender systems and personalization in society.

First, recommender systems are accused of creating filter bubbles.

A filter bubble is the tendency for recommender systems to limit the variety of information presented to the user.

The concern is that a user’s past expression of interests will guide the algorithm in continuing to provide “more of the same.”

This is believed to increase polarization in society, and to reinforce confirmation bias.

Maximizing Engagement

Second, recommender systems in modern usage are often tuned to maximize engagement.

In other words, the objective function of the system is not to present the user’s most favored content, but rather the content that will be most likely to keep the user on the site.

How this works in practice:

  • Objective Functions: Models optimize for metrics like click-through rate, watch time, or session duration
  • A/B Testing: Continuous experimentation to find which content keeps users engaged longer
  • Feedback Loops: User interactions train the model to predict and serve “sticky” content

The Incentive: Sites supported by advertising revenue directly benefit from more engagement time. More engagement means more ad impressions and more revenue.

Extreme Content

However, many studies have shown that sites that strive to maximize engagement do so in large part by guiding users toward extreme content:

  • content that is shocking,
  • or feeds conspiracy theories,
  • or presents extreme views on popular topics.

Given this tendency of modern recommender systems, one of the easiest ways for a third party to create such “clickbait” content is to present false claims.

Methods for addressing these issues are being actively studied at present. They can be addressed:

  • via technology
  • via public policy

Recap and References

BU CS/CDS Research

You can read about some of the work done in Professor Mark Crovella’s group on this topic in the references below (Rastegarpanah, Gummadi, and Crovella 2019; Spinelli and Crovella 2017, 2020).

Recap

Part I (Collaborative Filtering & Matrix Factorization):

  • Collaborative filtering (CF): user-user and item-item similarity approaches
  • Matrix factorization (MF): latent vectors and ALS optimization
  • Practical implementation of ALS on Amazon movie reviews

Part II (Deep Learning & Impact):

  • DLRM: A production-scale deep learning approach unifying RecSys and predictive analytics traditions
  • Connection between embeddings and matrix factorization latent factors
  • DLRM architecture: dual-path design with explicit feature interactions
  • System challenges: massive embedding tables (>99.9% of model memory) and irregular memory access
  • Parallelization: hybrid model + data parallelism with butterfly shuffle communication
  • Societal impact: filter bubbles, engagement maximization, and extreme content concerns

References

Naumov, Maxim et al. 2019. “Deep Learning Recommendation Model for Personalization and Recommendation Systems.” arXiv Preprint arXiv:1906.00091, May. http://arxiv.org/abs/1906.00091.
Rastegarpanah, Bashir, Krishna P. Gummadi, and Mark Crovella. 2019. “Fighting Fire with Fire: Using Antidote Data to Improve Polarization and Fairness of Recommender Systems.” In Proceedings of WSDM. http://www.cs.bu.edu/faculty/crovella/paper-archive/wsdm19-antidote-data.pdf.
Ricci, Francesco, Lior Rokach, and Bracha Shapira, eds. 2022. Recommender Systems Handbook. New York, NY: Springer US. https://doi.org/10.1007/978-1-0716-2197-4.
Spinelli, Larissa, and Mark Crovella. 2017. “Closed-Loop Opinion Formation.” In Proceedings of the 9th International ACM Web Science Conference (WebSci). http://www.cs.bu.edu/faculty/crovella/paper-archive/netsci17-filterbubble.pdf.
———. 2020. “How YouTube Leads Privacy-Seeking Users Away from Reliable Information.” In Proceedings of the Workshop on Fairness in User Modeling, Adaptation, and Personalization (FairUMAP). http://www.cs.bu.edu/faculty/crovella/paper-archive/youtube-fairumap20.pdf.
Wang, Ruoxi, Bin Fu, Gang Fu, and Mingliang Wang. 2017. “Deep & Cross Network for Ad Click Predictions.” In Proc. ADKDD, 12.