LLaTTE: Scaling Laws for Multi-Stage Sequence Modeling in Large-Scale Ads Recommendation

📝 Paper Summary

Sequential Recommendation · Scaling Laws · Click-Through Rate (CTR) Prediction

LLaTTE establishes that ads recommendation follows LLM-like scaling laws when semantic features are used, enabling a two-stage architecture that offloads heavy sequence processing to an asynchronous upstream model.

Core Problem

Production recommendation systems are constrained to shallow models by strict latency budgets and rely on sparse ID features that plateau quickly, preventing them from exploiting the power-law scaling seen in LLMs.

Why it matters:

Current industrial systems fail to capture long-term user intent because they cannot process long sequence histories (thousands of actions) in real-time
There is a gap between research (deep transformers) and production (shallow FM-based models), limiting the adoption of scaling laws in revenue-critical systems

Concrete Example: A standard ID-based model might learn that a user clicked 'shoes', but its accuracy stops improving as layers are added. LLaTTE uses semantic features (content embeddings), so each added layer keeps improving its predictions, for example inferring the user's intent to buy 'running gear' from a 5000-action history.

Key Novelty

Two-Stage Semantic Scaling Paradigm

Demonstrates that semantic features (content embeddings) are a prerequisite for scaling, effectively 'bending the curve' to allow deeper models to continue improving where ID-only models plateau
Splits inference into an asynchronous 'Upstream' stage (massive, processes long history) and a synchronous 'Online' stage (lightweight, fuses upstream signal), preserving scaling gains under latency constraints

Architecture

Figure: the LLaTTE architecture, showing the interaction between the Sequence Module (Transformer) and the Non-Sequence Module (DHEN), and the two-stage deployment.

Evaluation Highlights

+4.3% Conversion Rate (CVR) uplift on Facebook Feed and Reels in live production experiments
+0.25% Normalized Entropy (NE) improvement on primary revenue-generating models (significant in mature ads systems)
Achieves a ~50% Transfer Ratio, meaning half of the theoretical gain from the massive upstream model is successfully preserved in the constrained online environment

Breakthrough Assessment

9/10

The establishment of empirical scaling laws for industrial recommendation and the successful deployment of a massive multi-stage transformer at Meta's scale together represent a significant operational and theoretical advance.

⚙️ Technical Details

Problem Definition

Setting: Multi-task ads ranking predicting engagement probabilities (CTR, CVR) given user sequence, ad, and context.

Inputs: User sequence S_u (actions with timestamps/IDs/content), Ad features x_i, Context x_c

Outputs: Vector of probabilities for engagement events (e.g., click, conversion)

Pipeline Flow

Upstream Stage: Long Sequence -> Large LLaTTE -> User Embedding (Cached)
Online Stage: Recent Sequence + Cached Embedding + Context -> Small LLaTTE + Non-Sequence Module -> Prediction
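
The two stages above communicate only through a cached user embedding. The sketch below illustrates that handoff under loud assumptions: the encoders are stand-in mean-pooling functions rather than transformers, the cache is a plain dictionary, and every name (`EMBEDDING_CACHE`, `upstream_refresh`, `online_predict`, `mean_pool`) is hypothetical and not taken from the paper.

```python
import math

# Hypothetical in-memory cache standing in for the store between the two stages.
EMBEDDING_CACHE = {}

def mean_pool(vectors, dim):
    """Stand-in for a sequence encoder: averages the action vectors."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def upstream_refresh(user_id, long_history, dim):
    """Asynchronous stage: encode the long history and cache the result."""
    EMBEDDING_CACHE[user_id] = mean_pool(long_history, dim)

def online_predict(user_id, recent_history, context, dim, weights):
    """Synchronous stage: fuse cached embedding, recent actions, and context."""
    cached = EMBEDDING_CACHE.get(user_id, [0.0] * dim)  # fallback for uncached users
    recent = mean_pool(recent_history, dim)
    features = cached + recent + context  # list concatenation, not addition
    logit = sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-logit))  # sigmoid -> engagement probability
```

The point of the split is that `upstream_refresh` can run off the request path at any cost, while `online_predict` only pays for a cache lookup plus a short recent window.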

System Modules

Upstream Sequence Encoder

Process long-term user history to generate compressed user representations

Model or implementation: Large LLaTTE (Transformer with MLA)

Online Sequence Encoder (Real-time Inference)

Process recent short-term history and fuse with upstream embedding

Model or implementation: Small LLaTTE (Transformer with MLA + Pyramidal Reduction)

Non-Sequence Backbone (Real-time Inference)

Process static user/ad/context features and interact with sequence summary

Model or implementation: DHEN (Deep and Hierarchical Ensemble Network)

Task Heads (Real-time Inference)

Generate final probability scores

Model or implementation: Shallow MLPs

Novel Architectural Elements

Target-Aware Adaptive Transformer combining MLA (Multi-head Latent Attention) with Pyramidal Output reduction
Asymmetric Two-Stage Architecture where Upstream/Online models share architecture but differ in scale (>45x FLOPs difference) and context window

Modeling

Base Model: LLaTTE (Custom Transformer variant)

Training Method: Multi-task Supervised Learning

Objective Functions:

Purpose: Optimize for click and conversion probability accuracy.

Formally: Weighted Multi-task Binary Cross-Entropy Loss
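
A weighted multi-task binary cross-entropy can be sketched as a weighted sum of per-task mean BCE terms. The summary does not give the exact form, so the per-task weights, the task names, and the function name `weighted_multitask_bce` below are illustrative assumptions.

```python
import math

def weighted_multitask_bce(probs, labels, task_weights, eps=1e-12):
    """Weighted sum over tasks of the mean binary cross-entropy.

    probs / labels: dict mapping task name -> list of values (assumed layout).
    task_weights:   dict mapping task name -> scalar weight (assumed).
    """
    total = 0.0
    for task, weight in task_weights.items():
        p_list, y_list = probs[task], labels[task]
        bce = 0.0
        for p, y in zip(p_list, y_list):
            p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
            bce -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
        total += weight * bce / len(p_list)
    return total
```

For example, with tasks `"click"` and `"conversion"`, raising the conversion weight makes errors on the rarer conversion labels count more toward the total loss.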

Key Hyperparameters:

sequence_length_T: 500 to 5000
upstream_flop_multiplier: >45x (relative to online sequence module)

Compute: Not reported in the paper

Comparison to Prior Work

vs. TransAct/PinnerFormer: LLaTTE systematically applies scaling laws to both stages and uses a unified architecture rather than disparate modules
vs. SASRec/BERT4Rec: Integrates non-sequence features (semantic/dense) directly into the scaling formulation, proving they are prerequisites for deep scaling
vs. HSTU/OneTrans: Decouples the sequence module so that upstream and online components can scale independently, whereas HSTU/OneTrans advocate pure-sequence architectures (a coupled design)

Limitations

Relies on high-quality semantic features; scaling benefits diminish with ID-only features
Requires complex asynchronous infrastructure to manage the upstream/online handoff
Model width acts as a bottleneck; depth scaling is ineffective until sufficient width is established

Reproducibility

No replication artifacts mentioned in the paper. The system is deployed at Meta (proprietary data and infrastructure).

📊 Experiments & Results

Evaluation Setup

Large-scale industrial ads ranking on Meta production traffic

Benchmarks:

Internal Meta Production Data (Ads CTR/CVR Prediction) [New]

Metrics:

Normalized Entropy (NE)
Conversion Rate (CVR)
Statistical methodology: Not explicitly reported in the paper
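
Normalized Entropy, listed above, is the average log loss divided by the entropy of the background positive rate. The helper below is a minimal sketch of that standard definition; the function name and the choice to raise on single-class labels are assumptions of this sketch.

```python
import math

def normalized_entropy(probs, labels, eps=1e-12):
    """Average log loss divided by the entropy of the empirical positive rate."""
    n = len(labels)
    base_rate = sum(labels) / n
    if base_rate <= 0.0 or base_rate >= 1.0:
        raise ValueError("NE is undefined when all labels belong to one class")
    log_loss = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        log_loss -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    log_loss /= n
    base_entropy = -(base_rate * math.log(base_rate)
                     + (1.0 - base_rate) * math.log(1.0 - base_rate))
    return log_loss / base_entropy
```

A model that always predicts the background rate scores exactly 1.0, so values below 1.0 indicate the model beats that trivial baseline, and lower is better.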

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Production deployment results demonstrating real-world impact:
Facebook Feed and Reels	Conversion Rate (CVR) uplift (%)	0.0	4.3	+4.3
Internal Test Set	Normalized Entropy (NE) improvement (%)	0.00	0.25	+0.25
Transfer Ratio analysis, measuring the efficiency of the multi-stage architecture (the baseline of 100% corresponds to preserving the full upstream gain):
Internal Test Set	Transfer Ratio (%)	100	50	-50
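
One plausible reading of the Transfer Ratio is the gain observed in the online model divided by the gain of the standalone upstream model. The exact formula is not given in this summary, so the definition, the function name, and the numbers below are assumptions for illustration only.

```python
def transfer_ratio(upstream_gain, online_gain):
    """Fraction of the upstream model's gain that survives in the online model.

    Both gains are assumed to be measured in the same units
    (for example, NE improvement in percent).
    """
    if upstream_gain <= 0:
        raise ValueError("upstream_gain must be positive")
    return online_gain / upstream_gain

# Hypothetical numbers, not results from the paper:
# an upstream gain of 0.8 and an online gain of 0.4 give a ratio of 0.5.
example_ratio = transfer_ratio(0.8, 0.4)
```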

Main Takeaways

Semantic features are not just additive but multiplicative; they 'bend the scaling curve', increasing the scaling exponent and preventing the plateau seen with ID-only features
Sequence modeling in recommendation follows predictable log-linear scaling laws with compute, similar to LLMs
Model width is a capacity bottleneck; a critical width threshold must be met before scaling depth yields returns
The two-stage architecture successfully bridges the gap between massive offline compute (>45x FLOPs) and strict online latency limits
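
A log-linear scaling law of the kind described above has the form metric = a + b · log10(compute), and its coefficients can be estimated by ordinary least squares. The sketch below does this in plain Python; the function name is hypothetical, and the data points in the usage lines are synthetic values generated from a known line, not measurements from the paper.

```python
import math

def fit_log_linear(compute, metric):
    """Least-squares fit of metric = intercept + slope * log10(compute)."""
    xs = [math.log10(c) for c in compute]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(metric) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, metric))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Synthetic data generated from metric = 0.9 - 0.01 * log10(compute).
synthetic_compute = [1e15, 1e16, 1e17, 1e18]
synthetic_metric = [0.9 - 0.01 * math.log10(c) for c in synthetic_compute]
intercept, slope = fit_log_linear(synthetic_compute, synthetic_metric)
```

Under this reading, a larger magnitude of `slope` for a lower-is-better metric means each tenfold increase in compute buys more improvement, which is one way to quantify 'bending the curve'.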

📚 Prerequisite Knowledge

Prerequisites

Transformer architectures (Self-Attention)
Factorization Machines (FM) / Deep Learning Recommendation Models (DLRM)
Scaling Laws (Power laws in neural networks)
Asynchronous Inference

Key Terms

LLaTTE: LLM-Style Latent Transformers for Temporal Events—the paper's proposed transformer architecture designed for efficiency and scaling

MLA: Multi-head Latent Attention—a memory-efficient attention mechanism (from DeepSeek) that compresses Key-Value heads into a latent vector

NE: Normalized Entropy—a standard metric for ads ranking (average log loss normalized by the entropy of the background click rate)

DHEN: Deep and Hierarchical Ensemble Network, a non-sequence backbone architecture used for processing static sparse and dense features

Upstream Model: A large, asynchronous model that processes long user histories to generate cached user embeddings, not bound by request-time latency

Online Model: A smaller, synchronous model that serves real-time requests using cached upstream embeddings and recent user actions

Transfer Ratio: A metric quantifying how much of the performance gain from the large upstream model is preserved when its output is used by the smaller online model

Pyramidal Reduction: An architectural optimization that selectively drops older tokens at deeper transformer layers to reduce computation
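
Pyramidal Reduction, as defined above, can be sketched as a loop that truncates the sequence after each layer so that deeper layers see only the most recent tokens. The recency-based truncation rule, the `keep_per_layer` schedule, and the identity `layer_fn` placeholder are assumptions of this sketch, not details confirmed by the paper.

```python
def pyramidal_forward(tokens, keep_per_layer, layer_fn=lambda seq: seq):
    """Apply layers in order, keeping only the newest tokens after each one.

    tokens:         sequence ordered oldest -> newest.
    keep_per_layer: number of tokens retained after each layer (each >= 1).
    layer_fn:       placeholder for a transformer layer (identity by default).
    """
    seq = list(tokens)
    lengths = []
    for keep in keep_per_layer:
        if keep < 1:
            raise ValueError("each layer must keep at least one token")
        seq = layer_fn(seq)
        if keep < len(seq):
            seq = seq[-keep:]  # drop the oldest tokens
        lengths.append(len(seq))
    return seq, lengths
```

With eight tokens and a schedule of `[8, 4, 2]`, the layers see 8, 8, and 4 tokens respectively, and the newest two tokens form the output. Because self-attention cost grows with sequence length, shrinking the sequence at depth reduces the compute spent in later layers.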