Principled Synthetic Data Enables the First Scaling Laws for LLMs in Recommendation

📝 Paper Summary

LLMs for Recommendation, Scaling Laws, Synthetic Data Generation

The paper establishes the first power-law scaling for recommender LLMs by replacing noisy, biased user logs with a layered synthetic curriculum that decouples semantics, collaborative patterns, and sequential behavior.

Core Problem

LLMs for recommendation fail to exhibit predictable scaling laws because raw user interaction logs are sparse, noisy, and riddled with systemic biases (popularity, position, and exposure bias).

Why it matters:

Scaling laws are essential for planning large investments in data and compute, yet none exist for continual pre-training (CPT) of recommendation LLMs
Training on biased logs causes models to internalize and amplify system flaws rather than learning true user preferences
Prior attempts (e.g., PLUM) showed 'sub-scaling' where larger models (3B) failed to consistently outperform smaller ones (900M) due to data deficiencies

Concrete Example: A user clicks the first item in a list simply because it was shown first (Position Bias). If trained on this raw log, an LLM learns to recommend items based on screen position rather than user preference, reinforcing the bias. The paper replaces this with synthetic 'unbiased' walks on a user-item graph.

Key Novelty

Layered Synthetic Curriculum for Recommendation CPT

Deconstructs recommendation data into two clean layers: (1) item-text alignment and collaborative filtering rules to teach foundational knowledge, and (2) unbiased synthetic user interaction histories
Generates synthetic interaction histories via graph-based random walks that simulate user behavior without the position or popularity biases inherent in real logs
Discovers that this principled data enables robust power-law scaling (L = L_inf + A*D^-alpha) where raw data failed
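Written out, the scaling form quoted above reads as follows. D is taken here to be the dataset size (presumably in training tokens, which is an assumption), L_inf is the asymptotic loss, and A and alpha are fitted constants. The log-linear rearrangement is a standard identity rather than something taken from the paper; it shows why alpha can be read as the slope on a log-log plot of the reducible loss.

```latex
L(D) = L_{\infty} + A \, D^{-\alpha}
\quad\Longrightarrow\quad
\log\bigl(L(D) - L_{\infty}\bigr) = \log A - \alpha \log D
```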

Evaluation Highlights

+130% relative improvement in Recall@100 for SASRec trained on synthetic data vs. real data, suggesting the synthetic patterns generalize better
Establishes first robust scaling laws for Rec-LLMs (0.6B to 8B params), with User Interaction History data showing strongest scaling (alpha approx 0.45-0.59)
Asymmetric transfer: Adding Collaborative Filtering data reduces asymptotic loss on User Interaction History tasks by 31% (L_inf drops from 0.95 to 0.66), while reverse transfer does not hold
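The 31% figure can be checked directly from the two L_inf values quoted above. The helper name below is illustrative and not from the paper.

```python
def relative_reduction(before: float, after: float) -> float:
    """Fractional drop from `before` to `after` (0.25 means a 25% reduction)."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before


# L_inf on UIH tasks without and with Collaborative Filtering data,
# as quoted in the summary.
reduction = relative_reduction(0.95, 0.66)
print(f"{reduction:.1%}")
```

The exact ratio is 0.29 / 0.95, about 30.5%, which the summary rounds to 31%.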

Breakthrough Assessment

9/10

First successful demonstration of scaling laws in recommendation, a major open problem. The 130% gain over real data and the discovery of asymmetric data synergy are highly significant.

⚙️ Technical Details

Problem Definition

Setting: Continual Pre-training (CPT) of LLMs for sequential recommendation tasks

Inputs: Synthetic sequences of user-item interactions and item descriptions

Outputs: Next-item prediction (generative recommendation)

Pipeline Flow

Data Layer 1 Generation: Item-Text Alignment & Collaborative Filtering
Data Layer 2 Generation: Synthetic Unbiased User Interaction Histories
Continual Pre-training: Training LLM on curriculum
Evaluation: Downstream ranking & Scaling law analysis

System Modules

Layer 1: Foundational Knowledge (Data Generation)

Teach item semantics and co-occurrence patterns

Model or implementation: Synthetic Data Generator
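The summary does not describe how Layer 1 text is produced, so the following is only a minimal sketch of what item-text alignment lines and collaborative filtering statements could look like. The sentence templates, the `min_count` threshold, the toy catalog, and the co-occurrence source are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(item_sets):
    """Count how often each unordered item pair appears in the same set."""
    counts = Counter()
    for items in item_sets:
        for a, b in combinations(sorted(set(items)), 2):
            counts[(a, b)] += 1
    return counts


def item_text_lines(titles, descriptions):
    """One alignment line per item (hypothetical template)."""
    return [
        f"Item {item_id} is titled '{titles[item_id]}'. {descriptions[item_id]}"
        for item_id in sorted(titles)
    ]


def cf_lines(item_sets, titles, min_count=2):
    """One statement per item pair seen together at least min_count times."""
    lines = []
    for (a, b), n in sorted(cooccurrence_counts(item_sets).items()):
        if n >= min_count:
            lines.append(f"Users who liked '{titles[a]}' also liked '{titles[b]}'.")
    return lines


# Toy catalog and co-occurrence sets, invented for illustration only.
titles = {"i1": "Trail Shoes", "i2": "Running Socks",
          "i3": "Water Bottle", "i4": "Desk Lamp"}
descriptions = {"i1": "Lightweight shoes for off-road running.",
                "i2": "Cushioned socks for long runs.",
                "i3": "Insulated bottle for sports.",
                "i4": "Adjustable LED lamp."}
item_sets = [["i1", "i2", "i3"], ["i1", "i2"], ["i2", "i3"], ["i1", "i4"]]

alignment = item_text_lines(titles, descriptions)
rules = cf_lines(item_sets, titles)
```

Keeping the two generators separate mirrors the summary's point that semantics and collaborative signals are taught as distinct kinds of data.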

Layer 2: Behavioral Simulation (Data Generation)

Generate unbiased sequential user histories

Model or implementation: Graph Random Walk Generator
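A graph random walk of the kind named here can be sketched as an alternating item, user, item walk on a bipartite graph. This is an illustrative assumption, not the paper's generator: the function names and toy edges are hypothetical, and the uniform neighbor choice is a placeholder. A plain uniform walk still visits high-degree (popular) items more often, so any real debiasing step would have to be added on top.

```python
import random


def build_bipartite(edges):
    """Adjacency maps for an undirected user-item graph."""
    user_items, item_users = {}, {}
    for user, item in edges:
        user_items.setdefault(user, []).append(item)
        item_users.setdefault(item, []).append(user)
    return user_items, item_users


def random_walk_history(start_item, user_items, item_users, length, rng):
    """Item sequence from an item -> user -> item walk with uniform choices."""
    history = [start_item]
    current = start_item
    while len(history) < length:
        user = rng.choice(item_users[current])   # hop to a user of this item
        current = rng.choice(user_items[user])   # hop to one of that user's items
        history.append(current)
    return history


# Toy graph, invented for illustration.
edges = [("u1", "a"), ("u1", "b"), ("u2", "b"),
         ("u2", "c"), ("u3", "c"), ("u3", "a")]
user_items, item_users = build_bipartite(edges)
history = random_walk_history("a", user_items, item_users,
                              length=5, rng=random.Random(0))
```

Passing a seeded `random.Random` instance keeps the generated histories reproducible without touching global random state.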

Recommender LLM

Predict next item in sequence

Model or implementation: Transformer-based LLM (0.6B to 8B parameters)

Novel Architectural Elements

Layered Synthetic Data Curriculum: A structured data pipeline that decouples semantics, collaborative signals, and sequential behavior to enable predictable scaling

Modeling

Base Model: Transformer-based LLMs (Parameters: 0.6B, 1B, up to 8B)

Training Method: Continual Pre-training (CPT)

Objective Functions:

Purpose: Predict the next token in the sequence to learn user preferences.

Formally: Standard Causal Language Modeling (CLM) loss.
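For reference, the causal language modeling loss is the mean negative log-likelihood of each token given the tokens before it. The sketch below computes it from raw logits in plain Python; the function names and toy inputs are illustrative, and a real training run would use a deep learning framework instead.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_norm for x in logits]


def causal_lm_loss(logits_per_position, token_ids):
    """Mean negative log-likelihood of token t+1 under the logits at position t."""
    targets = token_ids[1:]  # the first token has no preceding context
    if not targets or len(logits_per_position) < len(targets):
        raise ValueError("need at least two tokens and one logit row per target")
    total = 0.0
    for logits, target in zip(logits_per_position, targets):
        total -= log_softmax(logits)[target]
    return total / len(targets)


# Toy check: uniform logits over a 4-token vocabulary give a loss of ln(4).
uniform_rows = [[0.0, 0.0, 0.0, 0.0]] * 3
loss = causal_lm_loss(uniform_rows, [0, 1, 2])
```

In the recommendation setting the tokens would encode items and their text, so minimizing this loss on an interaction sequence amounts to next-item prediction.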

Training Data:

Total tokens: 163B
Mixture: Item-Text Alignment, Collaborative Filtering, User Interaction History (UIH)

Key Hyperparameters:

training_tokens: 163B
model_scales: ['0.6B', '1B', '8B']

Compute: Not reported in the paper

Comparison to Prior Work

vs. PLUM: Uses principled synthetic data instead of raw logs; demonstrates robust scaling where PLUM failed (sub-scaling)
vs. LUM: Takes a data-centric rather than model-centric approach; argues that architectural fixes cannot overcome fundamental data flaws
vs. RecGAN [not cited in paper]: Focuses on 'pedagogical' data creation to teach principles, rather than using GANs, which mimic the statistical properties of flawed real data

Limitations

No statistical significance tests reported
Relies entirely on synthetic data quality; generation process details are high-level
Computational cost of generating 163B synthetic tokens not detailed

Reproducibility

No replication artifacts mentioned in the paper. Code URL, specific prompts for data generation, and trained weights are not provided.

📊 Experiments & Results

Evaluation Setup

Downstream sequential recommendation ranking and Scaling Law analysis

Benchmarks:

Standard Sequential Models (SASRec, GRU4Rec, NARM, STAMP) (Sequential Recommendation)
Scaling Law Analysis (Perplexity Scaling across 7 domains) [New]

Metrics:

Recall@10, Recall@100, Recall@1000
Perplexity (Scaling Law Loss)
Scaling Exponent (alpha)
Statistical methodology: Not explicitly reported in the paper
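With a single held-out target per user, Recall@K reduces to a hit rate over users. A minimal sketch, with hypothetical function names and toy rankings:

```python
def recall_at_k(ranked_items, target, k):
    """1.0 if the held-out target is among the top-k ranked items, else 0.0."""
    if k <= 0:
        raise ValueError("k must be positive")
    return 1.0 if target in ranked_items[:k] else 0.0


def mean_recall_at_k(rankings, targets, k):
    """Average Recall@K over users, one ranked list and one target per user."""
    if len(rankings) != len(targets) or not rankings:
        raise ValueError("need one non-empty ranking per target")
    hits = [recall_at_k(r, t, k) for r, t in zip(rankings, targets)]
    return sum(hits) / len(hits)


# Toy rankings for three users, invented for illustration.
rankings = [["a", "b", "c"], ["c", "a", "b"], ["b", "c", "a"]]
targets = ["b", "b", "a"]
recall_2 = mean_recall_at_k(rankings, targets, k=2)
recall_3 = mean_recall_at_k(rankings, targets, k=3)
```

Only the first user's target falls in the top 2, so `recall_2` is one third, while every target falls in the top 3.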

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
SASRec, synthetic vs. real training data	Recall@100	Not reported in the paper	Not reported in the paper	+130% (relative)
Scaling exponent, UIH vs. Item-Text Alignment	Scaling Exponent (alpha)	0.15	0.59	+0.44
Scaling exponent, UIH vs. CF	Scaling Exponent (alpha)	0.35	0.59	+0.24
UIH loss convergence, without vs. with CF data	L_inf (Asymptotic Loss)	0.95	0.66	-0.29

Scaling law analysis shows consistent power-law behavior across data modalities, with User Interaction History (UIH) showing the steepest learning curve.

Ablation studies reveal asymmetric synergy: CF data helps UIH learning significantly, but UIH data does not help CF learning.
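To make the exponents concrete: under L = L_inf + A*D^-alpha, multiplying the data by a factor m scales the reducible loss (L - L_inf) by m^-alpha. The short sketch below applies this identity to the three exponents reported above; the function name is illustrative.

```python
def reducible_loss_factor(alpha, data_multiplier=2.0):
    """Factor applied to the reducible loss A*D^-alpha when D is multiplied."""
    if data_multiplier <= 0:
        raise ValueError("data_multiplier must be positive")
    return data_multiplier ** (-alpha)


# Exponents as reported in the summary (UIH uses the upper end of its range).
exponents = {"Item-Text Alignment": 0.15, "CF": 0.35, "UIH": 0.59}
factors = {name: reducible_loss_factor(a) for name, a in exponents.items()}
for name, factor in factors.items():
    print(f"{name}: reducible loss x{factor:.3f} per doubling of data")
```

Each doubling of data leaves roughly 90% of the reducible loss for Item-Text Alignment, 78% for CF, and 66% for UIH, which is what "steepest learning curve" means in practice.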

Main Takeaways

Standard sequential models (SASRec, etc.) trained on principled synthetic data outperform those trained on real data (e.g., +130% Recall@100 for SASRec), indicating that the synthetic data captures more generalizable patterns.
First demonstration of robust power-law scaling (L = L_inf + A*D^-alpha) for LLMs in recommendation, validating the data-centric approach.
Clear hierarchy of learning efficiency: UIH data scales best (alpha ~ 0.45-0.59), followed by CF (0.35), then Item-Text alignment (0.15).
Asymmetric data synergy: Collaborative Filtering data serves as a critical pedagogical foundation for learning sequential user history, significantly lowering asymptotic loss, whereas the reverse is not true.
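The summary does not say how the scaling curves were fitted. One simple way to recover L_inf, A, and alpha from (D, loss) pairs, offered here only as an assumption-laden sketch, is to grid-search L_inf and fit a straight line to log(L - L_inf) against log D for each candidate. The function names, the grid resolution, and the noiseless toy data are all hypothetical.

```python
import math


def fit_loglog(ds, reducible):
    """Least-squares line through (log D, log reducible loss).

    Returns (A, alpha, sum of squared residuals in log space).
    """
    xs = [math.log(d) for d in ds]
    ys = [math.log(r) for r in reducible]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return math.exp(intercept), -slope, sse


def fit_power_law(ds, losses, grid_steps=2000):
    """Grid-search L_inf in [0, min(losses)) and keep the best log-log fit."""
    best = None
    upper = min(losses)
    for i in range(grid_steps):
        l_inf = upper * i / grid_steps
        a, alpha, sse = fit_loglog(ds, [loss - l_inf for loss in losses])
        if best is None or sse < best[3]:
            best = (l_inf, a, alpha, sse)
    return best[0], best[1], best[2]


# Noiseless toy curve with L_inf = 0.66, A = 1.0, alpha = 0.5 (invented values;
# D is in arbitrary units).
toy_ds = [1, 2, 4, 8, 16, 32, 64]
toy_losses = [0.66 + 1.0 * d ** -0.5 for d in toy_ds]
l_inf_hat, a_hat, alpha_hat = fit_power_law(toy_ds, toy_losses)
```

On real, noisy measurements a nonlinear least-squares routine would likely be more robust than this grid search, and the asymmetric-transfer result would show up as a lower fitted L_inf for the UIH curve once CF data is added.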

📚 Prerequisite Knowledge

Prerequisites

Scaling laws (Kaplan et al., Chinchilla)
Sequential Recommendation (SASRec, GRU4Rec)
Large Language Model Pre-training
Collaborative Filtering concepts

Key Terms

CPT: Continual Pre-training, i.e., further training a base LLM on domain-specific data to adapt it to a new task

User Interaction History (UIH): A sequence of items a user has interacted with, used to predict future interests

Collaborative Filtering (CF): Recommendation method based on patterns of interactions (e.g., users who bought X also bought Y)

Power-law scaling: A relationship in which the reducible loss falls as a power of the resource (data or compute), so every fixed multiplicative increase in resources shrinks it by the same factor, set by the exponent alpha

Recall@K: Evaluation metric measuring whether the true target item appears in the top K recommendations

Scaling exponent (alpha): The rate at which loss decreases as dataset size increases; higher alpha means faster learning

L_inf: The irreducible loss term in the scaling law; the loss floor that remains even with unlimited data

Asymptotic loss: The value the loss approaches as training data grows without bound; in the scaling law L = L_inf + A*D^-alpha this is L_inf

Perplexity: A measurement of how well a probability model predicts a sample; lower is better
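Perplexity is the exponential of the mean per-token negative log-likelihood, so it is never below 1 for a valid probability model. The L_inf values of 0.95 and 0.66 quoted earlier are below 1, which suggests they are log-losses rather than perplexities, although the summary does not say so explicitly. A minimal sketch with an illustrative function name:

```python
import math


def perplexity(token_nlls):
    """exp of the mean per-token negative log-likelihood (natural log)."""
    if not token_nlls:
        raise ValueError("need at least one token")
    return math.exp(sum(token_nlls) / len(token_nlls))


# A model that assigns probability 1/4 to every observed token
# has a perplexity of 4.
uniform_ppl = perplexity([math.log(4)] * 3)

# Converting the quoted L_inf values, under the assumption that they are
# natural-log losses.
ppl_without_cf = perplexity([0.95])
ppl_with_cf = perplexity([0.66])
```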