Wukong: Towards a Scaling Law for Large-Scale Recommendation

📝 Paper Summary

Deep Learning Recommendation Systems (DLRS) | Click-Through Rate (CTR) Prediction | Scaling Laws

Wukong establishes a scaling law for recommendation systems by replacing standard embedding-table scaling with a stacked Factorization Machine architecture that captures any-order interactions through deeper and wider layers.

Core Problem

Current recommendation models rely on 'sparse scaling' (expanding embedding tables) to improve quality, which fails to capture complex feature interactions and cannot efficiently utilize modern hardware compute capacity.

Why it matters:

Sparse scaling leads to massive parameter counts (trillions) dominated by memory-bound embedding tables, causing prohibitive infrastructure costs
Existing interaction architectures (DLRM, DCNv2) lack effective mechanisms for 'dense scaling' (adding compute/layers), showing diminishing returns or instability when scaled up
Modern hardware accelerators improve primarily in compute capacity, which embedding lookups cannot utilize effectively

Concrete Example: DLRM captures only 2nd-order interactions and cannot scale depth effectively. When model complexity is scaled beyond 100 GFLOP/example, prior approaches stop improving in quality, whereas Wukong continues to improve.

Key Novelty

Stacked Factorization Machines (Wukong)

Uses a multi-layer stack where each layer contains a Factorization Machine Block (FMB) and a Linear Compression Block (LCB), conceptually inspired by binary exponentiation
By stacking FMs, the system captures exponentially higher-order interactions (layer i captures orders 1 to 2^i) without the computational cost of explicit high-order tensor products
Replaces the standard 'dot product then MLP' paradigm with a recursive interaction structure where embeddings are updated layer-by-layer
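
The binary-exponentiation analogy above can be made concrete with a small bookkeeping sketch. It tracks only which interaction orders are present after each layer, under the assumption that the FM step pairs every available term with every other (orders add) and the linear path carries existing orders forward unchanged. The function name `interaction_orders` is hypothetical and is not taken from the paper.

```python
from typing import Set


def interaction_orders(num_layers: int) -> Set[int]:
    """Interaction orders available after `num_layers` stacked layers."""
    orders = {1}  # raw embeddings are first-order terms
    for _ in range(num_layers):
        # FM step: pairing a term of order a with one of order b yields order a + b.
        fm_orders = {a + b for a in orders for b in orders}
        # Linear path: existing orders are carried forward unchanged.
        orders = orders | fm_orders
    return orders


for i in range(1, 5):
    # Highest order after layer i doubles each time: 2, 4, 8, 16.
    print(i, max(interaction_orders(i)))
```

Under these assumptions, layer i holds every order from 1 to 2^i, which matches the claim in the list above.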

Architecture

The complete Wukong architecture, detailing the Embedding Layer, Interaction Stack, and internal structure of the Interaction Layer (FMB + LCB).

Evaluation Highlights

Outperforms state-of-the-art models (DCNv2, FinalMLP, MaskNet) across all six public datasets in terms of AUC
Scales effectively on a large-scale internal dataset up to >100 GFLOP/example, maintaining quality gains where baselines saturate or degrade
Reduces interaction complexity from O(n^2) to O(nk) in the number of embeddings n, where k is the number of compressed embeddings, using a low-rank optimized FM formulation
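
One way to read the O(n^2) to O(nk) reduction is as a reordering of matrix products. The sketch below assumes the pairwise interaction matrix X Xᵀ is followed by an n × k projection Y; computing Xᵀ Y first avoids ever forming the n × n matrix. The names `X` and `Y`, the shapes, and the random values are illustrative assumptions, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 64, 16, 8                 # embeddings, embedding dim, compressed embeddings
X = rng.standard_normal((n, d))     # stacked input embeddings
Y = rng.standard_normal((n, k))     # hypothetical learned projection

# Naive order: builds an n x n interaction matrix before projecting.
full = (X @ X.T) @ Y

# Reordered: the intermediate is only d x k, so the cost grows with n * k, not n^2.
low_rank = X @ (X.T @ Y)

# Matrix multiplication is associative, so both orders give the same result.
print(np.allclose(full, low_rank))
```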

Breakthrough Assessment

8/10

Significant architectural shift from sparse to dense scaling in recommendation systems. Successfully demonstrates scaling laws in a domain where they have been elusive, addressing a major industry bottleneck.

⚙️ Technical Details

Problem Definition

Setting: Click-Through Rate (CTR) prediction for recommendation

Inputs: Mixture of continuous dense features and categorical sparse features

Outputs: Probability of user interaction (e.g., click)

Pipeline Flow

Embedding Layer (Sparse & Dense features → Unified Embeddings)
Interaction Stack (Layer 1 → Layer 2 → ... → Layer L)
Prediction Layer (MLP → Logits)
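
The three stages above can be sketched end to end as follows. This is a minimal illustration, not the paper's implementation: the feature counts, vocabulary sizes, and the names `embed` and `predict` are assumptions, the dense projection is a single linear map rather than an MLP, and the prediction head is one linear layer standing in for the MLP.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                      # embedding dimension (illustrative value)
vocab_sizes = [100, 50]    # two hypothetical categorical features
num_dense = 4              # hypothetical number of continuous features

tables = [rng.standard_normal((v, d)) * 0.1 for v in vocab_sizes]
W_dense = rng.standard_normal((num_dense, d)) * 0.1


def embed(sparse_ids, dense_x):
    """Embedding layer: table lookups plus a projection of the dense features."""
    sparse_emb = [tables[f][i] for f, i in enumerate(sparse_ids)]
    dense_emb = dense_x @ W_dense               # dense features as one d-dim embedding
    return np.stack(sparse_emb + [dense_emb])   # shape (n, d)


def predict(X, interaction_layers, W_out):
    """Run the interaction stack, then map the flattened result to a probability."""
    for layer in interaction_layers:
        X = layer(X)
    logit = X.reshape(-1) @ W_out
    return 1.0 / (1.0 + np.exp(-logit))


X = embed([3, 7], np.ones(num_dense))
W_out = rng.standard_normal(X.size) * 0.1
p = predict(X, [], W_out)   # empty stack: embeddings go straight to the head
print(X.shape, float(p))
```

Each entry of `interaction_layers` is any callable that maps an (n, d) array to another array of embeddings, which is where the stacked interaction layers would plug in.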

System Modules

Embedding Layer

Transform categorical inputs into dense vectors and project continuous features to the same dimension

Model or implementation: Lookup Tables + MLP projections

Interaction Stack (Interaction Modeling)

Capture progressively higher-order feature interactions via stacked layers

Model or implementation: Stack of L identical layers (FMB + LCB)

Factorization Machine Block (FMB) (Interaction Modeling)

Capture 2nd-order interactions of input embeddings and project back to embedding space

Model or implementation: Optimized FM + MLP

Linear Compression Block (LCB) (Interaction Modeling)

Linearly recombine embeddings to preserve existing interaction orders

Model or implementation: Linear projection matrix W_L

Prediction Layer

Map final interaction representations to prediction

Model or implementation: MLP
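
A single interaction layer, as described by the FMB and LCB modules above, might look like the following sketch. It assumes the FMB computes X(XᵀY), flattens the result, and passes it through a one-hidden-layer ReLU MLP that emits n_F embeddings, while the LCB is a plain matrix product W_L X emitting n_L embeddings. Normalization, residual connections, and training are omitted, and `init_params` and the parameter names other than W_L are hypothetical.

```python
import numpy as np


def fmb(X, Y, W1, W2, n_F):
    """Factorization Machine Block: pairwise interactions, then an MLP back to embeddings."""
    d = X.shape[1]
    interactions = X @ (X.T @ Y)                              # (n, k)
    hidden = np.maximum(interactions.reshape(-1) @ W1, 0.0)   # ReLU hidden layer, (h,)
    return (hidden @ W2).reshape(n_F, d)                      # n_F new embeddings


def lcb(X, W_L):
    """Linear Compression Block: linear recombination of the input embeddings."""
    return W_L @ X                                            # (n_L, d)


def interaction_layer(X, p):
    """Concatenate FMB and LCB outputs into the next layer's embeddings."""
    return np.concatenate(
        [fmb(X, p["Y"], p["W1"], p["W2"], p["n_F"]), lcb(X, p["W_L"])], axis=0
    )


def init_params(rng, n, d, k, h, n_F, n_L):
    """Random parameters for one layer that receives n embeddings of size d."""
    return {
        "Y": rng.standard_normal((n, k)) * 0.1,
        "W1": rng.standard_normal((n * k, h)) * 0.1,
        "W2": rng.standard_normal((h, n_F * d)) * 0.1,
        "W_L": rng.standard_normal((n_L, n)) * 0.1,
        "n_F": n_F,
    }


rng = np.random.default_rng(0)
n, d, k, h, n_F, n_L = 6, 8, 4, 16, 3, 2
X = rng.standard_normal((n, d))
layer1 = init_params(rng, n, d, k, h, n_F, n_L)
layer2 = init_params(rng, n_F + n_L, d, k, h, n_F, n_L)   # later layers see n_F + n_L inputs
out = interaction_layer(interaction_layer(X, layer1), layer2)
print(out.shape)  # (5, 8)
```

After the first layer, every layer receives n_F + n_L embeddings, so the stack's width is set by those two hyperparameters rather than by the number of input features.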

Novel Architectural Elements

Stacked Factorization Machines (FMBs) configured to capture exponential interaction orders (binary exponentiation analogy)
Parallel Linear Compression Block (LCB) that carries lower-order interactions forward, so each layer retains the orders captured by the layers before it
Low-rank Optimized FM formulation within blocks to enable scaling to large feature counts

Modeling

Base Model: Wukong

Key Hyperparameters:

d (embedding dimension): Global embedding dimension (value not specified)
l (layers): Number of layers in the Interaction Stack (written L in the Pipeline Flow)
n_F: Number of embeddings generated by FMB
n_L: Number of embeddings generated by LCB
k: Number of compressed embeddings in optimized FM
MLP_size: Hidden size h

Compute: Scales beyond 100 GFLOP/example in experiments
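
To show how these hyperparameters drive compute, the sketch below groups them in a config object and counts matrix-multiply FLOPs for the interaction stack. The count assumes a simplified layer (FMB computed as X(XᵀY) followed by a one-hidden-layer MLP, LCB computed as W_L X) and uses 2 FLOPs per multiply-add; it is a rough estimate, not the paper's accounting. `WukongConfig` and its methods are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WukongConfig:
    d: int         # embedding dimension
    l: int         # number of interaction layers
    n_F: int       # embeddings generated by each FMB
    n_L: int       # embeddings generated by each LCB
    k: int         # compressed embeddings in the optimized FM
    mlp_size: int  # hidden size h of the FMB's MLP

    def layer_matmul_flops(self, n: int) -> int:
        """Approximate matmul FLOPs for one layer that receives n embeddings."""
        fm = 2 * (2 * n * self.d * self.k)   # X^T Y, then X (X^T Y)
        mlp = 2 * (n * self.k * self.mlp_size + self.mlp_size * self.n_F * self.d)
        lcb = 2 * self.n_L * n * self.d      # W_L X
        return fm + mlp + lcb

    def stack_matmul_flops(self, n_input: int) -> int:
        """Sum over layers; every layer after the first sees n_F + n_L embeddings."""
        total, n = 0, n_input
        for _ in range(self.l):
            total += self.layer_matmul_flops(n)
            n = self.n_F + self.n_L
        return total


cfg = WukongConfig(d=8, l=2, n_F=3, n_L=2, k=4, mlp_size=16)
print(cfg.stack_matmul_flops(n_input=6))  # 4704
```

In this simplified count, cost grows with l, n_F, n_L, k, and the MLP size without touching the embedding tables, which is the sense in which the stack scales densely.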

Comparison to Prior Work

vs. DLRM: Wukong captures >2nd order interactions via stacking; DLRM is limited to 2nd order.
vs. DCNv2: Wukong treats embeddings as units (vector-wise) rather than element-wise, reducing compute; Wukong scales effectively where DCNv2 saturates.
vs. xDeepFM: Wukong avoids costly outer products via the stacked FM + MLP design.
vs. HOFM: Wukong captures high orders exponentially (layer i → 2^i) rather than linearly/explicitly.
vs. Transformers (AutoInt+): Wukong uses FM-based blocks which are more specialized for feature interaction than generic self-attention [implied].

Limitations

Computational cost grows with number of layers and embedding size (though O(nk) is better than O(n^2))
Internal dataset results cannot be independently verified
Primary focus is on interaction architecture; sparse scaling (embedding tables) is treated orthogonally
No specific code provided for reproduction

Reproducibility

No code is released, and the internal dataset is proprietary. The public datasets are standard (Criteo, Avazu, etc.), but exact replication would require the specific preprocessing details.

📊 Experiments & Results

Evaluation Setup

CTR prediction on public and proprietary datasets

Benchmarks:

Criteo (CTR Prediction)
Avazu (CTR Prediction)
Movielens-1M (Rating/CTR)
Frappe (Context-aware recommendation)
Internal Large-Scale Dataset (Industrial Recommendation) [New]

Metrics:

AUC (Area Under the ROC Curve)
LogLoss
Statistical methodology: Not explicitly reported in the paper
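
For readers less familiar with the two metrics above, here is a minimal reference sketch in plain Python. AUC is computed from its pairwise definition (the fraction of positive/negative pairs ranked correctly, ties counted as half) and LogLoss as the mean binary cross-entropy. The function names and the toy labels and scores are illustrative only and are not data from the paper.

```python
import math


def log_loss(y_true, y_prob, eps=1e-12):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


def auc(y_true, y_score):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auc(labels, scores))        # 0.75
print(log_loss(labels, scores))
```

The pairwise AUC loop is quadratic in the number of examples, so it suits small checks; production evaluation would normally use a sorted or library implementation.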

Key Results

Wukong consistently achieves the highest AUC across all six public datasets compared to state-of-the-art baselines.

Benchmark	Metric	Baseline	This Paper	Δ
Criteo	AUC	0.8146	0.8157	+0.0011
Avazu	AUC	0.7871	0.7903	+0.0032
Movielens-1M	AUC	0.9328	0.9388	+0.0060
Frappe	AUC	0.9829	0.9856	+0.0027

Scalability experiments on internal data show Wukong continues to improve with complexity, unlike baselines.

Experiment Figures

Figure 3 (implied from context about scaling)

Scaling curves showing Model Quality (AUC/LogLoss) vs. Compute Complexity (GFLOP/example) for Wukong and baselines.

Main Takeaways

Establishes a 'dense scaling law' for recommendation: Wukong's quality improves log-linearly with compute (GFLOPs) across two orders of magnitude.
Optimized FM formulation (low-rank projection) effectively reduces compute/memory cost without sacrificing interaction quality.
Wukong outperforms strong baselines (DCNv2, MaskNet, FinalMLP) on a diverse set of datasets, demonstrating robustness.
Traditional baselines like DCNv2 and DLRM saturate or degrade in quality when scaled up to high FLOP counts (>100 GFLOP/example), whereas Wukong maintains an upward trend.
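
"Log-linear" in the first takeaway means quality changes by a roughly constant amount for each tenfold increase in compute. The sketch below shows how such a trend could be checked by fitting a line against log10(GFLOP/example). All numbers are synthetic, generated from a made-up slope and intercept purely for illustration; none of them come from the paper.

```python
import numpy as np

# Synthetic compute budgets spanning two orders of magnitude (GFLOP/example).
gflops = np.array([1.0, 3.0, 10.0, 30.0, 100.0])

# Made-up ground truth used only to generate the toy curve.
true_intercept, true_slope = 0.80, 0.005
quality = true_intercept + true_slope * np.log10(gflops)

# Fit quality = intercept + slope * log10(compute); polyfit returns highest degree first.
slope, intercept = np.polyfit(np.log10(gflops), quality, deg=1)
print(f"gain per 10x compute: {slope:.4f}")
```

On real measurements, the residuals of this fit would indicate whether a model follows the trend or saturates at high compute.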

📚 Prerequisite Knowledge

Prerequisites

Factorization Machines (FM)
Deep Learning Recommendation Models (DLRM)
Embedding tables
Scaling laws (as seen in LLMs)
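
As a refresher on the Factorization Machine prerequisite: a classic FM scores all pairwise (2nd-order) interactions between embedding vectors, and a well-known identity lets that sum be computed without enumerating pairs. The sketch below checks the identity numerically. The function names and random inputs are illustrative, and this is the standard FM trick rather than Wukong's own formulation.

```python
import numpy as np


def pairwise_sum_naive(V):
    """Sum of dot products over all pairs i < j: O(n^2 d)."""
    n = V.shape[0]
    return sum(float(V[i] @ V[j]) for i in range(n) for j in range(i + 1, n))


def pairwise_sum_fast(V):
    """Same quantity via 0.5 * (||sum_i v_i||^2 - sum_i ||v_i||^2): O(n d)."""
    s = V.sum(axis=0)
    return 0.5 * float(s @ s - (V * V).sum())


rng = np.random.default_rng(0)
V = rng.standard_normal((10, 4))   # 10 feature embeddings of dimension 4
print(np.isclose(pairwise_sum_naive(V), pairwise_sum_fast(V)))
```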

Key Terms

DLRS: Deep Learning Recommendation Systems—neural networks designed to rank content for users

sparse scaling: Improving model quality by increasing the size of embedding tables (memory-intensive)

dense scaling: Improving model quality by increasing the depth/width of interaction layers (compute-intensive)

FMB: Factorization Machine Block—a module in Wukong that captures 2nd-order interactions of its inputs

LCB: Linear Compression Block—a module in Wukong that linearly transforms inputs to preserve lower-order information

interaction order: The number of features combined in a single term (e.g., 2nd order = x_i * x_j, 3rd order = x_i * x_j * x_k)

GFLOP: Giga Floating Point Operations—a measure of computational complexity

AUC: Area Under the Curve—a standard metric for binary classification performance

DLRM: Deep Learning Recommendation Model—a standard baseline architecture using dot products for interactions

DCNv2: Deep & Cross Network v2—a baseline model using explicit feature crossing layers