RankMixer: Scaling Up Ranking Models in Industrial Recommenders

📝 Paper Summary

Industrial Recommendation Systems | CTR Prediction / Ranking

RankMixer scales industrial ranking models to one billion parameters by replacing inefficient CPU-era modules with a hardware-aware architecture using multi-head token mixing and per-token feed-forward networks.

Core Problem

Traditional ranking models use handcrafted feature-crossing modules inherited from the CPU era, which suffer from extremely low Model Flops Utilization (MFU) on modern GPUs and fail to scale effectively.

Why it matters:

Industrial recommenders must meet strict latency bounds at high QPS, so scaling is viable only if the added computational cost is kept under control
Existing scaling attempts often yield modest or negative gains because they just widen layers without addressing the memory-bound nature of heterogeneous feature interactions
Standard Transformers (self-attention) are suboptimal for recommendation because inner-product similarity is hard to compute meaningfully between heterogeneous feature spaces (e.g., user vs. item IDs)

Concrete Example: Previous baseline models at ByteDance achieved only ~4.5% MFU on GPUs because their operators were memory-bound. Scaling them up linearly increased latency beyond acceptable limits, preventing the realization of scaling laws seen in NLP.
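
To make the MFU figures concrete, the following is a minimal sketch of the calculation, taking MFU as achieved FLOPs per second divided by the hardware's theoretical peak. The function name, the request rate, and the 100 TFLOP/s peak are illustrative assumptions, not measurements from the paper.

```python
def mfu(flops_per_request: float, requests_per_second: float,
        peak_flops_per_second: float) -> float:
    """Model FLOPs Utilization: achieved FLOPs/s divided by hardware peak FLOPs/s."""
    achieved = flops_per_request * requests_per_second
    return achieved / peak_flops_per_second


# Hypothetical numbers, chosen only to land on the two utilization levels quoted above.
PEAK = 100e12  # assumed accelerator peak: 100 TFLOP/s
low = mfu(flops_per_request=4.5e9, requests_per_second=1_000,
          peak_flops_per_second=PEAK)
high = mfu(flops_per_request=45e9, requests_per_second=1_000,
           peak_flops_per_second=PEAK)
print(f"{low:.1%} -> {high:.1%}")
```

Read this way, a tenfold MFU gain at a fixed request rate means roughly ten times more useful computation per request on the same hardware, which is the headroom a larger model can occupy.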

Key Novelty

Hardware-Aware RankMixer Architecture

Replaces quadratic self-attention with a parameter-free Multi-Head Token Mixing module to handle cross-token interactions without expensive inner-product calculations
Uses Per-Token Feed-Forward Networks (FFNs) to isolate parameters for different feature subspaces, preventing high-frequency features from dominating the learning process
Extends to a Sparse Mixture-of-Experts (MoE) variant with a dynamic routing strategy to scale capacity to 1 billion parameters while keeping inference cost constant

Architecture

[Figure: a RankMixer block processing T tokens]

Evaluation Highlights

Boosted Model Flops Utilization (MFU) from 4.5% to 45% by replacing handcrafted modules with the RankMixer architecture
Scaled online ranking model parameters by 70x (up to 1 billion) without increasing inference latency or serving cost
Achieved +0.3% user active days and +1.08% total in-app usage duration in full-traffic A/B testing on Douyin Feed Recommendation

Breakthrough Assessment

8/10

Significant industrial result showing that scaling laws can hold in recommendation systems. A 1B-parameter model was deployed in a high-QPS production environment by substantially improving hardware utilization (MFU).

⚙️ Technical Details

Problem Definition

Setting: Click-Through Rate (CTR) prediction in large-scale industrial recommendation

Inputs: Multi-field feature data including User IDs, Item IDs, context features, and sequence features, tokenized into T feature tokens

Outputs: Probability of user interaction (ranking score)

Pipeline Flow

Input Layer: Feature Embedding & Semantic Grouping
Tokenization: Alignment into T tokens
RankMixer Blocks (L layers): Token Mixing → Per-Token FFN/MoE
Output Pooling & Prediction

System Modules

Feature Tokenization

Groups heterogeneous features into semantically coherent clusters and projects them into fixed-dimension tokens

Model or implementation: Linear Projection
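
A minimal sketch of this tokenization step, assuming the grouped embeddings are concatenated in group order, cut into T equal chunks, and each chunk is projected to the token dimension by its own linear map. The function `tokenize`, the zero-padding, and the random weight initialisation are hypothetical choices for illustration. Plain Python lists are used to keep the example dependency-free.

```python
import random


def tokenize(groups, num_tokens, dim, seed=0):
    """Concatenate grouped embeddings, cut into num_tokens chunks, project each to dim."""
    flat = [x for group in groups for x in group]      # group members stay adjacent
    chunk = -(-len(flat) // num_tokens)                # ceiling division
    flat += [0.0] * (chunk * num_tokens - len(flat))   # zero-pad the last chunk
    rng = random.Random(seed)
    tokens = []
    for t in range(num_tokens):
        piece = flat[t * chunk:(t + 1) * chunk]
        # Hypothetical per-token projection weights (random here, learned in practice).
        w = [[rng.gauss(0.0, 0.1) for _ in range(chunk)] for _ in range(dim)]
        tokens.append([sum(wi * xi for wi, xi in zip(row, piece)) for row in w])
    return tokens


# Three illustrative feature groups with different embedding widths.
user, item, context = [0.1] * 5, [0.2] * 3, [0.3] * 4
tokens = tokenize([user, item, context], num_tokens=4, dim=8)
print(len(tokens), len(tokens[0]))  # 4 8
```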

Multi-Head Token Mixing

Facilitates information exchange across tokens using parameter-free shuffling and projection

Model or implementation: Splitting, Shuffling, and Projection
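
The split-and-shuffle part of this module can be sketched as follows: each token is cut into H equal heads, and the h-th output token is the concatenation of head h from every input token. The sketch covers only these two parameter-free steps. The function name and the toy values are assumptions for illustration.

```python
def multi_head_token_mixing(tokens, num_heads):
    """tokens: T lists of length D. Returns num_heads lists of length T * D // num_heads."""
    dim = len(tokens[0])
    assert dim % num_heads == 0, "token dimension must divide evenly into heads"
    head = dim // num_heads
    mixed = []
    for h in range(num_heads):
        merged = []
        for tok in tokens:
            merged.extend(tok[h * head:(h + 1) * head])  # head h of every token
        mixed.append(merged)
    return mixed


tokens = [[1, 2, 3, 4],
          [5, 6, 7, 8]]
print(multi_head_token_mixing(tokens, num_heads=2))
# [[1, 2, 5, 6], [3, 4, 7, 8]]
```

No weights are involved, so the operation is a pure data rearrangement. Choosing the number of heads equal to the number of tokens keeps the output the same shape as the input, and every output token then contains a slice of every input token.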

Per-Token FFN (or MoE)

Processes each token with dedicated parameters to model distinct feature subspaces

Model or implementation: Independent MLPs per token position (or Sparse MoE)
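
A minimal sketch of the dense per-token FFN, assuming a two-layer MLP with a GELU activation and no biases. The helper names, the activation choice, and the expansion factor are assumptions for illustration. The point is only that token position t is always processed by its own weight pair.

```python
import math
import random


def gelu(x):
    """Exact GELU via the Gaussian error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def matvec(w, x):
    return [sum(a * b for a, b in zip(row, x)) for row in w]


def init_ffn(dim, hidden, rng):
    """One independent (w1, w2) pair; biases omitted for brevity."""
    w1 = [[rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(hidden)]
    w2 = [[rng.gauss(0.0, 0.1) for _ in range(hidden)] for _ in range(dim)]
    return w1, w2


def per_token_ffn(tokens, params):
    """Apply the t-th FFN to the t-th token; no weights are shared across positions."""
    out = []
    for tok, (w1, w2) in zip(tokens, params):
        hidden = [gelu(v) for v in matvec(w1, tok)]
        out.append(matvec(w2, hidden))
    return out


rng = random.Random(0)
T, D, K = 3, 4, 2                      # tokens, token width, hidden expansion
params = [init_ffn(D, K * D, rng) for _ in range(T)]
tokens = [[1.0] * D for _ in range(T)]  # identical inputs at every position
out = per_token_ffn(tokens, params)
print(out[0] != out[1])  # identical inputs, different outputs per position
```

With a shared FFN, the identical inputs above would produce identical outputs. Here they differ because each position owns its parameters.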

Novel Architectural Elements

Per-token FFNs: Dedicating separate MLP parameters for each feature token position instead of sharing weights like standard Transformers
Multi-head Token Mixing: A parameter-free operator for cross-token interaction that replaces self-attention to reduce memory bottlenecks
Hybrid Dense/MoE scaling: Seamless transition from dense Per-token FFNs to Sparse-MoE for 1B+ parameter scaling

Modeling

Base Model: RankMixer (Custom Transformer-like architecture)

Training Method: Standard supervised learning for CTR prediction (and MoE routing training)

Objective Functions:

Purpose: Ensure balanced expert usage in MoE.
Formally: L_reg, weighted by a coefficient λ, to keep the average active-expert ratio near the budget

Purpose: CTR prediction accuracy.
Formally: Standard Cross-Entropy / Log Loss (implied for ranking)
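
The two objectives above combine into a single training loss of the form task loss plus λ times a regularizer. The sketch below assumes binary cross-entropy for the task term and uses the mean absolute gate activation as a stand-in for L_reg. The exact form of the regularizer is not given in this summary, so that choice, the function names, and the toy gate values are all assumptions.

```python
import math


def bce(label, prob, eps=1e-7):
    """Binary cross-entropy (log loss) for one example."""
    p = min(max(prob, eps), 1.0 - eps)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


def total_loss(labels, probs, gates, lam):
    """Return (task loss + lam * regularizer, fraction of active experts).

    gates[i][e] is the non-negative routing weight of expert e for example i;
    an expert counts as active when its gate is strictly positive.
    """
    n_gates = sum(len(row) for row in gates)
    task = sum(bce(y, p) for y, p in zip(labels, probs)) / len(labels)
    # Assumed stand-in regularizer: mean absolute gate value, which pushes gates to zero.
    reg = sum(abs(g) for row in gates for g in row) / n_gates
    active_ratio = sum(g > 0 for row in gates for g in row) / n_gates
    return task + lam * reg, active_ratio


labels = [1, 0]
probs = [0.9, 0.2]
gates = [[0.5, 0.0, 0.0, 0.3],
         [0.0, 0.0, 0.7, 0.0]]
loss, active_ratio = total_loss(labels, probs, gates, lam=0.01)
print(round(active_ratio, 3))  # 0.375
```

In a setup like this, λ would be tuned (or adapted during training) until the observed active-expert ratio sits near the intended budget.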

Training Data:

Trillion-scale production dataset from Douyin

Key Hyperparameters:

model_parameters: Scaled up to 1 Billion
MFU_improvement: From 4.5% to 45%

Compute: Deployed on GPUs (inference latency maintained despite 70x parameter increase)

Comparison to Prior Work

vs. Transformer: RankMixer replaces self-attention with Token Mixing to handle heterogeneous feature spaces and improve GPU MFU
vs. DLRM: RankMixer introduces Per-token FFNs to prevent high-frequency feature domination, unlike shared MLPs in DLRM
vs. DHEN: RankMixer focuses on hardware-aware design (MFU) rather than just complex architecture search

Limitations

Tokenization strategy relies on domain knowledge for semantic grouping, which may not be fully automated
Specifics of the dynamic routing strategy for MoE are briefly described but lack extensive ablation in the text
Evaluation is primarily on proprietary industrial datasets; reproducibility for outside researchers is limited

Reproducibility

No replication artifacts mentioned in the paper. The system is proprietary to ByteDance (Douyin).

📊 Experiments & Results

Evaluation Setup

Industrial recommendation ranking (Click-Through Rate prediction)

Benchmarks:

Douyin Feed Recommendation (video recommendation, online A/B test)
Douyin Advertisement (ad click prediction, online A/B test)

Metrics:

Active Days
App Duration
MFU (Model Flops Utilization)
Inference Latency

Statistical methodology: Not explicitly reported in the paper

Key Results

Online A/B tests show user engagement gains after deployment (values are relative changes versus the control group, in %).

Benchmark	Metric	Baseline	This Paper	Δ
Douyin Feed Recommendation	App Duration (relative change, %)	0.00	+1.08	+1.08
Douyin Feed Recommendation	Active Days (relative change, %)	0.00	+0.30	+0.30

Hardware efficiency improvements allow massive scaling.

Benchmark	Metric	Baseline	This Paper	Δ
Production Serving	MFU (Model Flops Utilization, %)	4.5	45.0	+40.5 points
Production Serving	Parameter Scale (multiple of baseline)	1.0	70.0	+69.0

Main Takeaways

Decoupling parameter growth from FLOPs allows scaling models by 70x, as in the reported deployment, while maintaining latency constraints
Hardware-aware design (RankMixer) is crucial for industrial recommenders; merely stacking layers (as in NLP) fails due to strict latency/QPS requirements
Per-token FFNs model heterogeneous feature spaces better than shared parameters, preventing dominant features from overshadowing long-tail features
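
One way to see how parameters and FLOPs can come apart is to compare a shared FFN with per-token FFNs of the same shape. The back-of-envelope sketch below assumes a two-layer FFN with expansion factor k, ignores biases, and counts a multiply-add as 2 FLOPs. The sizes T=16, D=768, k=4 are hypothetical, not the paper's configuration.

```python
def ffn_params(dim, expansion):
    """Weights in a two-layer FFN: dim x (k*dim) plus (k*dim) x dim, biases ignored."""
    return 2 * dim * expansion * dim


def layer_cost(num_tokens, dim, expansion, per_token):
    """Return (parameter count, FLOPs per forward pass) for one FFN layer."""
    one_ffn = ffn_params(dim, expansion)
    params = one_ffn * (num_tokens if per_token else 1)
    flops = num_tokens * 2 * one_ffn   # every token passes through exactly one FFN
    return params, flops


T, D, K = 16, 768, 4                    # hypothetical sizes
shared = layer_cost(T, D, K, per_token=False)
dedicated = layer_cost(T, D, K, per_token=True)
print(shared)     # (4718592, 150994944)
print(dedicated)  # (75497472, 150994944)
```

Under these assumptions the per-token layer holds T times the parameters at identical FLOPs, because each token still passes through a single FFN. A sparse MoE extends the same idea by adding experts of which only a few run per token.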

📚 Prerequisite Knowledge

Prerequisites

Deep Learning Recommendation Models (DLRM)
Transformer architecture (Self-Attention, FFN)
Mixture-of-Experts (MoE)
Model Flops Utilization (MFU)

Key Terms

MFU: Model Flops Utilization—the ratio of the actual floating-point operations performed by the model per second to the theoretical peak performance of the hardware

DLRM: Deep Learning Recommendation Model—a standard architecture for recommendation combining embedding layers for categorical features and dense layers for numerical features

Token Mixing: A mechanism to mix information across different tokens (features) without using pair-wise attention scores, often using simple projection or shuffling

MoE: Mixture-of-Experts—a neural network architecture where different subsets of the network (experts) are activated for different inputs to increase capacity without increasing inference cost

QPS: Queries Per Second—a measure of the throughput of a serving system

PFFN: Per-Token Feed-Forward Network—a design where each token position has its own independent set of FFN parameters, rather than sharing weights across all tokens

ROI: Return on Investment—in this context, the performance gain achieved per unit of additional computational cost or latency

Quantization: The process of mapping input values from a large set (like 32-bit floats) to output values in a smaller set (like 8-bit integers) to reduce model size and speed up computation