LONGER: Scaling Up Long Sequence Modeling in Industrial Recommenders

📝 Paper Summary

Industrial Recommender Systems, Sequence Modeling

LONGER enables efficient end-to-end modeling of user behavior sequences up to length 10,000 in industrial recommenders by combining token merging, hybrid attention, and system-level optimizations.

Core Problem

Modeling ultra-long user sequences (length > 1,000) is computationally prohibitive with standard Transformers due to quadratic complexity, forcing systems to use lossy two-stage retrieval or pre-trained embeddings.

Why it matters:

Ultra-long sequences capture crucial long-term interests, improving accuracy and diversity while mitigating information cocoons in recommender systems.
Current industrial practices (two-stage retrieval) create upstream-downstream inconsistency, sacrificing raw information fidelity.
Scaling laws suggest significant performance gains from longer contexts, but hardware constraints prevent direct application of vanilla architectures.

Concrete Example: A user might have clicked a niche item 5,000 interactions ago that is highly relevant to a current candidate. A standard retrieval system selecting only the top-100 recent items would miss this signal, while a full Transformer would run out of GPU memory processing the full history.
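The failure mode in this example can be sketched with a synthetic history. The item IDs and the `recent_window` helper are hypothetical and only illustrate the recency truncation described above; they are not taken from the paper.

```python
def recent_window(history, k):
    """Keep only the k most recent interactions (history is oldest first)."""
    return history[-k:]

# Synthetic history of 10,000 interactions, oldest first. IDs are made up.
history = [f"item_{i}" for i in range(10_000)]
niche_position = len(history) - 5_000      # roughly 5,000 interactions ago
history[niche_position] = "niche_item"

truncated = recent_window(history, 100)
kept_by_truncation = "niche_item" in truncated     # the signal is dropped
kept_by_full_sequence = "niche_item" in history    # the signal is still there
```

Only a model that reads the full sequence can use the niche click, which is the motivation for making full-length modeling affordable.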

Key Novelty

GPU-Efficient End-to-End Long Sequence Modeling

Token Merge: Compresses adjacent tokens into groups using a lightweight inner-transformer to reduce sequence length while preserving local details.
Hybrid Attention: Uses a cross-attention layer to select relevant history followed by self-attention on compressed sequences, drastically reducing computational cost.
System Optimizations: Deploys KV caching to reuse user sequence computations across candidate items, plus mixed-precision training to handle massive scale.

Architecture

The overall architecture of LONGER, illustrating the data flow from input to prediction.

Evaluation Highlights

Reduces FLOPs by ~42.8% compared to vanilla Transformers while maintaining model performance through token merging.
KV Cache Serving cuts the online throughput loss incurred when scaling sequence length from 40% to only 6.8%.
Successfully deployed in dozens of scenarios at ByteDance (including Douyin), scaling user sequences to length 10,000 in production.

Breakthrough Assessment

8/10

Significant industrial contribution. Successfully scales end-to-end long-sequence modeling (10k length) to billion-user production, solving major efficiency bottlenecks that previously forced suboptimal two-stage approaches.

⚙️ Technical Details

Problem Definition

Setting: Click-Through Rate (CTR) and Conversion Rate (CVR) prediction in recommender systems

Inputs: User u with behavior sequence S_u = [i_1...i_L] (where L can be >= 2000), user features u_d, and target item v

Outputs: Predicted probability y_hat that user u interacts with item v

Pipeline Flow

Input Processing: Global Tokens + Sequence Tokens
Token Merge (Compression)
Hybrid Attention Modeling
Prediction Head

System Modules

Global Token Generator

Create anchor representations (target item, UID) that have full receptive field

Model or implementation: Embedding Lookup

Token Merge Module

Compress long sequence by factor K to reduce complexity

Model or implementation: InnerTransformer (lightweight)
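A minimal sketch of the compression step, assuming each group of K adjacent tokens is merged by concatenation. The InnerTransformer refinement inside each group is omitted here, and `merge_tokens` is a hypothetical helper, not the paper's code.

```python
def merge_tokens(tokens, k):
    """Group every k adjacent token vectors and concatenate each group.

    tokens: list of L vectors (lists of floats), each of width d.
    Returns ceil(L / k) vectors of width k * d; a short last group is zero-padded.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    if not tokens:
        return []
    d = len(tokens[0])
    merged = []
    for start in range(0, len(tokens), k):
        group = tokens[start:start + k]
        # Pad a short trailing group so every merged token has width k * d.
        group = group + [[0.0] * d] * (k - len(group))
        merged.append([x for vec in group for x in vec])
    return merged

# Toy input: L = 10 tokens of width d = 2, merged with K = 4.
tokens = [[float(i), float(-i)] for i in range(10)]
merged = merge_tokens(tokens, 4)
```

With L = 10 and K = 4 the sequence shrinks to 3 merged tokens, so a self-attention layer scores 9 token pairs instead of 100, at the price of wider tokens.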

Hybrid Attention Block

Capture dependencies between target item and history, and within history

Model or implementation: Transformer Layers
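The block can be sketched as one cross-attention layer followed by N self-attention layers over the shortened output. This is a simplified single-head version without learned projections, residuals, or feed-forward layers. How the query set is chosen is an assumption here (global tokens plus the `n_query` most recent sequence tokens), since this summary does not specify it.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Single-head scaled dot-product attention without learned projections."""
    d = len(keys[0])
    width = len(values[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(width)])
    return out

def hybrid_attention(global_tokens, seq_tokens, n_query, n_self_layers):
    # Assumption: queries = global tokens + the n_query most recent tokens.
    start = max(0, len(seq_tokens) - n_query)
    queries = global_tokens + seq_tokens[start:]
    context = global_tokens + seq_tokens
    # One cross-attention layer: the short query set reads the full context.
    hidden = attention(queries, context, context)
    # N self-attention layers that only see the short hidden sequence.
    for _ in range(n_self_layers):
        hidden = attention(hidden, hidden, hidden)
    return hidden

global_tokens = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
seq_tokens = [[math.sin(i + j) for j in range(4)] for i in range(50)]
hidden = hybrid_attention(global_tokens, seq_tokens, n_query=8, n_self_layers=2)

cross_pairs = (2 + 8) * (2 + 50)   # score entries in the cross-attention layer
full_pairs = (2 + 50) ** 2         # score entries if every token attended to all
```

In this toy setting the cross-attention layer computes 520 score entries instead of 2,704, and every later layer works on 10 tokens rather than 52.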

KV Cache Serving

Reuse user sequence computations across multiple candidates

Model or implementation: Cache Mechanism
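The reuse idea can be sketched as follows: the user sequence is encoded once, and every candidate item attends to the cached keys and values. `UserKVCache`, `encode_sequence`, and `score_candidate` are hypothetical names, and the encoder is a stand-in (it returns the raw tokens) rather than the production model.

```python
import math

class UserKVCache:
    """Caches per-user key/value vectors so they are built once per request."""

    def __init__(self, encode_fn):
        self._encode_fn = encode_fn   # hypothetical user-sequence encoder
        self._store = {}
        self.encode_calls = 0

    def get(self, user_id, sequence):
        if user_id not in self._store:
            self.encode_calls += 1
            self._store[user_id] = self._encode_fn(sequence)
        return self._store[user_id]

def encode_sequence(sequence):
    # Stand-in for the expensive part: keys and values are the raw tokens.
    return {"keys": sequence, "values": sequence}

def score_candidate(candidate, kv):
    """Attend from one candidate vector to the cached keys/values."""
    d = len(candidate)
    scores = [sum(c * k for c, k in zip(candidate, key)) / math.sqrt(d)
              for key in kv["keys"]]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    pooled = [sum(w * v[j] for w, v in zip(weights, kv["values"])) / total
              for j in range(d)]
    # Toy relevance score: candidate dotted with its attention-pooled history.
    return sum(c * p for c, p in zip(candidate, pooled))

cache = UserKVCache(encode_sequence)
user_sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
candidates = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
scores = [score_candidate(c, cache.get("user_42", user_sequence))
          for c in candidates]
```

Three candidates are scored but `cache.encode_calls` stays at 1, which is the decoupling of user-side computation from candidate-side computation that the module aims for.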

Novel Architectural Elements

InnerTrans-based Token Merge: Inserting a mini-transformer inside the compression step to preserve local semantics better than simple pooling.
Hybrid Attention Topology: Specific stack of 1 Cross-Attention layer followed by N Self-Attention layers to balance global relevance and internal sequence modeling.
Unified Dense/Sparse Synchronous Training: Architecture colocating parameter storage and computation on GPU to eliminate Parameter Server bottlenecks.

Modeling

Base Model: Custom Transformer-based Architecture (LONGER)

Trainable Parameters: Full end-to-end training (Embeddings + Transformer Layers + MLP)

Key Hyperparameters:

sequence_length: Up to 10,000 (implied by title/intro scaling claims)
merge_factor_K: 4 (typical setting)
embedding_dimension: 32
precision: BF16/FP16 (Mixed Precision)
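As a small illustration of what the lower-precision formats trade away, the sketch below rounds values to IEEE 754 half precision (FP16) using Python's `struct` module. BF16 is not covered because `struct` has no format code for it; the values are arbitrary.

```python
import struct

def to_fp16(x):
    """Round a Python float to IEEE 754 half precision and back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

# Near 1.0, adjacent FP16 values are 2**-10 apart, so a small update vanishes.
weight = 1.0
update = 1e-4
fp16_sum = to_fp16(to_fp16(weight) + to_fp16(update))   # rounds back to 1.0
high_precision_sum = weight + update                    # keeps the update
```

This is one reason mixed-precision setups typically keep sensitive accumulations in a higher-precision format while running most of the arithmetic in BF16 or FP16.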

Compute: Deployed on ultra-large-scale GPU clusters. Uses activation recomputation to save memory.
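Activation recomputation can be illustrated on a toy chain of scalar layers: the forward pass stores only one input per segment, and the backward pass rebuilds the missing inputs from that checkpoint. The layer function and the `grad_with_recompute` helper are hypothetical and far simpler than a real training framework.

```python
import math

def forward_layer(w, x):
    return math.tanh(w * x)

def layer_grad(w, x):
    """Derivative of tanh(w * x) with respect to x, evaluated at input x."""
    return w * (1.0 - math.tanh(w * x) ** 2)

def grad_with_recompute(weights, x0, segment):
    """d(output)/d(x0) for a chain of layers, storing one input per segment."""
    if segment <= 0:
        raise ValueError("segment must be positive")
    # Forward pass: keep only the input of the first layer in each segment.
    checkpoints = {}
    x = x0
    for i, w in enumerate(weights):
        if i % segment == 0:
            checkpoints[i] = x
        x = forward_layer(w, x)
    # Backward pass: recompute each segment's inputs from its checkpoint.
    grad = 1.0
    for start in sorted(checkpoints, reverse=True):
        seg_weights = weights[start:start + segment]
        inputs = [checkpoints[start]]
        for w in seg_weights[:-1]:
            inputs.append(forward_layer(w, inputs[-1]))
        for w, xin in zip(reversed(seg_weights), reversed(inputs)):
            grad *= layer_grad(w, xin)
    return grad, len(checkpoints)

weights = [0.5, -1.2, 0.8, 1.5, -0.3, 0.9]
grad, stored = grad_with_recompute(weights, x0=0.7, segment=3)
```

Six layers are differentiated while only two activations are kept between the passes; the cost is one extra partial forward pass per segment.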

Comparison to Prior Work

vs. SIM/TWIN: LONGER is end-to-end and differentiable, avoiding the information loss of hard retrieval steps.
vs. MIMN: LONGER keeps the sequence structure rather than compressing into fixed memory slots, allowing better temporal modeling.
vs. HSTU: LONGER introduces Token Merge and Hybrid Attention specifically to reduce the quadratic cost for ultra-long sequences, whereas HSTU focuses on efficient attention implementation.
vs. Linformer [not cited in paper]: LONGER uses domain-specific token merging (InnerTrans) rather than low-rank projections generic to NLP.

Limitations

Relies on massive industrial GPU infrastructure; likely difficult to replicate in academic settings.
The paper gives few specifics about the 'InnerTrans' architecture (e.g., number of layers and heads).
Evaluation is primarily on proprietary datasets/platforms; no public benchmarks reported.

Reproducibility

Proprietary industrial system. Code is not provided. Dataset is a proprietary billion-scale industrial dataset. Architecture and engineering principles are described, but replication requires significant infrastructure.

📊 Experiments & Results

Evaluation Setup

Industrial recommendation (CTR/CVR prediction) on offline logs and online A/B tests.

Benchmarks:

Industrial Dataset (CTR Prediction) [New]
Online A/B Testing (Live Recommendation)

Metrics:

AUC (Area Under ROC Curve)
LogLoss
FLOPs (Computational Cost)
Throughput / Latency
Statistical methodology: Not explicitly reported in the paper
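For reference, the two accuracy metrics above (AUC and LogLoss) can be computed as in the sketch below. The labels and scores are made-up toy values, and the pairwise AUC is quadratic in the number of examples, so it is meant for illustration rather than billion-scale logs.

```python
import math

def log_loss(labels, probs, eps=1e-15):
    """Mean binary cross-entropy; probabilities are clipped away from 0 and 1."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)

def auc(labels, probs):
    """Chance that a random positive outscores a random negative (ties count 0.5)."""
    pos = [p for y, p in zip(labels, probs) if y == 1]
    neg = [p for y, p in zip(labels, probs) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0
               for a in pos for b in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 0, 1, 0]
probs = [0.9, 0.2, 0.4, 0.6]
toy_auc = auc(labels, probs)          # 3 of 4 positive/negative pairs are ordered correctly
toy_logloss = log_loss(labels, probs)
```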

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Computational efficiency analysis shows LONGER significantly reduces FLOPs compared to vanilla Transformers.
Theoretical Complexity	FLOPs	587	336	-251 (about -42.8%)
Online serving performance demonstrates the effectiveness of the KV Cache optimization.
Online Serving	Throughput Degradation (%)	-40.0	-6.8	+33.2
Optimization techniques (Mixed Precision) improve training efficiency metrics.
Training Efficiency	Throughput Improvement (%)	0	18	+18
Training Efficiency	Memory Usage Reduction (%)	0	18	+18
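The headline percentages follow directly from the table values. The short check below recomputes them; `relative_change` is a hypothetical helper, and the FLOPs figures are used as given, without assuming a unit.

```python
def relative_change(baseline, new):
    """Signed relative change of `new` with respect to `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0

# FLOPs: 587 -> 336, matching the roughly 42.8% reduction quoted in the highlights.
flops_change = relative_change(587, 336)

# Throughput degradation moves from -40.0% to -6.8%, i.e. 33.2 points recovered.
recovered_points = -6.8 - (-40.0)
```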

Experiment Figures

Illustration of the KV Cache Serving strategy.

Main Takeaways

Token merging with K=4 reduces computational load by ~43% with minimal impact on capability.
KV Cache is critical for industrial deployment, effectively decoupling user sequence encoding from candidate scoring.
System-level optimizations (mixed precision, synchronous training) provide double-digit percentage gains in throughput and memory efficiency.
The method scales effectively to sequence lengths of 10,000, validated in large-scale online A/B tests (specific A/B lift numbers are not included in this summary).

📚 Prerequisite Knowledge

Prerequisites

Transformer architecture (Self-Attention, Cross-Attention)
Recommender Systems (CTR prediction, User Behavior Sequences)
GPU training optimizations (Mixed Precision, Activation Recomputation)

Key Terms

Two-stage retrieval: A standard industrial practice where a lightweight model first selects a small subset of items (e.g., top-100) from a long history, which are then fed to a heavy ranking model.

KV Cache: Key-Value Cache—storing precomputed attention representations of the user sequence so they don't need to be recalculated for every candidate item during ranking.

Token Merge: A technique to reduce sequence length by grouping adjacent tokens and representing them as a single vector, often using a small local model (InnerTrans).

FLOPs: Floating Point Operations—a measure of computational cost.

Mixed Precision: Training using lower-precision numerical formats (like BF16 or FP16) to save memory and speed up computation without significant accuracy loss.

Activation Recomputation: A memory-saving technique where intermediate activations are discarded during the forward pass and re-calculated during the backward pass to fit larger models on GPUs.

Global Tokens: Special tokens (like CLS or target item) added to the sequence that can attend to everything, acting as anchors for information aggregation.

Cross-Attention: Attention mechanism where the query comes from one source (e.g., target item) and keys/values come from another (e.g., user history).

Self-Attention: Attention mechanism where queries, keys, and values all come from the same sequence, capturing internal dependencies.