Quantized Inference for OneRec-V2

📝 Paper Summary

Generative Recommendation · Model Quantization · Inference Optimization

OneRec-V2's LLM-like numerical stability enables FP8 quantization and optimized inference infrastructure to nearly double throughput and halve latency without degrading recommendation quality.

Core Problem

Reliably applying low-precision quantization to industrial recommender systems is difficult because traditional models exhibit high-variance weights and memory-bound workloads that limit hardware utilization.

Why it matters:

Traditional rankers are sensitive to quantization noise due to extreme outlier values, making low-precision inference risky
Recommendation workloads are often memory-bound, so faster low-precision compute units (such as Tensor Cores) sit idle and yield minimal speedups
Industrial scale requires maximizing throughput per GPU to control serving costs while maintaining strict latency SLAs

Concrete Example: A traditional ranking model might have weight variances around 10^7 and absolute maximums over 1000, causing severe errors when rounded to FP8. In contrast, OneRec-V2 weights have variance < 0.1, similar to LLMs like Qwen3-8B.
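
To make this contrast concrete, the following is a minimal sketch (not the paper's code) that simulates FP8 rounding with a single per-tensor scale on two synthetic weight sets: one with a small spread and one with a few large outliers. The E4M3 variant of FP8, the synthetic distributions, and the helper names `round_e4m3` and `median_relative_error` are all illustrative assumptions.

```python
import math
import random
import statistics

FP8_MAX = 448.0  # largest finite value in E4M3 (assumed FP8 variant)

def round_e4m3(x: float) -> float:
    """Round x to the nearest E4M3-representable value, saturating at +/-448."""
    if x == 0.0:
        return 0.0
    a = min(abs(x), FP8_MAX)
    # Exponent of the leading bit; values below 2**-6 are subnormal and share one step.
    e = max(math.frexp(a)[1] - 1, -6)
    step = 2.0 ** (e - 3)  # 3 mantissa bits
    return math.copysign(round(a / step) * step, x)

def median_relative_error(weights):
    """Quantize with one per-tensor scale, dequantize, and return the median relative error."""
    scale = max(abs(w) for w in weights) / FP8_MAX
    errors = [
        abs(round_e4m3(w / scale) * scale - w) / abs(w)
        for w in weights
        if w != 0.0
    ]
    return statistics.median(errors)

rng = random.Random(0)
# Synthetic "LLM-like" weights: small spread (variance 0.0025).
well_behaved = [rng.gauss(0.0, 0.05) for _ in range(4000)]
# Synthetic "traditional ranker" weights: same bulk plus a few outliers above 1000.
with_outliers = well_behaved + [5000.0, -5000.0, 3000.0, -3000.0]

print("well-behaved :", median_relative_error(well_behaved))
print("with outliers:", median_relative_error(with_outliers))
```

In this toy setup the outliers inflate the shared scale, so most of the small weights land in the coarse subnormal range of FP8 or are flushed to zero, while the low-variance tensor stays within the format's normal range.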

Key Novelty

FP8 Inference for Generative Recommendation

Empirically establishes that generative recommenders (OneRec-V2) have well-behaved numerical statistics (low variance and magnitude) resembling LLMs, unlike the high-variance, outlier-heavy statistics of traditional rankers
Implements a specialized FP8 post-training quantization framework using block-wise scaling for MoE and dynamic scaling for activations to preserve accuracy
Integrates quantization with a custom inference stack (RecoGEM) that fuses operators to shift the workload from memory-bound to compute-bound
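
The two scaling schemes named above can be sketched in a few lines. This is a simplified pure-Python simulation under stated assumptions, not the authors' kernels: FP8 is assumed to be E4M3, the 4x4 block size is a toy value, activations are assumed to get one scale per token, and the function names (`quantize_weight_blockwise`, `quantize_activation_dynamic`, `fp8_matmul`) are hypothetical.

```python
import math
import random

FP8_MAX = 448.0  # largest finite E4M3 value (assumed FP8 variant)

def round_e4m3(x: float) -> float:
    """Simulate rounding to FP8 E4M3 (3 mantissa bits, saturating at +/-448)."""
    if x == 0.0:
        return 0.0
    a = min(abs(x), FP8_MAX)
    e = max(math.frexp(a)[1] - 1, -6)  # subnormals share the smallest exponent
    step = 2.0 ** (e - 3)
    return math.copysign(round(a / step) * step, x)

def quantize_weight_blockwise(w, block):
    """Static weight quantization: one scale per (block x block) tile of the K x N matrix."""
    K, N = len(w), len(w[0])
    q = [[0.0] * N for _ in range(K)]
    scales = {}
    for k0 in range(0, K, block):
        for n0 in range(0, N, block):
            ks = range(k0, min(k0 + block, K))
            ns = range(n0, min(n0 + block, N))
            amax = max(abs(w[k][n]) for k in ks for n in ns)
            s = (amax or 1.0) / FP8_MAX  # avoid a zero scale for an all-zero tile
            scales[(k0 // block, n0 // block)] = s
            for k in ks:
                for n in ns:
                    q[k][n] = round_e4m3(w[k][n] / s)
    return q, scales

def quantize_activation_dynamic(x):
    """Dynamic activation quantization: one scale per row (token), computed at run time."""
    q, scales = [], []
    for row in x:
        s = (max(abs(v) for v in row) or 1.0) / FP8_MAX
        scales.append(s)
        q.append([round_e4m3(v / s) for v in row])
    return q, scales

def fp8_matmul(x, w, block=4):
    """Y = X @ W with simulated FP8 operands and high-precision accumulation."""
    qx, sx = quantize_activation_dynamic(x)
    qw, sw = quantize_weight_blockwise(w, block)
    M, K, N = len(x), len(w), len(w[0])
    y = [[0.0] * N for _ in range(M)]
    for m in range(M):
        for n in range(N):
            acc = 0.0
            for k0 in range(0, K, block):
                partial = sum(qx[m][k] * qw[k][n] for k in range(k0, min(k0 + block, K)))
                acc += partial * sw[(k0 // block, n // block)]  # rescale per weight block
            y[m][n] = acc * sx[m]  # rescale per token
    return y

rng = random.Random(0)
x = [[rng.gauss(0, 1) for _ in range(16)] for _ in range(4)]
w = [[rng.gauss(0, 0.1) for _ in range(8)] for _ in range(16)]
w[0][0] = 5.0  # an outlier only inflates the scale of its own block
y_fp8 = fp8_matmul(x, w)
y_ref = [[sum(x[m][k] * w[k][n] for k in range(16)) for n in range(8)] for m in range(4)]
```

The design point the sketch illustrates is locality: because each block carries its own scale, a single large weight degrades precision only within its tile rather than across the whole matrix.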

Architecture

Comparison of standard FP16 Matrix Multiplication vs. the proposed FP8 Quantized Matrix Multiplication data flow.

Evaluation Highlights

49% reduction in end-to-end inference latency (139ms → 70ms) compared to FP16 baseline
92% increase in throughput (205 → 394 queries/sec) on production workloads
Zero degradation in core metrics confirmed via extensive online A/B testing
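
The headline percentages follow directly from the raw latency and throughput figures. A quick check, with illustrative variable names:

```python
baseline_latency_ms, fp8_latency_ms = 139, 70
baseline_qps, fp8_qps = 205, 394

latency_reduction = (baseline_latency_ms - fp8_latency_ms) / baseline_latency_ms
throughput_gain = (fp8_qps - baseline_qps) / baseline_qps
throughput_ratio = fp8_qps / baseline_qps

print(f"latency reduction: {latency_reduction:.1%}")  # 49.6%
print(f"throughput gain:   {throughput_gain:.1%}")    # 92.2%
print(f"throughput ratio:  {throughput_ratio:.2f}x")  # 1.92x
```

The reported 49% appears to be 49.6% truncated rather than rounded, and the 1.92x ratio is what "nearly double" refers to.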

Breakthrough Assessment

8/10

Significant industrial contribution showing that generative recommenders can adopt LLM-style efficiency optimizations. Strong results (nearly 2x throughput, roughly half the latency) on production-scale models.

⚙️ Technical Details

Problem Definition

Setting: Low-latency inference for a large-scale generative recommendation model

Inputs: User history and candidate item features

Outputs: Ranked list or generated sequence of recommended items

Pipeline Flow

Group: Quantized Computation (Linear Layers, Attention, MoE)
Group: Infrastructure Optimization (TopK, Kernel Fusion)
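
The Key Terms section notes that TopK is optimized with radix sort. As a rough illustration of the idea (a sequential sketch, not the authors' GPU kernel), the following selects the k largest scores by bucketing on 8-bit digits from the most significant byte down. It assumes non-negative float32 scores such as probabilities and 0 <= k <= len(scores); the names `float32_key` and `radix_topk` are hypothetical.

```python
import struct

def float32_key(score: float) -> int:
    """Reinterpret a non-negative float32 as a uint32; integer order then matches float order."""
    return struct.unpack("<I", struct.pack("<f", score))[0]

def radix_topk(scores, k):
    """Return indices of the k largest scores using MSB-first radix selection."""
    if not 0 <= k <= len(scores):
        raise ValueError("k must be between 0 and len(scores)")
    keys = [float32_key(s) for s in scores]
    candidates = list(range(len(scores)))
    selected = []
    need = k
    for shift in (24, 16, 8, 0):
        buckets = [[] for _ in range(256)]
        for i in candidates:
            buckets[(keys[i] >> shift) & 0xFF].append(i)
        for digit in range(255, -1, -1):   # visit buckets from the largest digit down
            bucket = buckets[digit]
            if len(bucket) < need:
                selected.extend(bucket)    # the whole bucket belongs to the top-k
                need -= len(bucket)
            else:
                candidates = bucket        # the k-th largest element lies in this bucket
                break
    selected.extend(candidates[:need])     # remaining candidates are exact ties
    return selected

# Scores chosen to be exactly representable in float32.
scores = [((i * 37) % 101) / 128 for i in range(101)]
top5 = radix_topk(scores, 5)
print(sorted((scores[i] for i in top5), reverse=True))
```

Each pass only histograms one byte of the surviving candidates, so the selection never fully sorts the input; the returned indices are not ordered by score.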

System Modules

Linear Layers (Attention/FFN) (Quantized Computation)

Perform dense projections and transformations using reduced precision

Model or implementation: OneRec-V2 Backbone

Sparse MoE (Quantized Computation)

Execute conditional computation via grouped GEMMs

Model or implementation: OneRec-V2 Backbone
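
A grouped GEMM for MoE can be pictured as gather, per-expert matrix multiply, and scatter. The sketch below assumes top-1 routing and uses nested lists for clarity; the function names and toy weights are illustrative, and a production kernel would batch all groups in a single launch rather than looping.

```python
from collections import defaultdict

def matmul(a, b):
    """Plain dense GEMM on nested lists: (m x k) @ (k x n)."""
    return [
        [sum(row[k] * b[k][n] for k in range(len(b))) for n in range(len(b[0]))]
        for row in a
    ]

def moe_grouped_gemm(tokens, expert_ids, expert_weights):
    """Group tokens by routed expert, run one GEMM per group, then restore token order."""
    groups = defaultdict(list)
    for idx, expert in enumerate(expert_ids):
        groups[expert].append(idx)
    out = [None] * len(tokens)
    for expert, idxs in groups.items():
        batch = [tokens[i] for i in idxs]               # gather tokens for this expert
        result = matmul(batch, expert_weights[expert])  # one GEMM per expert group
        for i, row in zip(idxs, result):                # scatter back to original positions
            out[i] = row
    return out

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
expert_ids = [0, 1, 0, 1]  # hypothetical top-1 routing decisions
expert_weights = {
    0: [[1.0, 2.0], [3.0, 4.0]],
    1: [[10.0, 20.0], [30.0, 40.0]],
}
out = moe_grouped_gemm(tokens, expert_ids, expert_weights)
print(out)  # [[1.0, 2.0], [30.0, 40.0], [4.0, 6.0], [20.0, 40.0]]
```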

RecoGEM Engine

Manage execution graph, memory layout, and kernel selection (replacing TensorRT default parsing)

Model or implementation: Inference Engine

Novel Architectural Elements

Direct TensorRT graph construction via RecoGEM library (bypassing ONNX)
Integration of block-wise FP8 quantization specifically aligned with MoE expert routing structures

Modeling

Base Model: OneRec-V2 (Fat-MoE architecture)

Training Method: Post-Training Quantization (PTQ)

Adaptation: FP8 Quantization

Trainable Parameters: Approximately 4B backbone parameters (0.5B activated per token)

Compute: End-to-end inference latency of 70 ms (FP8, optimized) vs. 139 ms (FP16 baseline)

Comparison to Prior Work

vs. Traditional RecSys: Exploits the LLM-like statistics of OneRec-V2 to apply aggressive FP8 quantization that would fail on high-variance rankers
vs. Standard LLM Quantization: Adapts techniques to Mixture-of-Experts (MoE) in recommendation, using block-wise scaling to handle routing sparsity
vs. QServe/Atom [not cited in paper]: Focuses specifically on the recommendation domain constraints rather than generic LLM serving

Limitations

Quantization is only applied to compute-dominant Linear and MoE layers; other sensitive components remain in FP16
Relies on custom infrastructure (RecoGEM) rather than off-the-shelf serving frameworks
Evaluation is limited to OneRec-V2; generalizability to other generative recommender architectures is not explicitly tested

📊 Experiments & Results

Evaluation Setup

Online serving environment simulation

Benchmarks:

Production Traffic Simulation (Single-column short-video recommendation) [New]

Metrics:

End-to-end Latency (ms)
Throughput (QPS)
Online Recommendation Core Metrics (A/B testing)
Statistical methodology: Not explicitly reported in the paper

Key Results

System performance comparisons showing massive gains in speed and capacity from the proposed FP8 optimization.
Benchmark	Metric	Baseline	This Paper	Δ
Production Simulation	Latency (ms)	139	70	-69
Production Simulation	Throughput (QPS)	205	394	+189
Throughput Gain Breakdown	Relative Gain (%)	0	27	+27
Throughput Gain Breakdown	Relative Gain (%)	0	42	+42

Experiment Figures

Log-scale histograms comparing weight/activation statistics (Variance, AbsMax) across Traditional Recommendation, OneRec-V2, and Qwen3-8B.

Breakdown of throughput improvements attributed to Infrastructure, Quantization, and Operator Optimizations.

Main Takeaways

Generative recommendation models (OneRec-V2) exhibit weight/activation statistics much closer to LLMs than to traditional rankers, making them suitable for FP8 quantization.
The combination of quantization and infrastructure optimization shifts the workload from memory-bound to compute-bound, maximizing hardware utilization.
Block-wise quantization for MoE layers effectively preserves accuracy while delivering significant speedups.
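
One common way to reason about the memory-bound versus compute-bound distinction is the roofline model, which compares a kernel's arithmetic intensity (FLOPs per byte moved) against the hardware's ratio of peak compute to memory bandwidth. The sketch below is not from the paper; the accelerator numbers are hypothetical placeholders rather than measured GPU specifications, and the GEMM shapes are arbitrary.

```python
def gemm_arithmetic_intensity(m, k, n, in_bytes, out_bytes):
    """FLOPs per byte for an (m x k) @ (k x n) GEMM, counting each operand once."""
    flops = 2 * m * k * n
    bytes_moved = in_bytes * (m * k + k * n) + out_bytes * m * n
    return flops / bytes_moved

def classify(intensity, peak_flops, bandwidth):
    """A kernel is compute-bound once its intensity reaches the roofline ridge point."""
    ridge = peak_flops / bandwidth
    return "compute-bound" if intensity >= ridge else "memory-bound"

# Hypothetical accelerator (assumed values, not a real datasheet).
PEAK_FP8_FLOPS = 1.0e15  # FLOP/s
MEM_BANDWIDTH = 3.0e12   # bytes/s

# FP8 inputs (1 byte) and FP16 outputs (2 bytes).
small_batch = gemm_arithmetic_intensity(8, 4096, 4096, in_bytes=1, out_bytes=2)
large_batch = gemm_arithmetic_intensity(4096, 4096, 4096, in_bytes=1, out_bytes=2)

print(round(small_batch, 1), classify(small_batch, PEAK_FP8_FLOPS, MEM_BANDWIDTH))
print(round(large_batch, 1), classify(large_batch, PEAK_FP8_FLOPS, MEM_BANDWIDTH))
```

Under these assumed numbers the small-batch GEMM sits far below the ridge point and the large one far above it, which is the kind of shift that larger effective batches and fewer intermediate memory round trips are meant to achieve.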

📚 Prerequisite Knowledge

Prerequisites

Understanding of quantization (FP16 vs FP8, calibration, scale factors)
Familiarity with Recommender Systems (Embedding tables, Ranking, MoE)
Knowledge of GPU hardware efficiency (Memory bound vs Compute bound)

Key Terms

OneRec-V2: A generative recommendation model that unifies retrieval and ranking into a conditional sequence generation task, featuring dense computation paths similar to LLMs

FP8: Floating Point 8—a low-precision number format that speeds up matrix multiplication on modern GPUs but requires careful scaling to avoid precision loss

MoE: Mixture of Experts—a neural architecture where different parts of the network (experts) activate for different inputs, allowing huge model capacity with lower compute cost per token

PTQ: Post-Training Quantization—converting a pre-trained model to lower precision without re-training from scratch

GEMM: General Matrix Multiply—the fundamental operation in dense neural networks

RecoGEM: The authors' optimized inference infrastructure library designed to replace standard frameworks like PyTorch or ONNX-Runtime for this specific workload

TopK: An operation to select the K highest probability items; optimized here using radix sort

TMA: Tensor Memory Accelerator—a hardware feature in NVIDIA Hopper GPUs for efficient data movement