OneRec-V2: A generative recommendation model that unifies retrieval and ranking into a conditional sequence generation task, featuring dense computation paths similar to LLMs
FP8: Floating Point 8—a low-precision number format that speeds up matrix multiplication on modern GPUs but requires careful scaling to avoid precision loss
MoE: Mixture of Experts—a neural architecture where different parts of the network (experts) activate for different inputs, allowing huge model capacity with lower compute cost per token
PTQ: Post-Training Quantization—converting a pre-trained model to lower precision without re-training from scratch
GEMM: General Matrix Multiply—the fundamental operation in dense neural networks
RecoGEM: The authors' optimized inference infrastructure library designed to replace standard frameworks like PyTorch or ONNX-Runtime for this specific workload
TopK: An operation that selects the K highest-probability items; optimized here using radix sort
TMA: Tensor Memory Accelerator—a hardware feature in NVIDIA Hopper GPUs for efficient data movement
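
To make the TopK entry concrete, here is a minimal CPU sketch of radix-based selection in Python. It is not the RecoGEM kernel: the source says only that TopK is optimized with radix sort, so the most-significant-digit radix selection used here is a stand-in assumption, and the names `float_to_key` and `radix_top_k` are hypothetical. Scores are treated as float32 values, and NaN inputs are not handled.

```python
import struct


def float_to_key(x: float) -> int:
    """Map a float32 value to a uint32 whose unsigned order matches float order."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    # Negative floats: invert every bit. Non-negative floats: set the sign bit.
    return (~bits & 0xFFFFFFFF) if bits & 0x80000000 else (bits | 0x80000000)


def radix_top_k(scores, k):
    """Return indices of the k largest scores, highest first (hypothetical helper)."""
    if not 0 <= k <= len(scores):
        raise ValueError("k must be between 0 and len(scores)")
    keys = [float_to_key(s) for s in scores]
    candidates = list(range(len(scores)))  # indices still undecided
    selected = []                          # indices known to be in the top k
    remaining = k
    # Examine one 8-bit digit per pass, most significant digit first.
    for shift in (24, 16, 8, 0):
        if remaining == 0 or len(candidates) <= remaining:
            break
        buckets = [[] for _ in range(256)]
        for i in candidates:
            buckets[(keys[i] >> shift) & 0xFF].append(i)
        # Walk buckets from the largest digit down.
        for digit in range(255, -1, -1):
            bucket = buckets[digit]
            if len(bucket) <= remaining:
                # The whole bucket fits inside the top k.
                selected.extend(bucket)
                remaining -= len(bucket)
            else:
                # The k-th element lies in this bucket: refine it on the next digit.
                candidates = bucket
                break
        else:
            candidates = []
    # Leftover candidates either all fit or are exact ties on the full key.
    selected.extend(candidates[:remaining])
    return sorted(selected, key=lambda i: keys[i], reverse=True)


if __name__ == "__main__":
    probs = [0.1, 0.7, 0.05, 0.9, 0.3]
    print(radix_top_k(probs, 2))  # indices of the two largest probabilities
```

Each pass discards every bucket that lies entirely above or below the K-th element, so only the bucket that straddles the cutoff is examined again. This is why radix approaches avoid fully sorting the candidate set; a GPU version would build the per-digit histograms in parallel.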