OneRec-Think: In-Text Reasoning for Generative Recommendation

📝 Paper Summary

Generative Recommendation
Reasoning in LLMs

OneRec-Think integrates explicit textual reasoning into generative recommendation via a unified LLM framework, using a novel beam-aware reward function to optimize for the multi-valid nature of user preferences.

Core Problem

Existing generative recommenders (like OneRec) operate as black-box implicit predictors lacking explicit reasoning, while current reasoning-based methods are often limited to reranking or lack scalability.

Why it matters:

Implicit models cannot explain why an item was chosen, reducing user trust and interpretability
Standard Chain-of-Thought methods fail in recommendation because user preferences are 'multi-valid' (many correct answers), causing sparse rewards during training
Deploying heavy reasoning models in real-time industrial systems is computationally prohibitive due to latency constraints

Concrete Example: A user who just watched several sad movies might want a comedy next to cheer up. A standard generative model might just predict another sad movie based on pattern matching. OneRec-Think generates a rationale: 'The user has watched intense dramas; to alleviate negative emotions, a relaxing comedy is appropriate,' and then predicts the comedy.

Key Novelty

Unified Generative Reasoning with Rollout-Beam Reward

Interleaves 'itemic' tokens (representing items) with natural language text to perform reasoning and recommendation in a single autoregressive flow
Introduces 'Rollout-Beam' reward for Reinforcement Learning, which credits a reasoning path if *any* item in the subsequent beam search matches the target, addressing reward sparsity
Uses a 'Think-Ahead' inference architecture that pre-computes reasoning and initial tokens offline, allowing real-time completion online

Architecture

Overview of the OneRec-Think framework including Itemic Alignment, Reasoning Activation, and Reasoning Enhancement stages.

Evaluation Highlights

Achieves state-of-the-art results on Amazon Beauty, Toys, and Sports datasets, outperforming both sequential (SASRec) and generative (ReaRec, TIGER) baselines
In live industrial deployment on Kuaishou, increases APP Stay Time by 0.159% against a strong online baseline
Ablation studies show that adding reasoning to the aligned base model raises Recall@5 on the Beauty dataset from 0.0658 to 0.0768, a relative gain of about 17%

Breakthrough Assessment

8/10

Successfully bridges the gap between explicit LLM reasoning and industrial-scale generative recommendation with a practical solution for latency (Think-Ahead) and training stability (Rollout-Beam).

⚙️ Technical Details

Problem Definition

Setting: Sequential recommendation as autoregressive token generation

Inputs: User interaction history sequence S_u = (s_{v_1}, ..., s_{v_n}) consisting of itemic tokens

Outputs: A reasoning sequence tau followed by the itemic tokens s_{v_{n+1}} of the next item
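
The input/output layout above can be illustrated with a minimal sketch. The `<item_L{level}_{code}>` surface form, the three-codes-per-item assumption, and the whitespace split standing in for a text tokenizer are all hypothetical choices made for illustration, not the paper's actual format.

```python
from typing import List, Tuple

# Assumption: each item is identified by one code per level, three levels in total.
Item = Tuple[int, int, int]


def item_to_tokens(item: Item) -> List[str]:
    """Render an item as a run of itemic tokens (hypothetical surface form)."""
    return [f"<item_L{level}_{code}>" for level, code in enumerate(item)]


def build_example(history: List[Item], rationale: str, target: Item):
    """Return (prompt, completion) for one training example.

    prompt     = itemic tokens of the interaction history S_u
    completion = reasoning text tau, then the itemic tokens of the next item
    """
    prompt = [tok for item in history for tok in item_to_tokens(item)]
    # Whitespace split is a stand-in for the LLM's real text tokenizer.
    completion = rationale.split() + item_to_tokens(target)
    return prompt, completion


prompt, completion = build_example(
    history=[(3, 14, 15), (9, 26, 5)],
    rationale="user wants something lighter",
    target=(3, 58, 97),
)
```

Because text and itemic tokens share one sequence, a single autoregressive pass can produce the rationale and then the item.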

Pipeline Flow

Data Construction (Pruned Contexts) -> Reasoning Activation (SFT)
Reasoning Enhancement (RL with GRPO)
Industrial Inference (Think-Ahead Architecture)

System Modules

Itemic Alignment

Map item semantics into the LLM's textual embedding space

Model or implementation: Qwen-8B (Base LLM)

Reasoning Activation

Teach the model to generate reasoning paths (rationales) before recommendations

Model or implementation: Qwen-8B (Aligned)

Reasoning Enhancement

Optimize reasoning quality using reward signals tailored for multi-valid recommendation

Model or implementation: Qwen-8B (SFT)

Novel Architectural Elements

Think-Ahead Inference: Decouples inference into an offline stage (reasoning + prefix generation) and an online stage (constrained prefix completion)
Rollout-Beam Reward mechanism: Integrates beam search logic directly into the RL reward calculation
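
The offline/online split of Think-Ahead inference can be sketched as follows. This is an illustration under stated assumptions, not the deployed system: `ThinkAheadCache`, `complete_online`, the in-memory dictionary, and the `score` callable (a stand-in for whatever real-time model ranks completions) are hypothetical names.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Item = Tuple[int, int, int]   # assumed: one code per level, three levels
Prefix = Tuple[int, ...]      # leading codes of an item, produced offline


class ThinkAheadCache:
    """Hypothetical store for the offline stage's output (rationale + item prefixes)."""

    def __init__(self) -> None:
        self._store: Dict[str, Tuple[str, List[Prefix]]] = {}

    def put(self, user_id: str, rationale: str, prefixes: List[Prefix]) -> None:
        self._store[user_id] = (rationale, prefixes)

    def get(self, user_id: str) -> Tuple[str, List[Prefix]]:
        return self._store.get(user_id, ("", []))


def complete_online(
    prefixes: Sequence[Prefix],
    catalog: Sequence[Item],
    score: Callable[[Item], float],
    top_n: int,
) -> List[Item]:
    """Online stage: keep only catalog items that extend a cached prefix, then rank them."""
    candidates = [
        item for item in catalog
        if any(item[: len(p)] == p for p in prefixes)
    ]
    return sorted(candidates, key=score, reverse=True)[:top_n]


cache = ThinkAheadCache()
cache.put("u1", "user wants a relaxing comedy", [(1, 2)])   # offline, slow path
_, cached_prefixes = cache.get("u1")                        # online, fast lookup
catalog = [(1, 2, 3), (1, 2, 4), (1, 5, 0), (2, 2, 3)]
top = complete_online(cached_prefixes, catalog, score=lambda it: -it[2], top_n=1)
```

The expensive reasoning pass never runs on the request path; the online step only has to finish a short, constrained completion.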

Modeling

Base Model: Qwen-8B

Training Method: Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (GRPO)

Objective Functions:

Purpose: SFT for Reasoning Activation.

Formally: Minimize negative log-likelihood of generating rationale tau and target item s_{v_{n+1}} given the interaction history S_u.

Purpose: RL for Reasoning Enhancement.

Formally: Maximize the Rollout-Beam reward using GRPO, where Reward is 1 if target is in BeamSearch(P, K), else 0.
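
A minimal sketch of the binary Rollout-Beam reward described above. The toy `beam_search` and the `StepFn` interface (a function mapping an item-code prefix to next-code log-probabilities) are assumptions for illustration; in the actual method the next-code distribution would come from the LLM conditioned on the prompt and the sampled rationale.

```python
import math
from typing import Callable, Dict, List, Tuple

Item = Tuple[int, ...]
# Hypothetical interface: item-code prefix -> {next code: log-probability}.
StepFn = Callable[[Item], Dict[int, float]]


def beam_search(step_fn: StepFn, beam_width: int, depth: int) -> List[Item]:
    """Toy beam search over item codes; returns up to beam_width full-length items."""
    beams: List[Tuple[float, Item]] = [(0.0, ())]
    for _ in range(depth):
        expanded = [
            (score + logp, prefix + (code,))
            for score, prefix in beams
            for code, logp in step_fn(prefix).items()
        ]
        beams = sorted(expanded, key=lambda b: b[0], reverse=True)[:beam_width]
    return [prefix for _, prefix in beams]


def rollout_beam_reward(step_fn: StepFn, target: Item, beam_width: int) -> float:
    """Reward is 1 if the target item appears anywhere in the beam, else 0."""
    beam = beam_search(step_fn, beam_width, depth=len(target))
    return 1.0 if target in beam else 0.0


def toy_step(prefix: Item) -> Dict[int, float]:
    # Same next-code distribution at every step, purely for demonstration.
    return {0: math.log(0.6), 1: math.log(0.3), 2: math.log(0.1)}


hit = rollout_beam_reward(toy_step, target=(0, 0), beam_width=2)
miss = rollout_beam_reward(toy_step, target=(2, 2), beam_width=2)
```

Crediting any hit within the beam, rather than only the single sampled item, gives more rollouts a non-zero reward when several next items would be valid.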

Trainable Parameters: Full model parameters (after initial frozen warm-up of item tokens)

Training Data:

Amazon Beauty, Toys, Sports datasets for academic benchmarks
Kuaishou industrial logs for production

Key Hyperparameters:

vocab_extension: 24,576 new tokens (8,192 per level x 3 levels)
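
The vocabulary arithmetic can be checked with a short sketch. The contiguous per-level layout and `BASE_VOCAB_SIZE` are assumptions for illustration; the real offset would come from the base tokenizer.

```python
CODES_PER_LEVEL = 8192
NUM_LEVELS = 3
NEW_TOKENS = CODES_PER_LEVEL * NUM_LEVELS   # 8,192 x 3 = 24,576

# Hypothetical placeholder; use the base tokenizer's actual vocabulary size.
BASE_VOCAB_SIZE = 150_000


def itemic_token_id(level: int, code: int, base_vocab_size: int = BASE_VOCAB_SIZE) -> int:
    """Map (level, code) to a token id, assuming new tokens are appended level by level."""
    if not 0 <= level < NUM_LEVELS:
        raise ValueError(f"level must be in [0, {NUM_LEVELS})")
    if not 0 <= code < CODES_PER_LEVEL:
        raise ValueError(f"code must be in [0, {CODES_PER_LEVEL})")
    return base_vocab_size + level * CODES_PER_LEVEL + code


first_id = itemic_token_id(0, 0)
last_id = itemic_token_id(NUM_LEVELS - 1, CODES_PER_LEVEL - 1)
```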
beam_width_K: Not explicitly reported for training, K=5/10 for eval
production_training: Daily updates on 80 flagship GPUs, 20B tokens/day

Compute: 80 flagship GPUs for daily industrial incremental training

Comparison to Prior Work

vs. OneRec: OneRec-Think adds explicit textual reasoning layer and RL optimization
vs. ReaRec: OneRec-Think generates explicit human-readable rationales (In-Text) rather than latent vectors
vs. CoT-Rec [not cited in paper]: OneRec-Think integrates reasoning into the generation process end-to-end rather than as a separate prompting stage

Limitations

Industrial deployment requires complex architecture (Think-Ahead) to manage latency
Reliance on large-scale proprietary user logs for the full industrial performance gains
Reasoning generation adds computational cost compared to direct ID prediction models

Reproducibility

Code: https://github.com/wangshy31/OneRec-Think

Code and data are available in the repository above. The industrial dataset (Kuaishou) is proprietary; the Amazon datasets are public.

📊 Experiments & Results

Evaluation Setup

Sequential recommendation on Amazon datasets and online A/B testing

Benchmarks:

Amazon Beauty (Sequential Recommendation)
Amazon Toys (Sequential Recommendation)
Amazon Sports (Sequential Recommendation)

Metrics:

Recall@K (R@K)
NDCG@K (N@K)
APP Stay Time (Industrial)
Statistical methodology: Not explicitly reported in the paper
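
For reference, Recall@K and NDCG@K can be computed as in the sketch below. It assumes a single held-out target item per user, so the ideal DCG is 1; the summary does not state the paper's exact evaluation protocol.

```python
import math
from typing import Hashable, Sequence


def recall_at_k(ranked: Sequence[Hashable], target: Hashable, k: int) -> float:
    """1 if the target is among the top-k ranked items, else 0."""
    return 1.0 if target in ranked[:k] else 0.0


def ndcg_at_k(ranked: Sequence[Hashable], target: Hashable, k: int) -> float:
    """With one relevant item, NDCG reduces to 1 / log2(rank + 1) for a hit in the top k."""
    for rank, item in enumerate(ranked[:k], start=1):
        if item == target:
            return 1.0 / math.log2(rank + 1)
    return 0.0


ranked_items = ["a", "b", "c", "d"]
r_at_2 = recall_at_k(ranked_items, "b", k=2)
n_at_2 = ndcg_at_k(ranked_items, "b", k=2)
```

Dataset-level scores are the mean of these per-user values.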

Key Results

OneRec-Think consistently outperforms state-of-the-art baselines, including both traditional sequential models and recent generative approaches, across all three Amazon datasets.
Benchmark	Metric	Baseline	This Paper	Δ
Amazon Beauty	Recall@5	0.0701	0.0768	+0.0067
Amazon Toys	Recall@5	0.0725	0.0805	+0.0080
Amazon Sports	Recall@5	0.0461	0.0525	+0.0064

Ablation studies confirm that both Itemic Alignment and Reasoning components are essential for performance.
Benchmark	Metric	Baseline	This Paper	Δ
Amazon Beauty	Recall@5	0.0658	0.0768	+0.0110

Industrial A/B testing on Kuaishou shows a measurable engagement gain, reported as a relative change against the online baseline rather than as absolute values.
Benchmark	Metric	Baseline	This Paper	Δ
Kuaishou App	APP Stay Time	n/a	n/a	+0.159%

Experiment Figures

Case study of conversational recommendation where the model detects negative user emotion and shifts recommendation strategy.

Consistency analysis checking if the generated reasoning actually aligns with the recommended items.

Main Takeaways

Explicit reasoning improves recommendation accuracy over implicit generative models (OneRec) and latent reasoning models (ReaRec) on all three Amazon datasets
The combination of Itemic Alignment and Reasoning is synergistic; neither alone achieves SOTA performance
The 'Think-Ahead' architecture enables the deployment of complex reasoning models in latency-sensitive industrial environments with measurable business impact

📚 Prerequisite Knowledge

Prerequisites

Generative Retrieval / Generative Recommendation
Large Language Models (LLMs)
Reinforcement Learning (PPO/GRPO)
Chain-of-Thought (CoT)

Key Terms

itemic token: A discrete, semantic-rich token representation of an item (analogous to a word token) used to map items into the LLM's vocabulary

Chain-of-Thought (CoT): A prompting technique where the model generates intermediate reasoning steps before the final answer

GRPO: Group Relative Policy Optimization—an RL algorithm that normalizes rewards within a group of outputs for the same input to stabilize training without a value function
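
The group normalization in GRPO can be sketched as below. This shows only the advantage computation, not the full policy update, and the population standard deviation and `eps` value are illustrative choices rather than settings reported for OneRec-Think.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group of rollouts sampled for the same input.

    Each rollout's advantage is its reward minus the group mean, divided by the
    group standard deviation, so no learned value function is needed.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
all_miss = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
```

When every rollout in a group receives the same reward, all advantages are zero and the group contributes no learning signal, which is why sparse binary rewards are a problem for this kind of training.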

Rollout-Beam Reward: A novel reward function that assigns a reward of 1 to a reasoning path if the target item appears anywhere in the top-K beam search results of the generation phase, and 0 otherwise

Think-Ahead Architecture: An inference strategy where heavy reasoning and initial item tokens are generated offline, and only the final tokens are generated online to save latency