Kesha Ou, Chenghao Wu, Xiaolei Wang, Bowen Zheng, Wayne Xin Zhao, Weitao Li, Long Zhang, Sheng Chen, Ji-Rong Wen
Gaoling School of Artificial Intelligence, Renmin University of China
arXiv (2026)
Recommendation, Agent, RL, Memory, P13N
📝 Paper Summary
Agentic Recommendation, Generative Recommendation
RecPilot transforms recommender systems from passive item lists to proactive assistants by using agents to simulate user exploration and generating comprehensive, interpretable reports to support decision-making.
Core Problem
Traditional recommender systems function as passive tools that simply list items, forcing users to bear the heavy cognitive burden of exploring, clicking, reading details, and synthesizing information.
Why it matters:
Selecting items (especially high-priced goods) remains a labor-intensive endeavor for users despite algorithmic advances
The 'tool-based' paradigm limits user experience by assuming users must actively participate in every step of the decision process
Existing systems facilitate access to information but fail to orchestrate the complete recommendation process to satisfy underlying intents directly
Concrete Example: On current e-commerce platforms, a user who wants to buy a product must browse a list, click through multiple candidate items to check their specs, and mentally synthesize the results. RecPilot automates this by exploring on the user's behalf and presenting a summary report.
Key Novelty
Deep Research Paradigm for Recommendation (RecPilot)
Replaces the conventional 'list of items' interface with a 'comprehensive report' derived from autonomous agent exploration
Separates the recommendation process into two agents: one that simulates the tedious browsing/clicking process to find candidates, and one that synthesizes these into a structured, readable decision guide
Architecture
The overall architecture of RecPilot, illustrating the flow from user history to the final report via two agents.
Evaluation Highlights
Achieves up to a 52% improvement in Recall@5 in modeling observed user behaviors compared to baselines
Generates novel item recommendations (going beyond superficial preference matching) in 77% of cases compared with the best baseline
Breakthrough Assessment
8/10
Proposes a fundamental shift in RecSys interaction (reports vs. lists) backed by a complex multi-agent architecture. While the evaluation details in the snippet are sparse, the paradigm shift is significant.
⚙️ Technical Details
Problem Definition
Setting: Conditional generation of user exploration trajectories and subsequent research reports
Inputs: User historical behaviors X = [(a_1, v_1), ..., (a_t, v_t)] and contextual information
Outputs: A simulated exploration session Y and a comprehensive decision-support report R
Pipeline Flow
Group A: User Trajectory Simulation Agent (History -> Simulated Path)
Group B: Report Generation Agent (Simulated Path -> Final Report)
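The two-stage flow above can be sketched as a pair of pluggable callables. This is only an illustration of the data flow (X to Y to R): the names `Behavior`, `Report`, `run_pipeline`, `toy_simulate`, and `toy_report` are hypothetical, and the toy agents stand in for the trained simulation model and the LLM.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# (action, item) pairs, mirroring X = [(a_1, v_1), ..., (a_t, v_t)].
Behavior = Tuple[str, str]


@dataclass
class Report:
    """Hypothetical container for the decision-support report R."""
    candidates: List[str]
    text: str


def run_pipeline(
    history: List[Behavior],
    context: str,
    simulate: Callable[[List[Behavior], str], List[Behavior]],
    write_report: Callable[[List[Behavior], str], Report],
) -> Tuple[List[Behavior], Report]:
    """Group A produces the simulated session Y; Group B turns Y into R."""
    session = simulate(history, context)     # History -> Simulated Path
    report = write_report(session, context)  # Simulated Path -> Final Report
    return session, report


# Toy stand-ins for the two agents (placeholders, not the paper's models).
def toy_simulate(history: List[Behavior], context: str) -> List[Behavior]:
    # Pretend the user re-clicks the last two items seen, then buys the last one.
    recent = [item for _, item in history[-2:]]
    if not recent:
        return []
    return [("click", item) for item in recent] + [("purchase", recent[-1])]


def toy_report(session: List[Behavior], context: str) -> Report:
    items = sorted({item for _, item in session})
    return Report(candidates=items, text=f"{context}: compared {', '.join(items)}")
```

Keeping the agents behind plain function signatures reflects the decoupling the summary describes: either stage could be swapped without touching the other.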
System Modules
User Trajectory Simulation Agent
Simulates user exploration behaviors to discover relevant items without user effort
Model or implementation: Encoder-Decoder architecture (e.g., T5)
Self-Evolving Report Generation Agent
Synthesizes the simulated trajectory into a readable, structured report
Model or implementation: Large Language Model (LLM)
Novel Architectural Elements
Decoupling of item discovery (Simulation Agent) and presentation (Report Agent) into two distinct, specialized agents
Integration of a 'Rubric-Experience' dual-channel memory system for personalization that self-evolves via feedback
Use of 'process rewards' based on collaborative semantic consistency (Max-Sim pooling) to guide trajectory generation
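The summary does not spell out how the Max-Sim process reward is computed. One plausible reading, sketched below, scores each generated item by its highest cosine similarity to any item in the reference session, using pre-learned ID embeddings. The function names and the choice of cosine similarity are assumptions for illustration, not the paper's definition.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def max_sim_process_rewards(
    generated: List[str],
    reference: List[str],
    emb: Dict[str, Sequence[float]],
) -> List[float]:
    """One reward per generated item: its best match among reference items.

    Assumed form of Max-Sim pooling; `emb` maps item IDs to embeddings.
    """
    return [
        max((cosine(emb[g], emb[r]) for r in reference), default=0.0)
        for g in generated
    ]
```

Under this reading, a generated item earns partial credit for being close to something the user actually explored, even when it is not an exact match.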
Modeling
Base Model: T5 for trajectory simulation; an LLM for report generation (specific variant not named in the snippet)
Training Method: Supervised Learning (SL) followed by Reinforcement Learning (RL) with GRPO
Objective Functions:
Purpose (supervised stage): Pre-train the simulation agent to imitate historical behavior patterns.
Formally: Maximize the autoregressive log-likelihood of the target session Y given the history X
Purpose (RL stage): Optimize trajectory generation using group-relative rewards.
Formally: Maximize a clipped objective using normalized advantages derived from outcome, process, and constraint rewards
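Written out, the two objectives plausibly take the standard forms below. The notation is assumed rather than taken from the paper: \pi_\theta is the simulation policy, G the number of sessions sampled per history, r_i the total reward of the i-th sample (aggregating the outcome, process, and constraint rewards; the summary does not state how they are weighted), and \epsilon the clipping range.

```latex
% Supervised stage: autoregressive log-likelihood of the target session Y given history X
\mathcal{L}_{\mathrm{SL}}(\theta) = -\sum_{j=1}^{|Y|} \log \pi_\theta\left(y_j \mid y_{<j}, X\right)

% RL stage: group-normalized advantage for the i-th of G sampled sessions
\hat{A}_i = \frac{r_i - \operatorname{mean}(r_1, \dots, r_G)}{\operatorname{std}(r_1, \dots, r_G)}

% Clipped surrogate objective (any KL regularization term is omitted here)
\mathcal{J}_{\mathrm{RL}}(\theta) = \mathbb{E}\left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|Y_i|} \sum_{j=1}^{|Y_i|}
  \min\left( \rho_{i,j}\, \hat{A}_i,\ \operatorname{clip}\left(\rho_{i,j},\, 1-\epsilon,\, 1+\epsilon\right) \hat{A}_i \right) \right],
\qquad
\rho_{i,j} = \frac{\pi_\theta\left(y_{i,j} \mid y_{i,<j}, X\right)}{\pi_{\theta_{\mathrm{old}}}\left(y_{i,j} \mid y_{i,<j}, X\right)}
```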
Training Data:
User histories tokenized into 'exploration-to-decision' sessions
Consecutive behaviors of same type aggregated into action-prefix segments
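A minimal sketch of this preprocessing, assuming a session ends at a decision action such as a purchase (the summary does not define the boundary) and that each behavior is an (action, item) pair. The function names and the `decision_action` parameter are hypothetical.

```python
from itertools import groupby
from typing import List, Tuple

Behavior = Tuple[str, str]  # (action, item)


def split_sessions(
    behaviors: List[Behavior], decision_action: str = "purchase"
) -> List[List[Behavior]]:
    """Cut the history after each decision action (assumed session boundary)."""
    sessions: List[List[Behavior]] = []
    current: List[Behavior] = []
    for behavior in behaviors:
        current.append(behavior)
        if behavior[0] == decision_action:
            sessions.append(current)
            current = []
    if current:
        sessions.append(current)  # trailing exploration with no decision yet
    return sessions


def to_action_prefix_segments(session: List[Behavior]) -> List[Tuple[str, List[str]]]:
    """Merge runs of the same action into (action, [items...]) segments."""
    return [
        (action, [item for _, item in run])
        for action, run in groupby(session, key=lambda b: b[0])
    ]
```

For example, two consecutive clicks followed by a cart add and a purchase collapse into three segments, each introduced by its action type.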
Key Hyperparameters:
length_constraint_scaling_factor_mu: 0.2
sampling_strategy: Top-p sampling
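For reference, top-p (nucleus) sampling keeps the smallest set of highest-probability tokens whose cumulative mass reaches p, renormalizes, and samples from that set. The sketch below is a generic pure-Python version; the threshold p used in the paper is not given in the summary, and the function names are illustrative.

```python
import random
from typing import Dict, Optional


def top_p_filter(probs: Dict[str, float], p: float) -> Dict[str, float]:
    """Keep the top tokens whose cumulative probability reaches p, renormalized."""
    kept: Dict[str, float] = {}
    mass = 0.0
    for token, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[token] = prob
        mass += prob
        if mass >= p:
            break
    return {token: prob / mass for token, prob in kept.items()}


def top_p_sample(
    probs: Dict[str, float], p: float, rng: Optional[random.Random] = None
) -> str:
    """Draw one token from the renormalized nucleus."""
    rng = rng or random.Random()
    nucleus = top_p_filter(probs, p)
    tokens = list(nucleus)
    return rng.choices(tokens, weights=[nucleus[t] for t in tokens], k=1)[0]
```

In a GRPO-style setup, stochastic decoding of this kind is what yields several distinct trajectories per history for the group comparison.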
Comparison to Prior Work
vs. Traditional RecSys: RecPilot generates reports instead of lists
vs. Standard LLM Rec: RecPilot explicitly simulates a multi-step exploration trajectory using RL before generating the final output, rather than direct prediction
Limitations
Depends on the availability of high-level behaviors (purchases) for rubric optimization; data sparsity is addressed via low-level behavior mining but remains a challenge
Process rewards rely on pre-learned ID embeddings, which might limit generalization if the embedding space is poor
Simulation accuracy is critical; if the simulation agent deviates significantly from realistic behavior, the downstream report will be hallucinated or irrelevant
Reproducibility
Code availability is not provided in the text. The method relies on public recommendation datasets, but specific dataset names are not listed in the snippet. Artifacts like prompts or weights are not mentioned.
📊 Experiments & Results
Evaluation Setup
Simulation of user exploration and evaluation of generated recommendations
Benchmarks:
Public recommendation datasets (Sequential Recommendation / Exploration Simulation)
Metrics:
Recall@5
NDCG
Statistical methodology: Not explicitly reported in the paper
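The two metrics can be computed as follows. This sketch assumes binary relevance and a ranked list without duplicate items; the paper's exact evaluation protocol (for example, the NDCG cutoff) is not stated in the summary.

```python
import math
from typing import List, Set


def recall_at_k(ranked: List[str], relevant: Set[str], k: int = 5) -> float:
    """Fraction of relevant items that appear in the top-k of the ranking."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)


def ndcg_at_k(ranked: List[str], relevant: Set[str], k: int = 5) -> float:
    """Binary-relevance NDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(pos + 2)
        for pos, item in enumerate(ranked[:k])
        if item in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(pos + 2) for pos in range(ideal_hits))
    return dcg / idcg if idcg else 0.0
```

Recall@K ignores where a hit lands inside the top K, whereas NDCG discounts hits at lower positions, which is why the two are usually reported together.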
Main Takeaways
The user trajectory simulation agent effectively models user behavior, achieving up to 52% improvement in Recall@5 over baselines (specific values not reported in snippet).
The deep research paradigm enables the discovery of novel items (77% of cases) that go beyond simple preference matching, suggesting better exploration capabilities.
The rubric-experience dual-channel mechanism allows for self-evolution of user profiles without expensive model retraining.
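To illustrate how a profile can evolve without retraining, the sketch below keeps two channels: numeric rubric priorities over item features and a list of free-text experience notes. Everything here is hypothetical, including the class name, the additive update rule, the 0.5 default priority, and the prompt format; the summary does not describe the paper's actual update mechanism.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DualChannelMemory:
    """Hypothetical rubric + experience memory; no model weights are touched."""
    rubrics: Dict[str, float] = field(default_factory=dict)  # feature -> priority
    experiences: List[str] = field(default_factory=list)     # free-text lessons

    def update(
        self, item_features: List[str], liked: bool, note: str, step: float = 0.1
    ) -> None:
        """Nudge feature priorities up or down (clamped to [0, 1]) and log a note."""
        delta = step if liked else -step
        for feature in item_features:
            score = self.rubrics.get(feature, 0.5) + delta
            self.rubrics[feature] = min(1.0, max(0.0, score))
        self.experiences.append(note)

    def to_prompt(self, top_k: int = 3) -> str:
        """Render the strongest rubrics and latest notes for the report LLM."""
        ranked = sorted(self.rubrics.items(), key=lambda kv: kv[1], reverse=True)
        rubric_text = ", ".join(f"{name}={score:.2f}" for name, score in ranked[:top_k])
        recent = " | ".join(self.experiences[-top_k:])
        return f"Rubrics: {rubric_text}\nExperience: {recent}"
```

Because feedback only edits this memory, personalization changes take effect on the next report without a training run.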
Key Terms
RecPilot: The proposed multi-agent framework that autonomously explores item pools and generates decision-support reports
Deep Research: An information-seeking paradigm in which an agent autonomously interacts with systems to collect and synthesize information, inspired by OpenAI's Deep Research
GRPO: Group Relative Policy Optimization, an RL algorithm that computes advantages by normalizing rewards within a group of outputs sampled for the same input, which reduces variance without a separate value model
Process Reward: A reward signal given for intermediate steps (e.g., semantic consistency of a browsed item) rather than just the final outcome
Rubrics: Structured attributes used to characterize user preferences numerically (e.g., priority scores over item features)
Recall@5: A metric measuring the proportion of relevant items found in the top 5 recommendations
NDCG: Normalized Discounted Cumulative Gain, a measure of ranking quality that accounts for the position of relevant items
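As a small numeric companion to the GRPO entry above, the sketch below normalizes rewards within one sampled group and evaluates the clipped surrogate term for a single token. It uses the population standard deviation and a clip range of 0.2 as assumed defaults; the paper's settings are not given in the summary.

```python
import statistics
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward by the mean and std of its own sampled group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumed convention
    return [(r - mean) / (std + eps) for r in rewards]


def clipped_term(ratio: float, advantage: float, clip_eps: float = 0.2) -> float:
    """PPO-style clipped surrogate for one token: min(rho * A, clip(rho) * A)."""
    clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    return min(ratio * advantage, clipped_ratio * advantage)
```

Samples that beat their group average get positive advantages and are reinforced, while the clip keeps any single update from moving the policy too far from the one that generated the samples.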