RecoWorld: Building Simulated Environments for Agentic Recommender Systems

📝 Paper Summary

Recommender System Simulation, Agentic AI Evaluation

RecoWorld is a blueprint for a simulated environment where LLM-based users provide natural language instructions to recommender agents, enabling the training of interactive systems without risking real user retention.

Core Problem

Traditional offline evaluation metrics (like Recall) suffer from exposure bias, while online A/B tests are slow and risky for testing radically new agentic strategies.

Why it matters:

Existing offline metrics reinforce known patterns rather than discovering new user interests (exposure bias).
Agentic recommenders need to learn to follow instructions and plan over long horizons, capabilities that static datasets cannot evaluate.
Testing unproven agentic behaviors on real users risks degrading the user experience and causing churn.

Concrete Example: In a standard system, a bored user simply leaves. In RecoWorld, a simulated user about to churn issues an explicit instruction like 'show me more interesting content,' challenging the recommender to interpret this feedback and immediately adjust the list to retain the user.

Key Novelty

Dual-View Agentic Simulation Environment

Models the user not just as a click-generator but as an agent that reflects on dissatisfaction and issues natural language instructions (e.g., 'stop showing me sports') to the recommender.
Establishes a multi-turn feedback loop where the Recommender Agent must 'follow instructions' to maximize a long-term reward signal (session retention) rather than immediate clicks.

Evaluation Highlights

The paper presents a blueprint and architecture rather than empirical benchmark results.
Proposed evaluation compares simulated session trajectories against human annotator trajectories to validate realism.

Breakthrough Assessment

5/10

This is a position/blueprint paper proposing a novel environment design. While the concept of 'instruction-following simulation' is significant, the paper explicitly states it does not present experimental results.

⚙️ Technical Details

Problem Definition

Setting: Multi-turn interaction between a User Simulator and an Agentic Recommender System within a session

Inputs: List of k recommended items selected from candidate set I

Outputs: User action (click, skip, leave) and optional natural language instruction (if leaving)

Pipeline Flow

RecSys generates Item List
User Simulator perceives items (Text/Multimodal/Semantic)
User Simulator updates Memory (Dynamic & Session-wise)
User Simulator decides Action (Click/Skip/Leave)
If Leave: User generates Instruction
RecSys receives Feedback & Updates List
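
The six steps above can be sketched as a small interaction loop. This is a minimal illustration, not the paper's implementation: `ToyUser` is a rule-based stand-in for the LLM user simulator, `ToyRecommender` is a hypothetical recommender that only understands instructions of the form 'stop showing me <topic>', and the `(item_id, topic)` item schema is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional, Set, Tuple

Item = Tuple[str, str]  # (item_id, topic): a hypothetical item schema


@dataclass
class Turn:
    items: List[Item]
    action: str                 # "click", "skip", or "leave"
    instruction: Optional[str]  # set only when action == "leave"


class ToyUser:
    """Rule-based stand-in for the LLM user simulator (an assumption)."""

    def __init__(self, liked: str, disliked: str) -> None:
        self.liked = liked
        self.disliked = disliked
        self.memory: List[Item] = []  # session-wise memory of seen items

    def react(self, items: List[Item]) -> Tuple[str, Optional[str]]:
        self.memory.extend(items)               # update memory
        topics = [topic for _, topic in items]  # "perceive" the items
        if self.liked in topics:
            return "click", None
        if all(topic == self.disliked for topic in topics):
            # About to leave: explain why, so the recommender can adapt.
            return "leave", f"stop showing me {self.disliked}"
        return "skip", None


class ToyRecommender:
    """Serves the first k catalog items whose topic is not blocked."""

    def __init__(self, catalog: List[Item], k: int) -> None:
        self.catalog = catalog
        self.k = k
        self.blocked: Set[str] = set()

    def recommend(self) -> List[Item]:
        allowed = [it for it in self.catalog if it[1] not in self.blocked]
        return allowed[: self.k]

    def update(self, instruction: str) -> None:
        # Toy instruction parser: only one phrasing is understood.
        prefix = "stop showing me "
        if instruction.startswith(prefix):
            self.blocked.add(instruction[len(prefix):])


def run_session(user: ToyUser, rec: ToyRecommender, max_turns: int = 3) -> List[Turn]:
    trajectory: List[Turn] = []
    for _ in range(max_turns):
        items = rec.recommend()                  # RecSys generates item list
        if not items:
            break
        action, instruction = user.react(items)  # perceive, remember, decide
        trajectory.append(Turn(items, action, instruction))
        if action == "leave" and instruction is not None:
            rec.update(instruction)              # RecSys receives feedback
    return trajectory
```

With a catalog whose first two items are sports and a user who dislikes sports, the first turn ends in a 'leave' with an instruction, and the following turns contain no sports items.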

System Modules

Item Representation

Converts item content into a format the LLM simulator can process

Model or implementation: Configurable (text-only LLM, multimodal LLM such as Qwen-Omni, or Semantic ID backbone)

Dynamic Engagement Memory

Filters a long, potentially unbounded interaction history, retaining only the behaviors relevant to the current context

Model or implementation: Scoring function alpha_k = h(action, item, time | context)
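
One way to picture the scoring function h is as a product of an action weight, a recency decay, and a context-relevance term, with the top-scoring interactions kept as memory. The sketch below is purely illustrative: the summary gives only the signature of h, so the weights, the half-life, the topic-match relevance rule, and the names `Interaction`, `score`, and `select_memory` are all assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Interaction:
    action: str       # "click", "skip", or "leave"
    topic: str        # coarse item descriptor (hypothetical)
    age_hours: float  # time elapsed since the interaction


# Assumed weights: how informative each action type is about user interest.
ACTION_WEIGHT = {"click": 1.0, "skip": 0.3, "leave": 0.6}


def score(x: Interaction, context_topic: str, half_life_hours: float = 24.0) -> float:
    """Illustrative alpha = h(action, item, time | context)."""
    recency = 0.5 ** (x.age_hours / half_life_hours)       # exponential decay
    relevance = 1.0 if x.topic == context_topic else 0.2   # crude context match
    return ACTION_WEIGHT.get(x.action, 0.0) * recency * relevance


def select_memory(history: List[Interaction], context_topic: str, top_m: int = 3) -> List[Interaction]:
    """Keep the top_m interactions with the highest score for this context."""
    ranked = sorted(history, key=lambda x: score(x, context_topic), reverse=True)
    return ranked[:top_m]
```

In a real simulator the relevance term would more plausibly come from embedding similarity or from the LLM itself; the exact-match rule here only keeps the example self-contained.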

User Decision & Instruction

Simulates user reaction and generates feedback instructions if dissatisfied

Model or implementation: LLM-based Simulator

Novel Architectural Elements

Dual-view loop where the environment (User) emits natural language instructions, not just scalar rewards/actions
Integration of reflective instruction generation: users explain *why* they are leaving to guide the agent

Modeling

Base Model: Varies (e.g., Qwen3-Omni, Gemini-2.5-Pro mentioned as candidates for Multimodal modeling)

Comparison to Prior Work

vs. OpenAI Gym: Gym offers a generic reinforcement learning environment interface, whereas RecoWorld targets instruction-following for RecSys and returns natural language feedback
vs. Generative Agents: RecoWorld focuses on high-frequency consumption behavior (RecSys) rather than social/daily life simulation
vs. Traditional RecSys Simulators (e.g., RecSim [not cited in paper]): RecoWorld uses LLMs for behavior generation and instruction giving, whereas traditional simulators use probabilistic graphical models or matrix factorization

Limitations

The paper does not provide experimental results or benchmark data to validate the simulator's fidelity.
LLM-based simulation can be computationally expensive and slow compared to mathematical user models.
The fidelity of simulated users depends heavily on the underlying LLM's capability to understand complex multimodal content (e.g., sarcasm in videos).

Reproducibility

The paper is a blueprint/vision paper. No code URL is provided. The text mentions 'We introduce RecoWorld' but does not link to a repository or provide specific implementation details like prompt templates or trained weights.

📊 Experiments & Results

Evaluation Setup

Proposed validation involves comparing simulated user trajectories with human annotator trajectories given the same initial recommendations.
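
The summary does not say how simulated and human trajectories would be compared. As one possible reading, the sketch below scores two action sequences with a normalized edit distance, where 1.0 means identical and 0.0 means nothing aligns. The choice of edit distance and the function names are assumptions, not the paper's method.

```python
from typing import List, Sequence


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two action sequences."""
    prev: List[int] = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def trajectory_similarity(simulated: Sequence[str], human: Sequence[str]) -> float:
    """Return 1 - normalized edit distance, a value in [0, 1]."""
    longest = max(len(simulated), len(human))
    if longest == 0:
        return 1.0  # two empty trajectories are trivially identical
    return 1.0 - edit_distance(simulated, human) / longest
```

For example, a simulated session of skip, click, leave compared with a human session of skip, click, click, leave differs by one inserted click, giving a similarity of 0.75.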

Benchmarks:

Custom Simulation Scenarios (User Simulation / Session Retention) [New]

Metrics:

Session Retention (Time Spent)
Daily Active Users (DAU)
Trajectory similarity to human annotators
Statistical methodology: Not explicitly reported in the paper

Main Takeaways

The paper is a proposal (blueprint) and does not contain quantitative experimental results.
It argues that current offline metrics (NDCG, Recall) incentivize exploitation, whereas the proposed simulator encourages exploration by optimizing for long-term retention.
The design allows for 'instruction-following' evaluation: measuring if a recommender can successfully re-engage a user after receiving explicit negative feedback.
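
The instruction-following evaluation in the last takeaway could be measured, for instance, as a re-engagement rate: the share of 'leave' turns carrying an instruction that are followed by a click within a short window. The metric definition, the `window` parameter, and the `(action, instruction)` turn format below are illustrative assumptions rather than something the paper specifies.

```python
from typing import List, Optional, Tuple

TurnRecord = Tuple[str, Optional[str]]  # (action, instruction), a hypothetical format


def reengagement_rate(trajectory: List[TurnRecord], window: int = 1) -> Optional[float]:
    """Fraction of instructed 'leave' turns followed by a click within `window` turns.

    Returns None when the trajectory contains no instructed 'leave' turn.
    A 'leave' with no later turn counts as a failed re-engagement.
    """
    attempts = 0
    successes = 0
    for i, (action, instruction) in enumerate(trajectory):
        if action == "leave" and instruction is not None:
            attempts += 1
            following = trajectory[i + 1 : i + 1 + window]
            if any(next_action == "click" for next_action, _ in following):
                successes += 1
    if attempts == 0:
        return None
    return successes / attempts
```

Counting a final 'leave' as a failure is a deliberate choice here: if the session ends right after the instruction, the recommender did not retain the user.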

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (Environment, Agent, Reward)
Recommender Systems (Recall, NDCG)
Large Language Models (Reasoning, Context)

Key Terms

Agentic Recommender System: A recommender system that acts as an autonomous agent, capable of reasoning, planning, and actively interacting with users (e.g., following instructions) rather than just ranking items.

Exposure Bias: A bias toward items that were shown to users in the logged training data, which causes potentially relevant items that were never shown to be ignored.

Semantic ID: A method of representing items where content (video/audio/text) is encoded into a structured sequence of discrete IDs, capturing semantic meaning in a compact form.

Instruction-Following Recommender: A system designed to dynamically update its recommendation strategy based on explicit natural language feedback or instructions from the user.

RAG: Retrieval-Augmented Generation—fetching relevant data (like user history) to prompt an LLM.

NDCG: Normalized Discounted Cumulative Gain—a measure of ranking quality that accounts for the position of relevant items in a list.

Recall@N: The proportion of all relevant items that appear in the top N recommendations.

Gym: A standard interface for reinforcement learning environments developed by OpenAI.

VLM: Vision-Language Model—an AI model capable of processing and understanding both images/video and text.
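
As a worked illustration of the Recall@N and NDCG entries above, the following sketch computes both for a single ranked list with binary relevance. The function names are hypothetical, and the code assumes the ranked list contains no duplicate items.

```python
import math
from typing import List, Set


def recall_at_n(ranked: List[str], relevant: Set[str], n: int) -> float:
    """Share of all relevant items that appear in the top n of the ranking."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:n] if item in relevant)
    return hits / len(relevant)


def ndcg_at_n(ranked: List[str], relevant: Set[str], n: int) -> float:
    """Binary-relevance NDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, item in enumerate(ranked[:n], start=1)
        if item in relevant
    )
    # The ideal ranking places every relevant item at the top positions.
    ideal_hits = min(len(relevant), n)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0
```

Both metrics score a ranking only against items already known to be relevant from logged data, which is why they are tied to the exposure bias described earlier.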