Tianjun Wei, Huizhong Guo, Yingpeng Du, Zhu Sun, Huang Chen, Dongxia Wang, Jie Zhang
Nanyang Technological University, Singapore
Zhejiang University, Hangzhou, China
Singapore University of Technology and Design, Singapore
arXiv (2025)
Recommendation, Agent, Memory, P13N
📝 Paper Summary
User Simulation for Recommender Systems, LLM Alignment
UserMirrorer aligns user simulators with human preferences by distilling high-quality training data from massive, noisy user feedback using uncertainty estimation and LLM-generated decision rationales.
Core Problem
Existing LLM-based user simulators struggle with task alignment because raw user feedback is ambiguous (lacks reasoning) and noisy, while powerful LLMs are too computationally expensive for large-scale simulation.
Why it matters:
Online testing (A/B testing) is slow (weeks/months) and raises privacy concerns, creating a need for accurate offline simulators
Raw behavioral logs (e.g., clicks) do not explain *why* a user acted, preventing models from learning the underlying decision process
Directly fine-tuning on massive, noisy feedback data is inefficient and can degrade model performance due to low-quality samples
Concrete Example: A raw log shows that a user watched 'Crimson Tide' but does not say whether they chose it for the actor (Denzel Washington) or the genre (Thriller). A standard simulator might guess randomly. UserMirrorer generates a specific rationale (e.g., 'User prefers suspenseful military movies') to teach the simulator the correct reasoning path.
Key Novelty
UserMirrorer Framework (Uncertainty-based Data Distillation)
Transforms raw user feedback into simulation scenes and uses a powerful 'Teacher' LLM to generate explicit decision-making processes (rationales) based on the Engel-Kollat-Blackwell (EKB) consumer behavior model
Distills training data by selecting 'challenging' samples where the epistemic uncertainty gap between the Teacher (strong LLM) and Student (weak LLM) is large, ensuring the Student learns from cases it finds difficult
Filters noise by verifying that the Teacher's generated reasoning leads to the actual ground-truth user action before adding it to the training set
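The noise-filtering step above can be read as a form of rejection sampling. The following is a minimal sketch under stated assumptions: `generate` is a hypothetical callable standing in for the Teacher LLM that returns a `(rationale, action)` pair, and the default of 10 tries only mirrors the "10 decision processes per scene" mentioned in this summary. It is not the authors' implementation.

```python
from typing import Callable, Optional, Tuple


def rejection_sample(
    generate: Callable[[], Tuple[str, str]],
    true_action: str,
    max_tries: int = 10,  # assumption: one try per sampled decision process
) -> Optional[Tuple[str, str]]:
    """Return the first (rationale, action) whose action matches the log.

    `generate` is a hypothetical stand-in for one Teacher LLM call.
    Returns None when no draft reproduces the ground-truth action,
    in which case the scene would be dropped from the training set.
    """
    for _ in range(max_tries):
        rationale, action = generate()
        if action == true_action:
            return rationale, action
    return None
```

A scene whose call returns `None` contributes nothing to the distilled data, so rationales that contradict the logged behavior never reach the Student.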
Architecture
The construction of a 'User Simulation Scene' from raw data. Shows how User Profile and Interaction History are converted into a 'Memory' text block, and how items are formed into an 'Exposure' list.
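As a rough illustration of the scene construction described above, the sketch below renders a profile and history into a 'Memory' text block and a list of items into a lettered 'Exposure' list. The function names, headings, and lettering scheme are assumptions made for illustration; the paper's actual templates (stated to be in its Appendix) are not reproduced here.

```python
from typing import Dict, List


def render_memory(profile: Dict[str, str], history: List[str]) -> str:
    """Format profile fields and past interactions as one text block."""
    lines = ["User Profile:"]
    lines += [f"- {key}: {value}" for key, value in profile.items()]
    lines.append("Interaction History:")
    lines += [f"{i}. {title}" for i, title in enumerate(history, start=1)]
    return "\n".join(lines)


def render_exposure(items: List[str]) -> str:
    """Label each candidate item with a letter: [A], [B], ..."""
    return "\n".join(
        f"[{chr(ord('A') + i)}] {title}" for i, title in enumerate(items)
    )


def build_scene(profile: Dict[str, str], history: List[str], items: List[str]) -> str:
    """Combine Memory and Exposure into a single prompt string."""
    return (
        render_memory(profile, history)
        + "\n\nExposure:\n"
        + render_exposure(items)
    )
```

Keeping Memory and Exposure as separate renderers makes it straightforward to swap in a per-dataset template without touching the rest of the pipeline.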
Evaluation Highlights
Fine-tuned simulators are reported to align more closely with human preferences than non-fine-tuned baselines (numeric results are not included in the provided text snippet)
Stronger base models (e.g., Qwen-2.5-32B) inherently align better with user behavior than weaker ones (Llama-3.2-3B) before fine-tuning
Successfully distills data from 8 diverse domains (movies, books, news, etc.) into a unified format for simulator training
Breakthrough Assessment
7/10
Addresses a critical bottleneck in Recommender Systems (data ambiguity/noise) with a logically sound uncertainty-based distillation method. However, the score is tentative as quantitative performance metrics were not available in the provided text.
⚙️ Technical Details
Problem Definition
Setting: Impression-aware user behavior simulation
Inputs: User Memory M (profile + history) and Exposure List L (set of items)
Outputs: Categorical distribution over Action Space A (interaction probabilities for items in L)
Pipeline Flow
Scene Construction (Memory + Exposure)
Decision Process Generation (LLM)
Action Prediction (LLM)
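The three stages above, together with the problem definition (Memory M and Exposure List L in, a categorical distribution over actions out), can be sketched as follows. Both LLM stages are passed in as hypothetical callables (`generate_rationale`, `score_actions`), and turning per-item scores into probabilities with a softmax is an assumption of this sketch, not something the summary specifies.

```python
import math
from typing import Callable, Dict, List


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def simulate(
    memory: str,
    exposure: List[str],
    generate_rationale: Callable[[str], str],          # hypothetical LLM stage
    score_actions: Callable[[str, str], List[float]],  # hypothetical LLM stage
) -> Dict[str, float]:
    """Scene construction -> decision process -> action distribution."""
    # Stage 1: build the scene prompt from Memory and Exposure.
    prompt = memory + "\n\nExposure:\n" + "\n".join(exposure)
    # Stage 2: generate the decision-making process (rationale).
    rationale = generate_rationale(prompt)
    # Stage 3: score each exposed item given the prompt and rationale.
    scores = score_actions(prompt, rationale)
    if len(scores) != len(exposure):
        raise ValueError("expected one score per exposed item")
    return dict(zip(exposure, softmax(scores)))
```

With stub callables that return equal scores for two items, `simulate` yields a probability of 0.5 for each, which is a quick sanity check before plugging in a real model.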
System Modules
Scene Constructor
Converts user logs into a text prompt containing profile, history, and current exposure list
Model or implementation: Template-based formatter
Decision Process Generator
Generates a step-by-step rationale for the user's choice based on the EKB model (Stimulus -> Knowledge -> Evaluation)
Model or implementation: Llama-3.2-3B-Instruct (Fine-tuned)
Action Predictor
Predicts the final user interaction based on the generated decision process
Model or implementation: Llama-3.2-3B-Instruct (Fine-tuned)
Novel Architectural Elements
Integration of EKB-based decision-making rationale generation as an intermediate step between input and action
Use of uncertainty-based distillation to selectively train on samples where the student model is uncertain but the teacher is confident
Modeling
Base Model: Llama-3.2-3B-Instruct (Student User Simulator)
Training Method: Supervised Fine-Tuning (SFT) on distilled dataset
Adaptation: Full fine-tuning (implied)
Trainable Parameters: Not reported in the paper
Training Data:
Source: 8 datasets (Movies, Books, News, etc.)
Teacher Model: Qwen-2.5-32B-Instruct
Distillation: Generate 10 decision processes per scene; compute epistemic uncertainty; select scenes with high Teacher-Student uncertainty gap; filter via rejection sampling
vs. RecSim: UserMirrorer uses LLMs to generate explanatory rationales/reasoning, not just actions
vs. Standard LLM Agents (RecMind): UserMirrorer fine-tunes a lightweight model on real user feedback rather than relying on frozen large models, reducing inference cost
vs. Direct Fine-tuning [not cited in paper]: UserMirrorer filters data based on epistemic uncertainty to select only 'challenging' and 'clean' samples rather than using all noisy logs
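The uncertainty-based selection described under Training Data can be illustrated with a standard entropy decomposition over the action distributions obtained from several sampled decision processes. This is a sketch under assumptions: the summary does not give the exact formula for the Teacher-Student gap, so the gap is taken here as Student epistemic uncertainty minus Teacher epistemic uncertainty, and all function names are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Dist = List[float]  # one categorical distribution over the exposure list


def entropy(p: Dist) -> float:
    """Shannon entropy in nats, skipping zero-probability entries."""
    return -sum(x * math.log(x) for x in p if x > 0)


def decompose(dists: List[Dist]) -> Tuple[float, float, float]:
    """Return (total, aleatoric, epistemic) for one scene.

    total     = entropy of the averaged distribution
    aleatoric = average entropy of the individual distributions
    epistemic = total - aleatoric (disagreement between samples)
    """
    k = len(dists)
    mean = [sum(col) / k for col in zip(*dists)]
    total = entropy(mean)
    aleatoric = sum(entropy(d) for d in dists) / k
    return total, aleatoric, total - aleatoric


def select_scenes(
    teacher: Dict[str, List[Dist]],
    student: Dict[str, List[Dist]],
    top_k: int,
) -> List[str]:
    """Pick the scenes with the largest Student-minus-Teacher epistemic gap."""
    gaps = {
        sid: decompose(student[sid])[2] - decompose(teacher[sid])[2]
        for sid in teacher
    }
    return sorted(gaps, key=gaps.get, reverse=True)[:top_k]
```

Under this reading, a scene where the Student's sampled predictions disagree with each other while the Teacher's stay consistent gets a large gap and is kept as a 'challenging' training example.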
Limitations
Reliance on the quality of the 'Teacher' LLM (Qwen-2.5-32B) for generating ground-truth rationales
Constructed exposure lists (hybrid sampling) are approximations of what users actually saw, as real exposure logs are often unavailable
The dataset construction process is computationally intensive because multiple decision processes must be generated per sample for uncertainty estimation
Code and dataset are to be released at https://github.com/UserMirrorer/UserMirrorer. Templates for all datasets are in the Appendix. Specific fine-tuning hyperparameters (learning rate, batch size) are referenced as being in Appendix C.2 but are not included in the provided text.
📊 Experiments & Results
Evaluation Setup
Offline simulation of user interaction with recommender system exposure lists
Benchmarks:
MIND (News Recommendation)
7 other domains (Movies, Books, etc.; the individual dataset names are not listed in the text)
Metrics:
Accuracy (matching real user behavior)
Uncertainty (Epistemic vs Aleatoric)
Statistical methodology: Not explicitly reported in the paper
Experiment Figures
Comparison of uncertainty distributions between a Strong Model (Qwen-2.5-32B) and a Weak Model (Llama-3.2-3B).
Main Takeaways
Quantitative results were not included in the provided text snippet, but preliminary experiments (per the description of Figure 2) indicate that stronger base LLMs (e.g., Qwen-2.5-32B) naturally align better with user preferences than weaker ones (Llama-3.2-3B).
Fine-tuning on user feedback significantly improves the alignment of weaker models (Llama-3.2-3B) with real user behavior.
Weaker LLMs exhibit higher epistemic uncertainty in complex scenes, validating the use of uncertainty decomposition to identify 'challenging' samples for training.
📚 Prerequisite Knowledge
Prerequisites
Recommender Systems (RS) basics (exposure, interaction history)
Large Language Models (LLMs) and fine-tuning
Bayesian Uncertainty (Aleatoric vs. Epistemic)
Key Terms
RS: Recommender Systems—algorithms that suggest items to users
User Simulator: An AI agent designed to mimic user behavior for offline testing of recommender systems
Epistemic Uncertainty: Uncertainty stemming from the model's lack of knowledge or capability (can be reduced with more data/training), as opposed to data noise
Aleatoric Uncertainty: Uncertainty inherent in the data itself (ambiguity, noise) that cannot be reduced by better modeling
Decision-making Process: An explicit text rationale generated by the LLM explaining why an item was chosen (e.g., analyzing features, weighing options)
Exposure List: The specific list of items presented to a user at a given time, from which they made a choice
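For readers who want the two uncertainty terms above in symbols, one widely used entropy-based decomposition is shown below. It is given as background, not as the paper's stated formula: here $x$ is the input (the scene), $a$ the predicted action, and $\theta$ a generic source of model variation (for example, different sampled decision processes); these symbols are assumptions of this sketch.

```latex
\underbrace{H\!\left[\, \mathbb{E}_{\theta}\, p(a \mid x, \theta) \,\right]}_{\text{total uncertainty}}
\;=\;
\underbrace{\mathbb{E}_{\theta}\, H\!\left[\, p(a \mid x, \theta) \,\right]}_{\text{aleatoric}}
\;+\;
\underbrace{I(a ; \theta \mid x)}_{\text{epistemic}}
```

The epistemic term is a mutual information, so it is zero when every $\theta$ yields the same prediction and grows as the predictions disagree.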