Mirroring Users: Towards Building Preference-aligned User Simulator with User Feedback in Recommendation

📝 Paper Summary

User Simulation for Recommender Systems
LLM Alignment

UserMirrorer aligns user simulators with human preferences by distilling high-quality training data from massive, noisy user feedback using uncertainty estimation and LLM-generated decision rationales.

Core Problem

Existing LLM-based user simulators struggle with task alignment because raw user feedback is ambiguous (lacks reasoning) and noisy, while powerful LLMs are too computationally expensive for large-scale simulation.

Why it matters:

Online testing (A/B testing) is slow (weeks/months) and raises privacy concerns, creating a need for accurate offline simulators
Raw behavioral logs (e.g., clicks) do not explain *why* a user acted, preventing models from learning the underlying decision process
Directly fine-tuning on massive, noisy feedback data is inefficient and can degrade model performance due to low-quality samples

Concrete Example: A raw log shows that a user watched 'Crimson Tide' but does not say whether they chose it for the actor (Denzel Washington) or the genre (Thriller). A standard simulator might guess randomly. UserMirrorer generates a specific rationale (e.g., 'User prefers suspenseful military movies') to teach the simulator the correct reasoning path.

Key Novelty

UserMirrorer Framework (Uncertainty-based Data Distillation)

Transforms raw user feedback into simulation scenes and uses a powerful 'Teacher' LLM to generate explicit decision-making processes (rationales) based on the EKB consumer behavior model
Distills training data by selecting 'challenging' samples where the epistemic uncertainty gap between the Teacher (strong LLM) and Student (weak LLM) is large, ensuring the Student learns from cases it finds difficult
Filters noise by verifying that the Teacher's generated reasoning leads to the actual ground-truth user action before adding it to the training set
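The noise filter in the last item can be sketched as a simple rejection step. This is a minimal illustration, not the authors' code: TeacherSample, keep_consistent, and the field names are hypothetical, and it assumes each Teacher generation has already been parsed into a rationale and a predicted item.

```python
from dataclasses import dataclass


@dataclass
class TeacherSample:
    """One Teacher generation for a scene (hypothetical structure)."""
    scene_id: str
    rationale: str        # generated decision process
    predicted_item: str   # action the rationale leads to


def keep_consistent(samples, ground_truth):
    """Keep only generations whose predicted action matches the logged action.

    ground_truth maps scene_id -> item the real user actually chose.
    """
    return [s for s in samples if ground_truth.get(s.scene_id) == s.predicted_item]


samples = [
    TeacherSample("s1", "Prefers military thrillers.", "Crimson Tide"),
    TeacherSample("s1", "Wants a light comedy.", "Clueless"),
    TeacherSample("s2", "Follows a favourite author.", "Dune"),
]
ground_truth = {"s1": "Crimson Tide", "s2": "Neuromancer"}
kept = keep_consistent(samples, ground_truth)
```

Only the first generation survives here: the second contradicts the log for s1, and the Teacher's answer for s2 does not match what the user chose.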

Architecture

The construction of a 'User Simulation Scene' from raw data. Shows how User Profile and Interaction History are converted into a 'Memory' text block, and how items are formed into an 'Exposure' list.

Evaluation Highlights

Fine-tuned simulators are reported to align better with human preferences than non-fine-tuned baselines (numeric results were not included in the provided text snippet)
Stronger base models (e.g., Qwen-2.5-32B) inherently align better with user behavior than weaker ones (Llama-3.2-3B) before fine-tuning
Successfully distills data from 8 diverse domains (movies, books, news, etc.) into a unified format for simulator training

Breakthrough Assessment

7/10

Addresses a critical bottleneck in Recommender Systems (data ambiguity/noise) with a logically sound uncertainty-based distillation method. However, the score is tentative as quantitative performance metrics were not available in the provided text.

⚙️ Technical Details

Problem Definition

Setting: Impression-aware user behavior simulation

Inputs: User Memory M (profile + history) and Exposure List L (set of items)

Outputs: Categorical distribution over Action Space A (interaction probabilities for items in L)
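Written out, the task is to estimate a distribution over the items in the exposure list. The notation below is a sketch consistent with the definitions above; the parameter symbol theta and the indexed actions a_i are introduced here for illustration and may differ from the paper's notation.

```latex
\[
  p_\theta(a \mid M, L), \qquad a \in A = \{a_1, \dots, a_{|L|}\}, \qquad
  \sum_{i=1}^{|L|} p_\theta(a_i \mid M, L) = 1,
\]
```

Here a_i is assumed to mean "the user interacts with the i-th item of L".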

Pipeline Flow

Scene Construction (Memory + Exposure)
Decision Process Generation (LLM)
Action Prediction (LLM)
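The three stages above can be sketched as one function. This is a minimal illustration under stated assumptions: simulate, generate, and predict are hypothetical names, the prompt layout is invented for the example, and both LLM calls are replaced by stubs so the sketch runs on its own.

```python
from typing import Callable, Dict, List


def simulate(
    memory: str,
    exposure: List[str],
    generate: Callable[[str], str],
    predict: Callable[[str, List[str]], Dict[str, float]],
) -> Dict[str, float]:
    """Run the three stages: scene -> decision process -> action distribution."""
    # 1. Scene construction: combine memory and exposure into one prompt.
    options = "\n".join(f"{i + 1}. {item}" for i, item in enumerate(exposure))
    scene = f"Memory:\n{memory}\n\nExposure:\n{options}"
    # 2. Decision process generation (an LLM call, stubbed below).
    rationale = generate(scene)
    # 3. Action prediction conditioned on the scene plus the rationale.
    return predict(scene + "\n\nDecision process:\n" + rationale, exposure)


# Stub models so the sketch runs without an LLM.
def fake_generate(scene: str) -> str:
    return "The user tends to pick thrillers."


def fake_predict(prompt: str, exposure: List[str]) -> Dict[str, float]:
    # Uniform distribution as a placeholder for model probabilities.
    return {item: 1.0 / len(exposure) for item in exposure}


dist = simulate("Likes thrillers.", ["Crimson Tide", "Clueless"], fake_generate, fake_predict)
```

Passing the two model calls in as callables keeps the flow testable and makes it easy to swap a Teacher for a Student model.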

System Modules

Scene Constructor

Converts user logs into a text prompt containing profile, history, and current exposure list

Model or implementation: Template-based formatter
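A template-based formatter of this kind might look like the sketch below. The section labels, the lettered option format, and the function names build_memory and build_exposure are assumptions made for illustration; the paper's actual templates are in its Appendix and are not reproduced here.

```python
def build_memory(profile: dict, history: list) -> str:
    """Render profile and interaction history as a 'Memory' text block."""
    profile_lines = [f"{key}: {value}" for key, value in profile.items()]
    history_lines = [f"- {title}" for title in history]
    return "\n".join(["## Profile", *profile_lines, "## History", *history_lines])


def build_exposure(items: list) -> str:
    """Render candidate items as a lettered 'Exposure' list (up to 26 items)."""
    return "\n".join(f"[{chr(ord('A') + i)}] {title}" for i, title in enumerate(items))


memory = build_memory({"age": "25-34"}, ["Top Gun", "The Hunt for Red October"])
exposure = build_exposure(["Crimson Tide", "Clueless"])
scene = memory + "\n## Exposure\n" + exposure
```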

Decision Process Generator

Generates a step-by-step rationale for the user's choice based on the EKB model (Stimulus → Knowledge → Evaluation)

Model or implementation: Llama-3.2-3B-Instruct (Fine-tuned)

Action Predictor

Predicts the final user interaction based on the generated decision process

Model or implementation: Llama-3.2-3B-Instruct (Fine-tuned)
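The summary does not say how the predictor's output is turned into probabilities. One common approach, assumed here purely for illustration, is to read the model's log-probability for each option label and normalise them with a softmax; option_distribution and the example scores are hypothetical.

```python
import math


def option_distribution(option_logprobs: dict) -> dict:
    """Softmax-normalise per-option log-probabilities into a distribution."""
    peak = max(option_logprobs.values())
    # Subtract the maximum before exponentiating for numerical stability.
    weights = {k: math.exp(v - peak) for k, v in option_logprobs.items()}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}


dist = option_distribution({"A": -0.2, "B": -2.3, "C": -3.0})
choice = max(dist, key=dist.get)  # most likely simulated action
```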

Novel Architectural Elements

Integration of EKB-based decision-making rationale generation as an intermediate step between input and action
Use of uncertainty-based distillation to selectively train on samples where the student model is uncertain but the teacher is confident

Modeling

Base Model: Llama-3.2-3B-Instruct (Student User Simulator)

Training Method: Supervised Fine-Tuning (SFT) on distilled dataset

Adaptation: Full fine-tuning (implied)

Trainable Parameters: Not reported in the paper

Training Data:

Source: 8 datasets (Movies, Books, News, etc.)
Teacher Model: Qwen-2.5-32B-Instruct
Distillation: Generate 10 decision processes per scene; compute epistemic uncertainty; select scenes with high Teacher-Student uncertainty gap; filter via rejection sampling
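The selection step can be sketched as ranking scenes by the uncertainty gap. This assumes per-scene epistemic uncertainty estimates are already available for both models; select_challenging and the top_fraction cut-off are hypothetical, since the summary does not give the paper's selection threshold.

```python
def select_challenging(teacher_eu, student_eu, top_fraction=0.5):
    """Rank scenes by (Student - Teacher) epistemic uncertainty, keep the top share.

    teacher_eu / student_eu map scene_id -> epistemic uncertainty estimate.
    top_fraction is a hypothetical knob, not a value from the paper.
    """
    gaps = {sid: student_eu[sid] - teacher_eu[sid] for sid in teacher_eu if sid in student_eu}
    ranked = sorted(gaps, key=gaps.get, reverse=True)
    keep = max(1, round(len(ranked) * top_fraction))
    return ranked[:keep]


teacher_eu = {"s1": 0.05, "s2": 0.40, "s3": 0.10, "s4": 0.02}
student_eu = {"s1": 0.60, "s2": 0.45, "s3": 0.15, "s4": 0.50}
selected = select_challenging(teacher_eu, student_eu)
```

Scenes s1 and s4 are kept because the Student is uncertain where the Teacher is confident; s2 is dropped because both models are similarly uncertain, which suggests the scene itself is ambiguous.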

Key Hyperparameters:

exposure_list_length_K: 32 (Top-K sampling)
final_exposure_N: Uniformly sampled from 2 to 12
num_clarifications_N: 10 (for uncertainty estimation)
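A rough sketch of how the two exposure hyperparameters could combine is given below. The summary only states K = 32 and a final length drawn uniformly from 2 to 12; how candidates are ranked, and the rule that the logged item is always included, are assumptions made for this example, and sample_exposure is a hypothetical name.

```python
import random


def sample_exposure(ranked_candidates, clicked_item, rng, k=32, n_min=2, n_max=12):
    """Build one exposure list containing the clicked item plus sampled distractors."""
    # Restrict distractors to the top-K candidates, excluding the clicked item.
    pool = [c for c in ranked_candidates[:k] if c != clicked_item]
    # Draw the final list length uniformly from [n_min, n_max].
    n = rng.randint(n_min, n_max)
    distractors = rng.sample(pool, min(n - 1, len(pool)))
    exposure = distractors + [clicked_item]
    rng.shuffle(exposure)  # avoid leaking the answer through its position
    return exposure


rng = random.Random(0)
candidates = [f"item_{i}" for i in range(100)]
exposure = sample_exposure(candidates, "item_3", rng)
```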

Compute: Not reported in the paper

Comparison to Prior Work

vs. RecSim: UserMirrorer uses LLMs to generate explanatory rationales/reasoning, not just actions
vs. Standard LLM Agents (RecMind): UserMirrorer fine-tunes a lightweight model on real user feedback rather than relying on frozen large models, reducing inference cost
vs. Direct Fine-tuning [not cited in paper]: UserMirrorer filters data based on epistemic uncertainty to select only 'challenging' and 'clean' samples rather than using all noisy logs

Limitations

Reliance on the quality of the 'Teacher' LLM (Qwen-2.5-32B) for generating the rationales used as training targets
Constructed exposure lists (hybrid sampling) are approximations of what users actually saw, as real exposure logs are often unavailable
The dataset construction process is computationally intensive due to generating multiple clarifications per sample for uncertainty estimation

Reproducibility

Code: https://github.com/UserMirrorer/UserMirrorer

The code and dataset are to be released at the repository above. Templates for all datasets are in the Appendix. Specific fine-tuning hyperparameters (learning rate, batch size) are referenced as being in Appendix C.2 but were not included in the provided text.

📊 Experiments & Results

Evaluation Setup

Offline simulation of user interaction with recommender system exposure lists

Benchmarks:

MIND (News Recommendation)
7 other domains (Movies, Books, etc.; the full list of names is not given in the provided text)

Metrics:

Accuracy (matching real user behavior)
Uncertainty (Epistemic vs Aleatoric)
Statistical methodology: Not explicitly reported in the paper
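The accuracy metric, read as exact match between the simulated and the logged action, can be computed as below. This is a minimal sketch; the paper may aggregate per dataset or per user, which the summary does not specify.

```python
def accuracy(predicted, actual):
    """Fraction of scenes where the simulated action equals the logged action."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        return 0.0
    hits = sum(p == a for p, a in zip(predicted, actual))
    return hits / len(actual)


score = accuracy(["A", "C", "B", "B"], ["A", "B", "B", "D"])  # 2 of 4 match
```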

Experiment Figures

Comparison of uncertainty distributions between a Strong Model (Qwen-2.5-32B) and a Weak Model (Llama-3.2-3B).

Main Takeaways

Quantitative results were not included in the provided text snippet, but the preliminary experiments described for Figure 2 indicate that stronger base LLMs (Qwen-2.5-32B, GPT-5) naturally align better with user preferences than weaker ones (Llama-3.2-3B).
Fine-tuning on user feedback significantly improves the alignment of weaker models (Llama-3.2-3B) with real user behavior.
Weaker LLMs exhibit higher epistemic uncertainty in complex scenes, validating the use of uncertainty decomposition to identify 'challenging' samples for training.
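For readers unfamiliar with uncertainty decomposition, the sketch below shows the standard entropy split over a set of predictive distributions, one per sampled decision process: total entropy of the averaged prediction equals the average per-sample entropy plus a disagreement (mutual information) term. Which of the two terms the paper labels epistemic is not stated in this summary, and conventions differ between Bayesian ensembles and clarification-based methods, so the function returns both unlabeled. The names entropy and decompose are illustrative.

```python
import math


def entropy(p):
    """Shannon entropy (nats) of a discrete distribution."""
    return -sum(x * math.log(x) for x in p if x > 0)


def decompose(distributions):
    """Split total predictive entropy into expected entropy and mutual information.

    distributions: one action distribution per sampled decision process.
    Returns (total, expected_entropy, disagreement).
    """
    n = len(distributions)
    mean = [sum(col) / n for col in zip(*distributions)]
    total = entropy(mean)
    expected = sum(entropy(p) for p in distributions) / n
    return total, expected, total - expected


# Samples agree with each other: the disagreement term is zero.
agree = decompose([[0.5, 0.5], [0.5, 0.5]])
# Samples are individually confident but contradict each other.
disagree = decompose([[1.0, 0.0], [0.0, 1.0]])
```

Both cases have the same total entropy (ln 2), but it comes entirely from per-sample ambiguity in the first case and entirely from disagreement between samples in the second.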

📚 Prerequisite Knowledge

Prerequisites

Recommender Systems (RS) basics (exposure, interaction history)
Large Language Models (LLMs) and fine-tuning
Bayesian Uncertainty (Aleatoric vs. Epistemic)

Key Terms

RS: Recommender Systems, i.e., algorithms that suggest items to users

User Simulator: An AI agent designed to mimic user behavior for offline testing of recommender systems

EKB Model: Engel-Kollat-Blackwell model, a theoretical framework describing consumer decision-making steps (Stimulus → Knowledge → Evaluation)

Epistemic Uncertainty: Uncertainty stemming from the model's lack of knowledge or capability (can be reduced with more data/training), as opposed to data noise

Aleatoric Uncertainty: Uncertainty inherent in the data itself (ambiguity, noise) that cannot be reduced by better modeling

Decision-making Process: An explicit text rationale generated by the LLM explaining why an item was chosen (e.g., analyzing features, weighing options)

Exposure List: The specific list of items presented to a user at a given time, from which they made a choice