LifeSim: Long-Horizon User Life Simulator for Personalized Assistant Evaluation

📝 Paper Summary

Personalized Assistant Benchmarking, User Simulation

LifeSim evaluates personalized assistants by simulating users with evolving internal cognitive states and realistic life trajectories, revealing that current models struggle with implicit intentions over long contexts.

Core Problem

Existing benchmarks for AI assistants rely on static, short-context datasets that fail to capture the complexity of real-world interactions where user needs evolve based on dynamic external environments and internal cognitive states.

Why it matters:

Real-world user needs are shaped by temporal and situational contexts (e.g., location, weather), which static Q&A benchmarks cannot replicate
Privacy constraints limit access to real long-term interaction logs, creating a blind spot in evaluating how models handle personal evolution over time
Current evaluation methods overlook 'implicit intentions'—needs that are not explicitly stated but must be inferred from long-term history and habits

Concrete Example: A user with a soy allergy asks 'Can you recommend a quick lunch?' A standard model might suggest a chicken salad with soy dressing. A personalized assistant should infer the implicit constraint from past history (the allergy) and current context (time/location) to recommend a compliant meal.
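The soy-allergy example can be sketched as a constraint filter. This is a minimal illustration, not the paper's method: the `Meal` and `UserHistory` types, the `recommend_quick_lunch` helper, and the 15-minute cutoff are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Meal:
    name: str
    ingredients: set[str]
    prep_minutes: int


@dataclass
class UserHistory:
    # Constraints learned from earlier conversations; the user never restates them.
    allergens: set[str] = field(default_factory=set)


def recommend_quick_lunch(meals, history, max_minutes=15):
    """Keep meals that are fast enough and contain none of the known allergens."""
    return [
        m.name
        for m in meals
        if m.prep_minutes <= max_minutes and not (m.ingredients & history.allergens)
    ]


meals = [
    Meal("chicken salad with soy dressing", {"chicken", "lettuce", "soy"}, 10),
    Meal("grilled chicken wrap", {"chicken", "tortilla", "tomato"}, 12),
    Meal("slow-cooked stew", {"beef", "carrot"}, 90),
]
history = UserHistory(allergens={"soy"})
print(recommend_quick_lunch(meals, history))  # ['grilled chicken wrap']
```

An assistant that ignores `history` would return the soy-dressed salad as well, which is the failure mode the example describes.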

Key Novelty

Belief-Desire-Intention (BDI) Grounded User Simulation

Models the user not just as a profile, but as a cognitive agent with a BDI architecture: 'Beliefs' (world view), 'Desires' (potential goals), and 'Intentions' (committed actions)
Integrates an Event Engine that generates life trajectories (events like gym, work, dining) strictly grounded in real-world mobility data and Lewin’s equation (Behavior = f(Person, Environment))
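A toy sketch of how a BDI state and Lewin's equation fit together is given here. LifeSim drives these decisions with an LLM; the rule-based `behave` function, the `Environment` and `BDIState` types, and the rain rule are assumptions made only to keep the example runnable.

```python
from dataclasses import dataclass, field


@dataclass
class Environment:
    time_of_day: str
    location: str
    weather: str


@dataclass
class BDIState:
    beliefs: dict[str, str] = field(default_factory=dict)  # world view
    desires: list[str] = field(default_factory=list)       # potential goals
    intentions: list[str] = field(default_factory=list)    # committed actions


def behave(person: BDIState, env: Environment) -> str:
    """Lewin's equation B = f(P, E): behavior depends on the person and the environment."""
    for desire in person.desires:
        # Hypothetical rule: rain blocks outdoor running, so that desire is skipped.
        if desire == "go for a run" and env.weather == "rain":
            continue
        person.intentions.append(desire)  # the desire becomes a committed intention
        return desire
    return "stay idle"


user = BDIState(desires=["go for a run", "cook dinner"])
print(behave(user, Environment("evening", "home", "rain")))  # cook dinner
```

The same person yields different behavior under a different environment, which is the point of coupling the cognitive state to an external event stream.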

Architecture

Overview of the LifeSim framework components and data flow

Evaluation Highlights

GPT-5 drops 27.3 points from Explicit Intent Recognition (79.5) to Implicit Intent Recognition (52.2), exposing a gap in inferring needs the user does not state
DeepSeek-V3.2 achieves the top Persona Alignment score (75.5), ahead of GPT-4o (74.1); Claude-Sonnet-4.5 is listed at the same 75.5, so the two are tied on the figures given here
Long-horizon evaluation shows that Explicit Intention Completion stays stable (~85 for Qwen3-32B) as context grows to 16K tokens, while Implicit Intention Completion falls from 49 to 18
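The arithmetic behind these highlights can be checked with a short script. The helper names `point_gap` and `relative_drop` are illustrative; the relative drop is derived from the reported scores and is not a figure stated in the paper.

```python
def point_gap(higher: float, lower: float) -> float:
    """Absolute difference in score points, rounded to one decimal."""
    return round(higher - lower, 1)


def relative_drop(start: float, end: float) -> float:
    """Percentage of the starting score that was lost."""
    return round(100 * (start - end) / start, 1)


# GPT-5: explicit vs. implicit intent recognition.
gpt5_gap = point_gap(79.5, 52.2)
# Implicit intention completion as context grows to 16K tokens.
implicit_drop = relative_drop(49, 18)

print(gpt5_gap)       # 27.3
print(implicit_drop)  # 63.3
```

On the reported numbers, implicit intention completion loses roughly 63% of its starting score over the longer context.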

Breakthrough Assessment

8/10

Proposes a highly sophisticated, cognitively grounded simulation framework that exposes significant weaknesses in state-of-the-art models regarding implicit reasoning, addressing a major gap in personalization benchmarks.

⚙️ Technical Details

Problem Definition

Setting: Simulated multi-turn user-assistant interaction over long horizons with evolving contexts

Inputs: User profile P, Environmental context Env(Time, Location, Weather), Dialogue History H

Outputs: Assistant Response R aimed at satisfying user Intention I

Pipeline Flow

Group: Simulation Generation: Cognitive Engine (Generates Intent) → Event Engine (Grounds in Space and Time) → User Behavior Engine (Generates Dialogue)
Group: Evaluation: Evaluator Model (Judges Assistant Response)
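The data flow above can be sketched as four chained functions. Every function body here is a stub invented for illustration: in LifeSim the cognitive and behavior stages are LLM-backed and the evaluator is an LLM judge, none of which is reproduced in this sketch.

```python
from typing import Callable


def cognitive_engine(profile: dict, env: dict) -> str:
    """Select the next intention (stub rule instead of an LLM call)."""
    return "find lunch" if env["time"] == "12:00" else "plan evening"


def event_engine(intention: str, env: dict) -> dict:
    """Attach the intention to a concrete time and place."""
    return {"intention": intention, "place": env["location"], "time": env["time"]}


def behavior_engine(event: dict) -> str:
    """Turn the grounded event into a user utterance."""
    return f"I'm at {event['place']} and want to {event['intention']}."


def evaluator(response: str, intention: str) -> bool:
    """Stub judge: does the response mention the last word of the intention?"""
    return intention.split()[-1] in response.lower()


def run_scenario(profile: dict, env: dict, assistant: Callable[[str], str]):
    intention = cognitive_engine(profile, env)
    event = event_engine(intention, env)
    utterance = behavior_engine(event)
    response = assistant(utterance)
    return utterance, response, evaluator(response, intention)


env = {"time": "12:00", "location": "the office"}
print(run_scenario({}, env, lambda u: "Here is a lunch spot nearby."))
```

Passing the assistant in as a callable keeps the simulator independent of the model under test, so different assistants can be scored on the same scenario.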

System Modules

Cognitive Engine (Simulation Generation)

Determines the user's internal mental state and selects the next intention

Model or implementation: DeepSeek-V3.2

Event Engine (Simulation Generation)

Constructs realistic spatiotemporal life trajectories

Model or implementation: Algorithmic / Probabilistic Logic

User Behavior Engine (Simulation Generation)

Generates the actual conversational utterances for the simulated user

Model or implementation: Qwen3-32B

Novel Architectural Elements

Integration of BDI (Belief-Desire-Intention) logic into LLM-based user simulation to model internal consistency
Coupling of internal cognitive states with an external Event Engine grounded in real-world POI/mobility data

Modeling

Base Model: DeepSeek-V3.2 (Simulator Backbone), Qwen3-32B (User Agent)

Training Method: Not applicable - The paper introduces a simulator and benchmark, not a trained model

Adaptation: None

Trainable Parameters: None

Key Hyperparameters:

temperature: 1.0
memory_similarity_threshold: 0.7
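The summary does not say how `memory_similarity_threshold` is applied. One plausible reading, shown here purely as an assumption, is that a stored memory entry is retrieved only when its cosine similarity to the query embedding reaches the threshold. The `retrieve` helper and the two-dimensional vectors are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query_vec, memory, threshold=0.7):
    """Return memory texts whose similarity meets the threshold, best match first."""
    scored = [(cosine(query_vec, vec), text) for text, vec in memory]
    return [text for score, text in sorted(scored, reverse=True) if score >= threshold]


memory = [
    ("allergic to soy", [1.0, 0.0]),
    ("likes jazz", [0.0, 1.0]),
    ("prefers quick meals", [0.8, 0.6]),
]
print(retrieve([1.0, 0.0], memory))  # ['allergic to soy', 'prefers quick meals']
```

Under this reading, a higher threshold retrieves fewer and more relevant memories, while a lower one admits more loosely related entries.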

Compute: Experiments conducted on 8 NVIDIA RTX 4090 GPUs. Inference-only evaluation.

Comparison to Prior Work

vs. UserSimCRS: LifeSim uses LLM-driven BDI cognitive modeling vs. rule/agenda-based generation
vs. ProPerSim: LifeSim supports multi-turn dialogue interactions vs. single-turn recommendations
vs. Sotopia: LifeSim integrates realistic physical environments (time/location/weather) impacting user needs vs. purely social relationship focus
vs. Agent-based User Simulators [not cited in paper]: LifeSim explicitly models implicit intentions arising from long-term history, whereas most user simulators focus on explicit goal achievement

Limitations

Currently excludes high-stakes domains like healthcare, legal, and finance due to complexity and risk
Relies purely on textual interactions, missing multimodal signals (visual context, physiological data) relevant to real-world assistants
Simulator quality depends on the capability of the underlying LLMs (DeepSeek-V3.2 and Qwen3-32B), which may introduce bias

Reproducibility

Code: https://github.com/dfy37/lifesim

Code and data are available at the repository above. The user pool is constructed from SocioVerse (Twitter) and AlignX, and mobility data comes from Foursquare. The benchmark explicitly defines 1,200 evaluation scenarios.

📊 Experiments & Results

Evaluation Setup

LLM-as-a-Judge evaluation of assistant responses in simulated scenarios

Benchmarks:

LifeSim-Eval (Long-horizon Personalized Assistance) [New]

Metrics:

Intent Recognition (Explicit/Implicit)
Intent Completion (Explicit/Implicit)
Naturalness
Coherence
Profile Recovery
Persona Alignment

Statistical methodology: Krippendorff’s alpha used to measure agreement between LLM judges and human annotators (alpha = 0.80)
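For readers unfamiliar with the statistic, here is a minimal sketch of Krippendorff's alpha for nominal labels, computed from a coincidence matrix. Treating the judgments as nominal is an assumption (the paper's scores may be ordinal or interval, which use a different distance function), and the function name and sample labels are illustrative.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """units: one list of labels per rated item, one label per rater who rated it."""
    coincidences = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue  # an item with a single rating cannot be paired
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    totals = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())

    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        raise ValueError("alpha is undefined when only one label value occurs")
    return 1.0 - observed / expected


# Each inner list: [LLM judge label, human label] for one response.
ratings = [["pass", "pass"], ["fail", "fail"], ["pass", "fail"]]
print(round(krippendorff_alpha_nominal(ratings), 4))  # 0.4444
```

A value of 1.0 means perfect agreement and 0.0 means agreement no better than chance, so the reported 0.80 indicates that the LLM judges track human annotators closely.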

Key Results

Comparison of models on Single-Scenario tasks, showing the significant gap between explicit and implicit intention handling.

Benchmark	Metric	Baseline	This Paper	Δ
LifeSim-Eval	Intent Recognition (Explicit)	71.7	79.5	+7.8
LifeSim-Eval	Intent Recognition (Implicit)	79.5	52.2	-27.3
LifeSim-Eval	Intent Completion (Implicit)	35.5	48.9	+13.4
LifeSim-Eval	Persona Alignment	74.1	75.5	+1.4

Ablation study on the effect of incorporating a Profile Memory module for long-horizon preference recovery.

Benchmark	Metric	Baseline	This Paper	Δ
LifeSim-Eval	Profile Recovery (Scenario 10)	60	68	+8

Experiment Figures

Heatmaps comparing Implicit vs Explicit Intention Completion scores across conversation history lengths (1K to 16K tokens) for various models

Main Takeaways

Current LLMs exhibit a profound gap between explicit instruction following and implicit intention recognition, often failing to infer needs from context
Performance in satisfying implicit intentions degrades significantly as conversation history length increases (up to 16K tokens), indicating poor long-context reasoning for personalization
While external memory modules improve profile recovery, they do not guarantee better reasoning; some models (e.g., Qwen3-8B) show negligible improvement with memory, suggesting reasoning capability is the bottleneck

📚 Prerequisite Knowledge

Prerequisites

Large Language Models (LLMs) and prompting
User Simulation methodology
Belief-Desire-Intention (BDI) cognitive architecture

Key Terms

BDI: Belief-Desire-Intention—a software model of practical reasoning used to program intelligent agents, separating agent state into beliefs (what they know), desires (what they want), and intentions (what they choose to do)

Implicit Intention: User goals that are not explicitly stated in the current utterance but must be inferred from context, history, or user constraints (e.g., 'I'm hungry' + history of veganism = 'I want vegan food')

POI: Point of Interest—a specific location (e.g., a gym, a restaurant) used in mobility datasets to ground user trajectories

LifeSim-Eval: The benchmark suite proposed in this paper, consisting of 1,200 scenarios across 8 life domains generated by the LifeSim simulator

IPF: Iterative Proportional Fitting—a statistical procedure used here to balance user sampling distributions
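As a rough illustration of Iterative Proportional Fitting, the sketch below rescales a two-way table until its row and column sums match target marginals. The seed table, the targets, and the `ipf` function are made-up examples and say nothing about how the paper configures its user sampling.

```python
def ipf(table, row_targets, col_targets, iterations=100):
    """Alternately rescale rows and columns so their sums approach the targets."""
    table = [row[:] for row in table]  # work on a copy
    for _ in range(iterations):
        # Row step: scale each row to its target sum.
        for i, row in enumerate(table):
            total = sum(row)
            if total:
                table[i] = [v * row_targets[i] / total for v in row]
        # Column step: scale each column to its target sum.
        for j in range(len(col_targets)):
            total = sum(row[j] for row in table)
            if total:
                for row in table:
                    row[j] *= col_targets[j] / total
    return table


# Hypothetical seed counts (e.g., age group by region) and desired marginals.
seed = [[1.0, 2.0], [3.0, 4.0]]
fitted = ipf(seed, row_targets=[40.0, 60.0], col_targets=[50.0, 50.0])
for row in fitted:
    print([round(v, 2) for v in row])
```

The row and column targets must sum to the same total (100 here); the procedure preserves the association structure of the seed table while matching the marginals.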