LLM-Augmented Digital Twin for Policy Evaluation in Short-Video Platforms

📝 Paper Summary

Multi-agent simulation, Social platform simulation

A modular digital twin architecture for short-video platforms that integrates selective LLM usage with event-driven simulation to enable safe, counterfactual policy evaluation under realistic closed-loop feedback.

Core Problem

Evaluating policies on short-video platforms is difficult because production A/B tests are risky and noisy, while existing simulators lack the semantic realism and strategic user adaptation needed for accurate counterfactuals.

Why it matters:

Platform interventions (ranking, moderation) are ethically sensitive and can introduce unfairness or social harm if deployed without rigorous testing
Closed-loop feedback (exposure shapes behavior which shapes future exposure) makes causal attribution difficult in live environments
Existing agent-based simulators rely on simplified rules that miss semantic nuances, while pure LLM agents are too slow and expensive for platform-scale simulation

Concrete Example: A change in recommendation logic might initially boost engagement, but if creators strategically adapt by producing lower-quality clickbait, long-term retention drops. Standard simulators with static agent rules fail to predict this co-evolution.

Key Novelty

Four-Twin Architecture with Tiered LLM Execution

Decomposes the ecosystem into four distinct 'twins' (User, Content, Interaction, Platform) interacting solely through a typed event bus, allowing isolated replacement of policy components
Implements a 'Live/Cached/Surrogate' execution tier that selectively uses LLMs for high-value tasks (personas, captions) and falls back to cheaper heuristics to balance fidelity with scale
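The tiered execution idea can be illustrated with a minimal sketch. The class name `TieredGenerator`, the call-count budget, and the stand-in functions `fake_llm` and `template_caption` are all hypothetical; the summary does not specify how the paper orders the tiers or measures budget, so this shows one plausible arrangement (check the cache, then spend live budget, then fall back).

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Tuple


@dataclass
class TieredGenerator:
    """Hypothetical Live -> Cached -> Surrogate dispatcher (illustrative only)."""
    live_fn: Callable[[str], str]        # stand-in for a real LLM call
    surrogate_fn: Callable[[str], str]   # cheap heuristic fallback
    budget: int                          # assumed budget: max number of live calls
    cache: Dict[str, str] = field(default_factory=dict)
    live_calls: int = 0

    def generate(self, prompt: str) -> Tuple[str, str]:
        """Return (text, tier_used)."""
        if prompt in self.cache:             # cached tier: reuse an earlier live output
            return self.cache[prompt], "cached"
        if self.live_calls < self.budget:    # live tier: spend budget, then memoize
            self.live_calls += 1
            text = self.live_fn(prompt)
            self.cache[prompt] = text
            return text, "live"
        return self.surrogate_fn(prompt), "surrogate"  # budget exhausted


def fake_llm(prompt: str) -> str:
    # Placeholder for an LLM request; no real API is called here.
    return f"LLM caption for: {prompt}"


def template_caption(prompt: str) -> str:
    # Rule-based surrogate: a fixed template instead of generated text.
    return f"Watch this {prompt} video"


gen = TieredGenerator(live_fn=fake_llm, surrogate_fn=template_caption, budget=1)
tiers = [gen.generate(p)[1] for p in ["cooking", "cooking", "travel"]]
print(tiers)
```

With a budget of one live call, the first prompt goes to the live tier, the repeated prompt is served from cache, and the new prompt falls through to the surrogate.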

Breakthrough Assessment

7/10

Proposes a robust architecture for a critical industrial problem (platform policy evaluation). The hybrid execution model addresses the key cost/realism bottleneck in LLM simulations, though experimental validation is missing from the provided text.

⚙️ Technical Details

Problem Definition

Setting: Simulation of a closed-loop short-video ecosystem involving users, creators, content, and platform algorithms

Inputs: Platform policy configurations (ranking rules, promotion thresholds) and initial agent population parameters

Outputs: Aggregate ecosystem metrics (engagement, retention, trend lifecycle) and counterfactual policy outcomes

Pipeline Flow

User Twin (Emits actions)
Orchestrator (Routes actions to Platform)
Platform Twin (Executes logic, updates Registry)
Content/Interaction Twins (Compute outcomes)
Event Bus (Propagates typed events to update state)
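A typed event bus of the kind this flow relies on might look like the sketch below. The event classes `ActionEmitted` and `OutcomeComputed`, the `EventBus` API, and the list used as a registry are assumptions made for illustration, not the paper's interfaces; the Orchestrator and Platform Twin steps are collapsed into direct subscriptions to keep the example short.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, DefaultDict, List


@dataclass(frozen=True)
class ActionEmitted:
    """Hypothetical event: a User Twin agent acts on a video."""
    user_id: int
    video_id: int


@dataclass(frozen=True)
class OutcomeComputed:
    """Hypothetical event: the Interaction Twin reports a result."""
    user_id: int
    video_id: int
    watch_seconds: float


class EventBus:
    """Modules subscribe to event types instead of calling each other directly."""

    def __init__(self) -> None:
        self._handlers: DefaultDict[type, List[Callable]] = defaultdict(list)

    def subscribe(self, event_type: type, handler: Callable) -> None:
        self._handlers[event_type].append(handler)

    def publish(self, event: object) -> None:
        # Dispatch on the exact event type; handlers run synchronously.
        for handler in list(self._handlers[type(event)]):
            handler(event)


bus = EventBus()
registry: List[OutcomeComputed] = []  # stand-in for the Platform Twin's registry


def interaction_twin(evt: ActionEmitted) -> None:
    # A fixed watch time keeps the example deterministic.
    bus.publish(OutcomeComputed(evt.user_id, evt.video_id, watch_seconds=12.5))


bus.subscribe(ActionEmitted, interaction_twin)
bus.subscribe(OutcomeComputed, registry.append)
bus.publish(ActionEmitted(user_id=1, video_id=42))
print(registry)
```

Because handlers only see events, a module such as `interaction_twin` can be swapped for another implementation without touching the publisher.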

System Modules

User Twin

Models autonomous user agents with static attributes, evolving latent preferences, and memory

Model or implementation: Hybrid: LLM for persona generation; Ebbinghaus curve for memory; Latent vectors for preferences
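As a rough illustration of Ebbinghaus-style memory decay, the sketch below uses the common exponential form R = exp(-t / S). The summary does not give the paper's exact parameterization, so the `stability` constant, the `boost` size, and the function names are assumptions.

```python
import math


def retention(elapsed: float, stability: float) -> float:
    """Exponential forgetting curve R = exp(-elapsed / stability)."""
    if stability <= 0:
        raise ValueError("stability must be positive")
    return math.exp(-elapsed / stability)


def decay_preference(weight: float, elapsed: float, stability: float = 5.0) -> float:
    """Shrink a topic preference weight by the retention factor."""
    return weight * retention(elapsed, stability)


def reinforce(weight: float, boost: float = 0.2) -> float:
    """Exposure to the topic pushes the weight back up, capped at 1.0."""
    return min(1.0, weight + boost)


w = 0.8
for step in (1.0, 5.0, 10.0):
    print(step, round(decay_preference(w, step), 4))
```

A larger `stability` makes a preference fade more slowly; calling `reinforce` after an exposure models the "if not reinforced" condition in the forgetting-curve description.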

Content Twin

Manages the corpus of short videos, storing abstract feature profiles rather than raw media

Model or implementation: Hybrid: LLM for captions/titles; Statistical archetype generator for metadata

Interaction Twin

Micro-level behavior engine that calculates the outcome of specific user-content encounters

Model or implementation: Calibrated probabilistic models (Hook response -> Interest match -> Watch time -> Engagement)
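The staged chain (Hook response -> Interest match -> Watch time -> Engagement) could be sketched as follows. All functional forms and constants here (cosine similarity for interest match, Gaussian noise on watch fraction, a 0.5 like-rate multiplier) are placeholder assumptions, not the paper's calibrated models.

```python
import math
import random
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Outcome:
    hooked: bool
    watch_fraction: float
    liked: bool


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def simulate_encounter(user_pref: Sequence[float], video_feat: Sequence[float],
                       hook_strength: float, rng: random.Random) -> Outcome:
    # Stage 1, hook response: the user scrolls past unless the hook lands.
    if rng.random() >= hook_strength:
        return Outcome(False, 0.0, False)
    # Stage 2, interest match: similarity of latent preference and content features.
    match = max(0.0, cosine(user_pref, video_feat))
    # Stage 3, watch time: noisy fraction of the video watched, clipped to [0, 1].
    watch_fraction = min(1.0, max(0.0, rng.gauss(match, 0.1)))
    # Stage 4, engagement: like probability grows with the fraction watched.
    liked = rng.random() < 0.5 * watch_fraction
    return Outcome(True, watch_fraction, liked)


rng = random.Random(0)
print(simulate_encounter([1.0, 0.0], [0.9, 0.1], hook_strength=0.7, rng=rng))
```

Each stage gates the next, so a weak hook yields zero watch time regardless of how well the content matches the user's interests.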

Platform Twin

Encapsulates platform policies (Recommendation, Governance, Promotion) and maintains the system of record

Model or implementation: Configurable policy components + Database-backed registry

Novel Architectural Elements

Four-twin modularity (User, Content, Interaction, Platform) decoupling agents from the environment and the policy logic
Three-tier optimizer (Live LLM -> Cached -> Surrogate) for budget-aware simulation execution
Explicit separation of 'Control State' (policy inputs) and 'Observational State' (realized outcomes) within the Platform Twin
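One way to picture the Control State / Observational State split is with two separate data structures, as in the sketch below. The field names (`ranking_rule`, `promotion_threshold`, `views`) and the `with_policy` helper are hypothetical; the summary only states that the two kinds of state are kept apart inside the Platform Twin.

```python
from dataclasses import dataclass, field, replace
from typing import Dict


@dataclass(frozen=True)
class ControlState:
    """Policy inputs: immutable, set by the experimenter (illustrative fields)."""
    ranking_rule: str = "engagement"
    promotion_threshold: float = 0.8


@dataclass
class ObservationalState:
    """Realized outcomes: written only as the simulation runs."""
    views: Dict[int, int] = field(default_factory=dict)

    def record_view(self, video_id: int) -> None:
        self.views[video_id] = self.views.get(video_id, 0) + 1


@dataclass
class PlatformTwin:
    control: ControlState
    observed: ObservationalState = field(default_factory=ObservationalState)

    def with_policy(self, **changes) -> "PlatformTwin":
        # A counterfactual twin: modified control state, fresh observations.
        return PlatformTwin(control=replace(self.control, **changes))


baseline = PlatformTwin(ControlState())
baseline.observed.record_view(7)
variant = baseline.with_policy(promotion_threshold=0.5)
print(variant.control, variant.observed.views)
```

Freezing the control state makes it harder for simulated outcomes to mutate policy inputs by accident, which is the failure mode the separation is meant to prevent.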

Comparison to Prior Work

vs. OASIS: This paper adds a modular policy interface and explicit closed-loop feedback mechanisms, plus LLM semantic augmentation which OASIS lacks
vs. Generative Agents: This paper scales to platform-level simulation by using LLMs selectively (hybrid architecture) rather than for every step, enabling higher throughput
vs. RecSim [not cited in paper]: RecSim focuses on RL environments for recommenders; this paper adds rich semantic agent modeling via LLMs and explicit social graph dynamics

Limitations

No experimental results or quantitative validation are available in the provided text snippet
Reliance on surrogate models for scale may introduce approximation errors compared to full LLM execution
The complexity of managing four synchronized twins via an event bus may introduce latency overhead

Reproducibility

The provided text does not contain a link to the code repository or specific experimental hyperparameters, as the text is truncated before the Experiments section. The system relies on the OASIS infrastructure as a base.

📊 Experiments & Results

Evaluation Setup

Not reported in the provided text (truncated)

Metrics: Not reported in the provided text (truncated)

Statistical methodology: Not explicitly reported in the paper

Main Takeaways

The paper proposes a novel architecture but the provided text ends before the 'Experiments' section, so no quantitative results are available.
The design emphasizes modularity to allow 'counterfactual' experiments—swapping out a recommendation policy while keeping the user population fixed.
Cost management is central to the design: the system falls back to cached or surrogate (rule-based) outcomes when LLM budgets are exceeded.
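The counterfactual pattern in the takeaways above (swap the policy, hold the population fixed) can be shown with a toy harness. Everything here is a simplifying assumption: scalar user tastes and video topics, the two example policies, and the engagement score `1 - |taste - topic|`. The harness is also open-loop, so it omits the closed-loop feedback that the paper's design targets.

```python
import random
from typing import Callable, List, Tuple

Policy = Callable[[float, List[float]], int]  # (user taste, video topics) -> index shown


def make_population(n_users: int, n_videos: int, seed: int) -> Tuple[List[float], List[float]]:
    """Seeded generation so every policy sees the identical population."""
    rng = random.Random(seed)
    users = [rng.random() for _ in range(n_users)]
    videos = [rng.random() for _ in range(n_videos)]
    return users, videos


def fixed_slot_policy(taste: float, videos: List[float]) -> int:
    return 0  # always show the same video, ignoring the user


def personalized_policy(taste: float, videos: List[float]) -> int:
    # Show the video whose topic is closest to the user's taste.
    return min(range(len(videos)), key=lambda i: abs(videos[i] - taste))


def run(policy: Policy, users: List[float], videos: List[float]) -> float:
    """Mean engagement, where engagement = 1 - |taste - topic of shown video|."""
    total = 0.0
    for taste in users:
        shown = videos[policy(taste, videos)]
        total += 1.0 - abs(shown - taste)
    return total / len(users)


users, videos = make_population(n_users=100, n_videos=20, seed=0)
baseline = run(fixed_slot_policy, users, videos)
variant = run(personalized_policy, users, videos)
print(f"baseline={baseline:.3f} variant={variant:.3f}")
```

Because both runs share one seeded population, any difference in the metric is attributable to the policy swap alone, which is the property a live A/B test struggles to guarantee.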

📚 Prerequisite Knowledge

Prerequisites

Agent-based modeling (ABM)
Recommender systems architecture
Basics of Large Language Models (LLMs)

Key Terms

Digital Twin: A virtual replica of a physical system (here, a social platform) used to simulate dynamics and test interventions without risking the real environment

Closed-loop feedback: A system dynamic where current outputs (e.g., user interactions) become training data or inputs for future system decisions (e.g., recommendations), creating circular dependencies

Ebbinghaus forgetting curve: A mathematical model of how memory retention declines over time, used here to simulate how user preferences for specific topics decay if not reinforced

Surrogate model: A simplified, computationally cheap model (e.g., rule-based or statistical) used as a fallback for complex models (like LLMs) to save resources

Persona: A structured description of a user agent's identity, background, and behavioral tendencies used to guide their actions and preferences

Event bus: A software architecture pattern where modules communicate by emitting and listening for events, rather than calling each other directly