PersonaAgent: When Large Language Model Agents Meet Personalization at Test Time

📝 Paper Summary

User-profile based personalization
Layered memory (Episodic + Semantic)

PersonaAgent personalizes LLM agents at test time by treating the system prompt as a dynamic 'persona' that is iteratively optimized via textual feedback from past user interactions.

Core Problem

Current LLM agents adopt a one-size-fits-all approach that fails to adapt to specific user preferences, while fine-tuning is computationally prohibitive for individual users and standard RAG is too rigid.

Why it matters:

Users have distinct preferences (e.g., movie tastes, writing styles) that generic agents ignore, leading to suboptimal engagement
Real-world deployment requires scaling to millions of users, making per-user parameter updates (fine-tuning) infeasible due to latency and cost
Existing memory-based agents often just retrieve context without adjusting their underlying reasoning or tool-use strategy

Concrete Example: In a movie tagging task, User A prefers historical films while User C prefers sci-fi. A standard agent provides generic tags for both. PersonaAgent analyzes User C's history, rewrites its own system prompt to 'Prioritize literary connections and book-to-film adaptations,' and then correctly tags a movie based on those specific interests.

Key Novelty

Test-Time User-Preference Alignment via Persona Optimization

Defines a 'persona' not just as a static character, but as a mutable system prompt that governs tool use and memory retrieval
Uses a 'textual gradient' loop: before serving the user, the agent simulates responses to past user queries, critiques each mismatch against the ground-truth user response, and rewrites its own persona prompt to reduce this 'textual loss'

Architecture

The complete PersonaAgent framework interacting with a user, highlighting the flow between Memory, Persona, and Action modules

Evaluation Highlights

+5.7 percentage points of accuracy on LaMP-1 (Citation Identification) compared to MemBank (state-of-the-art memory agent)
Reduces Mean Absolute Error (MAE) by 23% on LaMP-3 (Product Rating) compared to ReAct, showing superior alignment with numeric user preferences
Achieves 55.0% accuracy on LaMP-2M using Claude-3.7, consistently outperforming baselines across model sizes (Mistral to Claude)
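
The LaMP-1 and LaMP-3 figures above can be checked against the values reported in the Key Results table of this summary. The sketch below only redoes that arithmetic; the variable names are chosen here for illustration.

```python
# Values copied from the Key Results table of this summary.
lamp1_baseline, lamp1_persona_agent = 0.862, 0.919  # accuracy, higher is better
lamp3_baseline, lamp3_persona_agent = 0.313, 0.241  # MAE, lower is better

# Absolute accuracy gain on LaMP-1, expressed in percentage points.
lamp1_gain_points = (lamp1_persona_agent - lamp1_baseline) * 100

# Relative MAE reduction on LaMP-3, expressed in percent of the baseline MAE.
lamp3_reduction_pct = (lamp3_baseline - lamp3_persona_agent) / lamp3_baseline * 100

print(f"LaMP-1 gain: {lamp1_gain_points:.1f} points")          # 5.7 points
print(f"LaMP-3 MAE reduction: {lamp3_reduction_pct:.1f}%")     # 23.0%
```

The accuracy gain is an absolute difference, while the MAE reduction is relative to the baseline error, so the two headline numbers are not directly comparable.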

Breakthrough Assessment

8/10

Significantly advances personalization by replacing expensive fine-tuning with efficient test-time prompt optimization. The unified memory-action-persona framework is a strong architectural contribution for agentic AI.

⚙️ Technical Details

Problem Definition

Setting: Personalized decision-making where an agent must select actions a_t and generate responses aligned with a specific user's history D_u

Inputs: Current query q*, User Interaction History D_u (past queries/responses), Initial Persona P

Outputs: Personalized Action a_t and Final Response

Pipeline Flow

Memory Retrieval & Profiling: Retrieve episodic history + summarize semantic profile
Test-Time Alignment Loop: Simulate interaction → Generate Textual Loss → Update Persona Prompt
Inference: Optimized Persona guides Action Selection → Final Response
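
The alignment stage in the flow above (simulate, critique, update) can be sketched as a short loop. This is a minimal illustration, not the paper's implementation: the `LLM` callable, the prompt wording, the `align_persona` name, and the stub model are all assumptions introduced here, and the actual prompt templates are the ones in the paper's appendices.

```python
from typing import Callable, List, Tuple

# Hypothetical interface: prompt text in, completion text out.
LLM = Callable[[str], str]


def align_persona(
    persona: str,
    history: List[Tuple[str, str]],  # (past query, ground-truth user response)
    llm: LLM,
    iterations: int = 3,
) -> str:
    """Iteratively rewrite the persona prompt from textual feedback."""
    for _ in range(iterations):
        feedback = []
        for query, truth in history:
            # 1. Simulate: answer a past query under the current persona.
            simulated = llm(f"{persona}\n\nUser query: {query}")
            # 2. Textual loss: a natural-language critique of the mismatch.
            critique = llm(
                "Compare the agent response with the user's actual response "
                "and describe what the persona should change.\n"
                f"Query: {query}\nAgent: {simulated}\nUser: {truth}"
            )
            feedback.append(critique)
        # 3. Update: rewrite the persona using the collected critiques.
        persona = llm(
            "Rewrite the persona below so that it addresses the feedback.\n"
            f"Persona: {persona}\nFeedback:\n" + "\n".join(feedback)
        )
    return persona


def stub_llm(prompt: str) -> str:
    # Deterministic stand-in so the sketch runs without any API access.
    if prompt.startswith("Rewrite the persona"):
        return "You are a helpful assistant. Prioritize sci-fi tags."
    if prompt.startswith("Compare the agent response"):
        return "The user favors sci-fi; the agent tagged generically."
    return "generic tag"


optimized = align_persona(
    "You are a helpful assistant.",
    [("Tag the movie 'Dune'", "sci-fi")],  # illustrative history entry
    stub_llm,
    iterations=2,
)
print(optimized)
```

No model parameters change anywhere in the loop; the only state carried between iterations is the persona string, which is what makes the approach usable at test time.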

System Modules

Personalized Memory Module

Store and retrieve user data to ground the agent

Model or implementation: Embedding-based Retriever + LLM Summarizer
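
A rough sketch of the two memory layers follows, assuming a toy bag-of-words vector in place of a learned embedding model and a generic `llm` callable in place of the summarizer. The `Episode` fields, function names, and example entries are hypothetical and serve only to show the retrieve-then-summarize split.

```python
import math
from collections import Counter
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Episode:
    query: str
    response: str
    timestamp: str


def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; a real system would use a learned encoder.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(episodes: List[Episode], query: str, k: int = 2) -> List[Episode]:
    """Episodic memory: the k past interactions most similar to the query."""
    q = embed(query)
    ranked = sorted(episodes, key=lambda e: cosine(q, embed(e.query)), reverse=True)
    return ranked[:k]


def summarize_profile(episodes: List[Episode], llm: Callable[[str], str]) -> str:
    """Semantic memory: condense the interaction log into a stable profile."""
    log = "\n".join(f"{e.timestamp} | {e.query} -> {e.response}" for e in episodes)
    return llm("Summarize this user's stable preferences:\n" + log)


# Illustrative data, not taken from the paper.
episodes = [
    Episode("tag a space opera movie", "sci-fi", "2024-01-03"),
    Episode("tag a medieval war movie", "based on a book", "2024-02-11"),
    Episode("rate a kitchen blender", "4", "2024-03-20"),
]
top = retrieve(episodes, "tag a new space movie", k=1)
print(top[0].response)  # sci-fi
```

Episodic retrieval returns raw past interactions for the current query, while the semantic profile is computed once over the whole log and reused across queries.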

Alignment Optimizer

Refine the system prompt (persona) using past examples

Model or implementation: LLM (Claude-3.5 Sonnet)

Personalized Action Module

Execute task using tools guided by the optimized persona

Model or implementation: LLM Agent Policy

Novel Architectural Elements

Persona-as-Intermediary: A dynamic system prompt explicitly designed to bridge memory and action modules
Textual Gradient Optimization Loop: An iterative inference-time feedback loop that rewrites the system prompt based on simulated errors on user history
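
One way to picture the persona as an intermediary is as the part of the system prompt that sits between memory output and the tool list. The sketch below is an assumption about how such a prompt could be assembled; the section labels, the `build_system_prompt` helper, and the tool names are hypothetical and are not taken from the paper.

```python
from typing import Dict, List


def build_system_prompt(
    persona: str,
    semantic_profile: str,
    retrieved_episodes: List[str],
    tools: Dict[str, str],  # tool name -> one-line description
) -> str:
    """Compose a system prompt where the persona frames memory and tools."""
    episode_lines = "\n".join(f"- {e}" for e in retrieved_episodes)
    tool_lines = "\n".join(f"- {name}: {desc}" for name, desc in tools.items())
    return (
        f"{persona}\n\n"
        f"User profile:\n{semantic_profile}\n\n"
        f"Relevant past interactions:\n{episode_lines}\n\n"
        f"Available tools:\n{tool_lines}"
    )


prompt = build_system_prompt(
    persona="Prioritize literary connections and book-to-film adaptations.",
    semantic_profile="Frequently tags films adapted from novels.",
    retrieved_episodes=["tag a medieval war movie -> based on a book"],
    tools={
        "search_memory": "look up the user's past interactions",  # hypothetical
        "search_wikipedia": "fetch background on a title",        # hypothetical
    },
)
print(prompt)
```

Because only the `persona` argument is rewritten by the alignment loop, the memory and tool sections can stay fixed while the instructions that govern their use change per user.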

Modeling

Base Model: Claude-3.5 Sonnet (primary), comparisons with Mistral-Small/Large and Claude-3.7

Comparison to Prior Work

vs. RAG/PAG: PersonaAgent actively optimizes *how* to use the retrieved data (via the persona prompt) rather than just feeding data into the context window
vs. MemBank: PersonaAgent incorporates an explicit test-time alignment phase to tailor the *action policy* (persona), whereas MemBank focuses primarily on memory storage/retrieval mechanisms
vs. PEFT (Parameter-Efficient Fine-Tuning) [not cited in paper]: PersonaAgent requires no parameter updates, avoiding the storage and switching costs of maintaining per-user LoRA adapters

Limitations

Relies entirely on textual feedback, potentially missing multi-modal signals (visual/emotional cues)
Privacy risks associated with intensive processing of personalized user history data
Computational cost increases with the number of alignment iterations and simulated interactions at test time

Reproducibility

Prompt templates for alignment, feedback, and persona initialization are provided in Appendices A and B. Evaluation follows standard LaMP protocols. Code URL is not provided. Experiments used Amazon Bedrock.

📊 Experiments & Results

Evaluation Setup

Personalized decision-making tasks using the LaMP benchmark, specifically testing on the 100 users with the most extensive activity histories

Benchmarks:

LaMP-1: Personalized Citation Identification (binary classification)
LaMP-2M: Personalized Movie Tagging (multi-class classification)
LaMP-2N: Personalized News Categorization (multi-class classification)
LaMP-3: Personalized Product Rating (regression on a 1-5 scale)

Metrics:

Accuracy
F1 score
MAE (Mean Absolute Error)
RMSE (Root Mean Squared Error)
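
For reference, the four metrics can be computed as below. This is a plain-Python sketch with example inputs made up for illustration; macro averaging for F1 is an assumption, since this summary does not state which averaging the paper uses.

```python
import math
from typing import Hashable, Sequence


def accuracy(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Fraction of predictions that match the label exactly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Unweighted mean of per-class F1 over all classes seen in either list."""
    classes = set(y_true) | set(y_pred)
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute error: average size of the errors, ignoring sign."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean squared error: penalizes large errors more than MAE does."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# Made-up ratings on a 1-5 scale.
true_ratings = [5, 3, 4, 1]
pred_ratings = [4, 3, 5, 1]
print(mae(true_ratings, pred_ratings))             # 0.5
print(round(rmse(true_ratings, pred_ratings), 4))  # 0.7071
```

Accuracy and F1 apply to the classification tasks (LaMP-1, LaMP-2M, LaMP-2N), while MAE and RMSE apply to the rating task (LaMP-3), where lower values are better.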

Key Results

PersonaAgent consistently outperforms baseline methods (ICL, RAG, ReAct, MemBank) across all four LaMP tasks.
Benchmark	Metric	Baseline	This Paper	Δ
LaMP-1	Accuracy	0.862	0.919	+0.057
LaMP-2M	Accuracy	0.470	0.513	+0.043
LaMP-3	MAE	0.313	0.241	-0.072

Ablation studies demonstrate the critical importance of the test-time alignment mechanism.
Benchmark	Metric	Baseline	This Paper	Δ
LaMP-2M	Accuracy	0.487	0.513	+0.026

Experiment Figures

t-SNE visualization of optimized persona embeddings for different users on LaMP-2M

Scaling effects of alignment batch size, iterations, and retrieved memory count on performance

Main Takeaways

PersonaAgent achieves state-of-the-art results across diverse personalization tasks (classification and regression), outperforming both retrieval-based and agentic baselines
The test-time alignment mechanism is critical; in the LaMP-2M ablation, accuracy is 0.513 for the full system versus 0.487 for the ablated variant, supporting the 'persona optimization' approach
Performance improves with model capability (Claude-3.7 > Claude-3.5 > Mistral) and with alignment batch size, suggesting that the method carries over to stronger backbones and benefits from more feedback per update

📚 Prerequisite Knowledge

Prerequisites

Understanding of Retrieval-Augmented Generation (RAG)
Familiarity with LLM Agents (ReAct framework)
Concept of In-Context Learning (ICL)

Key Terms

Persona: In this paper, a dynamic system prompt that acts as a controller, integrating memory insights to guide agent actions and tool usage

Episodic Memory: A storage of fine-grained, time-stamped user interaction logs (query, response, metadata) retrieved via embedding similarity

Semantic Memory: Abstracted, stable user profiles and preferences summarized from episodic events to provide long-term context

Textual Loss: A natural language critique describing the discrepancy between the agent's simulated response and the ground-truth user response

Textual Gradient: The process of using the textual loss (feedback) to update the system prompt (persona), analogous to parameter updates in numerical gradient descent

LaMP: Language Model Personalization benchmark—a suite of datasets for evaluating how well LLMs can adapt to user-specific contexts

MAE: Mean Absolute Error—a metric measuring the average magnitude of errors in a set of predictions, without considering their direction

ReAct: Reasoning and Acting—a paradigm where LLMs generate reasoning traces and task-specific actions in an interleaved manner