Teaching Language Models to Evolve with Users: Dynamic Profile Modeling for Personalized Alignment

📝 Paper Summary

Personalized Dialogue Systems, Dynamic User Modeling

RLPA trains language models to dynamically infer and adapt to user profiles during dialogue by optimizing a dual-level reward signal (profile accuracy and response alignment) via interaction with a simulated user.

Core Problem

Existing personalized alignment methods (such as prompting, SFT, and DPO) rely on static datasets or templates, so they fail to adapt in 'cold-start' scenarios where user preferences must be inferred dynamically from the interaction itself.

Why it matters:

Static methods cannot handle the evolving nature of long-term human-AI interaction where users reveal preferences gradually.
Offline optimization (SFT/DPO) requires large labeled datasets of user profiles, which are unavailable for new users (cold-start problem).
Prompt-based personalization is superficial and constrained by context window limits, often failing to maintain coherence over long conversations.

Concrete Example: In a cold-start scenario, a new user might implicitly reveal a dietary restriction halfway through a chat. A static SFT model, trained on fixed profile-response pairs, might miss this subtle cue because it lacks an explicit mechanism to update its internal user state, leading to a recommendation that violates the user's needs.

Key Novelty

Reinforcement Learning for Personalized Alignment (RLPA)

Formulates personalization as a multi-turn Markov Decision Process (MDP) where the model explicitly generates and updates a structured user profile estimate at every turn.
Trains the model using a Simulated User (GPT-4o-mini) that holds a hidden profile and reveals it gradually, removing the need for static human-labeled datasets.
Optimizes a dual-reward objective: a 'Profile Reward' for accurately inferring the hidden user attributes and a 'Response Reward' for generating replies that match the inferred profile.

Architecture

The RLPA training framework, illustrating the interaction loop between the Agent and the Simulated User.

Evaluation Highlights

Achieves an average alignment score of 66.86 on the ALOE Vanilla benchmark, outperforming Supervised Fine-Tuning (SFT) by +29.06 points.
Surpasses GPT-4o on the Extended ALOE benchmark (unseen attribute types) with a score of 67.12 vs 66.52, demonstrating superior generalization.
Maintains stable performance over 10 dialogue turns while baselines like SFT and DPO degrade significantly after turn 5 (visualized in Figure 3).

Breakthrough Assessment

8/10

Significant shift from static to dynamic personalization using RL and simulated users. The performance gains over strong baselines (including GPT-4o) in generalization settings are impressive.

⚙️ Technical Details

Problem Definition

Setting: Multi-turn Markov Decision Process (MDP) for personalized dialogue

Inputs: Dialogue history sequence containing user utterances and past model responses

Outputs: An updated structured user profile estimate and a natural language response aligned with that profile

Pipeline Flow

User Simulator (Generates utterance based on hidden profile)
Agent (Generates response + Inferred Profile)
Reward Calculation (Profile Match + Response Quality)
PPO Update
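
The four stages above can be read as a single rollout loop. The following is a minimal sketch under stated assumptions, not the authors' implementation: `Turn`, `run_episode`, and the four callables are hypothetical names, the simulator and agent are passed in as plain functions, and the PPO update itself is left out.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Profile = Dict[str, str]  # hypothetical slot -> value representation


@dataclass
class Turn:
    user_utterance: str
    agent_response: str
    inferred_profile: Profile
    reward: float


def run_episode(
    simulate_user: Callable[[Profile, List[Turn]], str],
    agent_step: Callable[[List[Turn], str], Tuple[str, Profile]],
    profile_reward: Callable[[Profile, Profile], float],
    response_reward: Callable[[str, Profile], float],
    hidden_profile: Profile,
    num_turns: int,
) -> List[Turn]:
    """Roll out one simulated dialogue and record a reward for every turn."""
    history: List[Turn] = []
    for _ in range(num_turns):
        # 1. The simulator speaks, conditioned on a profile the agent never sees.
        utterance = simulate_user(hidden_profile, history)
        # 2. The agent returns a reply together with its current profile estimate.
        response, inferred = agent_step(history, utterance)
        # 3. Dual reward: profile accuracy plus alignment of the reply
        #    with the inferred profile.
        reward = profile_reward(inferred, hidden_profile) + response_reward(
            response, inferred
        )
        history.append(Turn(utterance, response, inferred, reward))
    # 4. The recorded turns would then feed a PPO update (not shown).
    return history
```

Passing the simulator, agent, and reward functions in as arguments keeps the loop testable with cheap stubs before any LLM calls are wired in.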

System Modules

User Simulator

Simulates a user with a specific, hidden profile who reveals preferences gradually over the conversation

Model or implementation: GPT-4o-mini

Dialogue Agent

Infers the user profile from history and generates a personalized response

Model or implementation: Qwen-2.5-3B-Instruct (Fine-tuned)

Profile Reward Model (Reward System)

Evaluates the accuracy of the agent's inferred profile against the simulator's ground truth

Model or implementation: Rule-based (Slot Matching)
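
As an illustration of rule-based slot matching, the sketch below scores a predicted profile as the fraction of ground-truth slots it reproduces. The exact matching rule is not given in this summary, so the case and whitespace normalization, the treatment of extra predicted slots (ignored), and the name `slot_match_reward` are all assumptions.

```python
from typing import Dict


def slot_match_reward(predicted: Dict[str, str], ground_truth: Dict[str, str]) -> float:
    """Fraction of ground-truth slots whose value the prediction reproduces."""
    if not ground_truth:
        return 0.0

    def norm(value: str) -> str:
        # Assumed normalization: lowercase and collapse whitespace.
        return " ".join(value.lower().split())

    hits = sum(
        1
        for slot, value in ground_truth.items()
        if slot in predicted and norm(predicted[slot]) == norm(value)
    )
    return hits / len(ground_truth)
```

For example, a prediction that gets one of two hidden slots right scores 0.5 under this rule.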

Response Reward Model (Reward System)

Evaluates if the response effectively incorporates the inferred profile constraints

Model or implementation: GPT-4o-mini (as Judge)

Novel Architectural Elements

Dual-reward mechanism explicitly decoupling 'profile inference accuracy' from 'response quality'
Integration of explicit profile state generation into the dialogue policy output space

Modeling

Base Model: Qwen-2.5-3B-Instruct

Training Method: Reinforcement Learning (PPO)

Objective Functions:

Purpose: Encourage accurate tracking of user attributes.
Formally: R_profile = Slot-wise matching score between predicted profile and ground truth.

Purpose: Encourage responses that reflect the profile.
Formally: R_response = LLM-based score (0-1) checking alignment, style, and logic.

Purpose: Joint optimization.
Formally: R_total = R_profile + R_response (optimized via PPO).
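
To make the joint objective concrete, here is a small sketch that sums the two rewards per turn and accumulates them into discounted returns over a dialogue. The unweighted sum follows the formula above. Computing returns at the turn level and the value of `gamma` are assumptions, since the summary reports no discount factor, and the function names are hypothetical.

```python
from typing import List


def total_reward(r_profile: float, r_response: float) -> float:
    """Unweighted sum of the two reward terms for a single turn."""
    return r_profile + r_response


def discounted_returns(rewards: List[float], gamma: float) -> List[float]:
    """Return G_t = r_t + gamma * G_{t+1} for every turn t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each return reuses the one computed after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns
```

With three turns that each earn a reward of 1.0 and an illustrative `gamma` of 0.5, the returns are 1.75, 1.5, and 1.0, so earlier turns are credited for what they enable later in the conversation.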

Training Data:

ALOE training set processed into slot-value profiles
Simulated interactions generated on-the-fly during RL

Key Hyperparameters:

user_simulator: GPT-4o-mini
reward_model: GPT-4o-mini
discount_factor_gamma: Not explicitly reported in the paper (symbol gamma used in formulation)
ppo_clip_epsilon: Not explicitly reported in the paper
learning_rate: Not explicitly reported in the paper
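
For reference, the two unreported symbols above enter the standard PPO formulation as follows. This is the textbook clipped surrogate objective and discounted return, given here as background. Whether the paper uses exactly this variant is an assumption, and the values of \epsilon and \gamma remain unknown.

```latex
% Discounted return from turn t, with per-turn reward R_total
G_t = \sum_{k=0}^{T-t} \gamma^{k} \, R_{\text{total},\, t+k}

% Probability ratio between the updated and the old policy
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}

% Clipped surrogate objective, with advantage estimate \hat{A}_t
L^{\text{CLIP}}(\theta) = \mathbb{E}_t \left[ \min\left( r_t(\theta)\,\hat{A}_t,\;
  \operatorname{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right) \hat{A}_t \right) \right]
```

The clip range \epsilon bounds how far a single update can move the policy away from the one that generated the dialogues, which is the "preventing drastic updates" property that motivates PPO.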

Compute: 8 NVIDIA A100 80GB GPUs

Comparison to Prior Work

vs. SFT/DPO: RLPA learns from dynamic interaction rather than static datasets, enabling better cold-start adaptation.
vs. Prompting: RLPA updates parametric knowledge and profile state, avoiding context window limits and superficial personalization.
vs. GPT-4o: RLPA explicitly models the user profile state, leading to higher consistency (N-R2) despite being a much smaller model (3B parameters).

Limitations

Relies on a simulated user (GPT-4o-mini) during training, which may limit the diversity of learned behaviors to the simulator's capabilities.
Requires an explicit profile schema (slot-value) for the Profile Reward, potentially limiting flexibility compared to unstructured profiles.
Inference efficiency comparisons are made against reasoning models (DeepSeek-R1), but RLPA adds the overhead of profile generation at each turn.

Reproducibility

Code: https://github.com/XingYuSSS/RLPA

Code is publicly available at the repository linked above. The paper refers to Appendix E for hyperparameters, but Appendix E is not included in the source text. The user simulator and the response reward judge both depend on GPT-4o-mini, a closed-source model.

📊 Experiments & Results

Evaluation Setup

Multi-turn dialogue with personalized profiles using the ALOE benchmark.

Benchmarks:

ALOE (Vanilla): In-Format Generalization (unseen content, same schema)
ALOE (Extended): Cross-Format Generalization (unseen attribute types and values)

Metrics:

Average Alignment Score (AVG.)
Normalized Improvement Ratio (N-IR)
Normalized Coefficient of Determination (N-R2)
Statistical methodology: Not explicitly reported in the paper
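
The summary does not define N-IR and N-R2. One plausible reading, stated here purely as an assumption, is that per-turn alignment scores are min-max normalized, a least-squares line is fit against the turn index, and N-IR is the slope of that line while N-R2 is its coefficient of determination. The sketch below implements that reading; `normalized_trend` and its `lo`/`hi` score bounds are hypothetical.

```python
from typing import List, Tuple


def normalized_trend(scores: List[float], lo: float, hi: float) -> Tuple[float, float]:
    """Return (slope, r_squared) of a least-squares line over normalized scores.

    scores: one alignment score per dialogue turn, in turn order.
    lo, hi: assumed bounds of the scoring scale, used for min-max normalization.
    """
    if len(scores) < 2 or hi <= lo:
        raise ValueError("need at least two turns and hi > lo")
    y = [(s - lo) / (hi - lo) for s in scores]
    x = list(range(1, len(y) + 1))  # turn indices 1..T
    n = len(y)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    syy = sum((yi - mean_y) ** 2 for yi in y)
    slope = sxy / sxx
    # For a simple linear fit, R^2 equals the squared correlation coefficient.
    r_squared = 0.0 if syy == 0 else (sxy * sxy) / (sxx * syy)
    return slope, r_squared
```

Under this reading, a high slope means alignment improves quickly across turns, and a high R^2 means the improvement is steady rather than erratic.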

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Performance on the Vanilla ALOE benchmark (same profile schema as training). Qwen-RLPA demonstrates massive gains over offline baselines.
ALOE (Vanilla)	Alignment Score	37.80 (SFT)	66.86	+29.06
ALOE (Vanilla)	Alignment Score	38.75	66.86	+28.11
ALOE (Vanilla)	N-R2	0.380	0.855	+0.475
Performance on the Extended ALOE benchmark (new profile schemas/attributes). RLPA shows superior generalization.
ALOE (Extended)	Alignment Score	39.04	67.12	+28.08
ALOE (Extended)	Alignment Score	66.52 (GPT-4o)	67.12	+0.60

Experiment Figures

Turn-wise alignment scores on the Extended ALOE benchmark for RLPA vs. SFT vs. DPO vs. Reminder.

Main Takeaways

RLPA consistently outperforms prompt-based and offline optimization (SFT/DPO) baselines by wide margins (~28-29 points) in personalized alignment.
The method demonstrates strong 'In-Format' and 'Cross-Format' generalization, effectively handling unseen profile attributes.
Temporal analysis (Figure 3) reveals that while SFT/DPO performance degrades after ~5 turns, RLPA's alignment score steadily increases, validating the effectiveness of dynamic profile refinement.
RLPA achieves better profile-response consistency (N-R2) than significantly larger proprietary models like GPT-4o.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (RL) fundamentals, specifically PPO
Markov Decision Processes (MDP)
Dialogue Systems and User Simulation
LLM Alignment (SFT, DPO)

Key Terms

RLPA: Reinforcement Learning for Personalized Alignment—the proposed framework using simulated users and dual rewards.

MDP: Markov Decision Process—a mathematical framework for modeling decision-making where outcomes are partly random and partly under the control of a decision maker.

Cold-start: The scenario where a system must serve a new user without having any prior historical data or profile for them.

SFT: Supervised Fine-Tuning—training a model on a fixed dataset of inputs and target outputs.

DPO: Direct Preference Optimization—a method to align models to preferences without a separate reward model, typically using static pairs of chosen/rejected responses.

ALOE: A benchmark for evaluating personalized dialogue systems, containing dialogues annotated with user profiles.

Slot-value format: A structured representation of information where specific categories (slots) are assigned specific contents (values).

PPO: Proximal Policy Optimization—a policy gradient RL algorithm that optimizes the model while preventing drastic updates that could destabilize training.