Rethinking Personalization in Large Language Models at the Token Level

📝 Paper Summary

Personalized Text Generation
LLM Training Objectives

PerCE improves LLM personalization by identifying and upweighting specific tokens that causally depend on user profile information during training, rather than treating all tokens equally.

Core Problem

Standard training optimizes the average loss over all tokens uniformly, but personalization is sparse—only specific tokens (like stylistic choices or entity preferences) actually depend on the user profile.

Why it matters:

Treating all tokens equally dilutes the model's focus on user-specific needs, limiting personalization performance
Existing methods focus on retrieval or data synthesis but overlook that different tokens contribute to personalization to varying degrees
Standard Cross-Entropy loss fails to prioritize the tokens that actually carry the personalization signal

Concrete Example: In personalized abstract generation, personalization is reflected in stylistic tokens, whereas in dialogue, it appears in tokens encoding individual traits. A standard model treats common stopwords and these crucial personal tokens with equal importance, failing to capture the user's unique voice.

Key Novelty

PerCE (Personalized Cross-Entropy)

Uses a self-contrast metric (PerContrast) to measure the 'Personal Influence Ratio' (PIR) of each token by comparing probabilities with and without the user persona
Applies an Expectation-Maximization (EM) style training loop: first estimate token importance via PIR (E-step), then optimize the model using weighted Cross-Entropy (M-step)

Evaluation Highlights

+68.04% improvement in METEOR score on the Personalized Review Writing task (LongLaMP) with Qwen3-4B compared to standard Cross-Entropy
Achieves average gains of over 10% across all tasks and models on the LongLaMP benchmark
Demonstrates strong cross-task transfer: a Qwen3-4B model trained only on Topic Writing achieves +56.62% gain on Abstract Generation compared to the baseline

Breakthrough Assessment

8/10

Proposes a principled, causally motivated method for token-level personalization that yields large empirical gains (up to +68.04% METEOR on review writing) at the cost of one extra forward pass per training step.

⚙️ Technical Details

Problem Definition

Setting: Conditional text generation where output depends on both a query and a user persona

Inputs: Input query x, User persona p_u

Outputs: Personalized response y

Pipeline Flow

Retriever (Fetches user history)
Generator (LLM generates response)

System Modules

Retriever

Retrieve relevant user-history entries to form the persona

Model or implementation: Contriever

Generator

Generate the personalized response

Model or implementation: Qwen3 or Llama-3.1 (Fine-tuned with PerCE)

Modeling

Base Model: Qwen3-4B, Qwen3-14B, Llama-3.1-8B-Instruct

Training Method: PerCE (Personalized Cross-Entropy Loss)

Objective Functions:

Purpose: Estimate token importance (E-step).

Formally: PIR(y_i) = log P(y_i | y_<i, x, p_u) - log P(y_i | y_<i, x)

Purpose: Optimize model with weighted loss (M-step).

Formally: L_PerCE = - sum_i w(y_i) * log P(y_i | y_<i, x, p_u), where w(y_i) is derived from PIR(y_i)
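The two objectives can be made concrete with a small sketch on toy per-token log-probabilities. The PIR and loss functions follow the formulas above. The mapping from PIR to weights (`pir_to_weights`, with its `floor` and `cap` parameters) is an assumption made for illustration only, since the summary says no more than that w is "derived from PIR".

```python
import math


def personal_influence_ratio(logp_with, logp_without):
    """Per-token PIR: log P(y_i | ..., x, persona) - log P(y_i | ..., x)."""
    return [a - b for a, b in zip(logp_with, logp_without)]


def pir_to_weights(pir, floor=0.0, cap=5.0):
    """ASSUMPTION: clip PIR to [floor, cap], add 1 so every token keeps some
    weight, then rescale so the weights average to 1. Not the paper's rule."""
    raw = [1.0 + min(max(score, floor), cap) for score in pir]
    mean = sum(raw) / len(raw)
    return [value / mean for value in raw]


def perce_loss(logp_with, weights):
    """Weighted cross-entropy: -sum_i w(y_i) * log P(y_i | ..., x, persona).
    In training the weights would be treated as constants (no gradient)."""
    return -sum(w * lp for w, lp in zip(weights, logp_with))


# Toy 3-token target: only the middle token becomes more likely with the persona.
logp_with = [math.log(0.9), math.log(0.6), math.log(0.5)]
logp_without = [math.log(0.9), math.log(0.1), math.log(0.5)]

pir = personal_influence_ratio(logp_with, logp_without)  # E-step
weights = pir_to_weights(pir)
loss = perce_loss(logp_with, weights)  # M-step objective
```

In this toy case the middle token receives the largest weight, while the two persona-independent tokens keep a smaller, nonzero share of the loss.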

Key Hyperparameters:

learning_rate: Tested range 5e-6 to 5e-5
retrieved_documents: 4

Compute: One additional forward pass per training step on the persona-removed context (approximately 7% shorter than the full context)

Comparison to Prior Work

vs. LossCE: PerCE focuses on personalization relevance (causal dependence on persona) rather than general difficulty
vs. EntCE: PerCE uses causal intervention to find personal tokens, whereas entropy just finds uncertain tokens
vs. Standard CE: PerCE applies non-uniform weights to prioritize tokens that actually reflect user traits
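The difference between these weighting signals shows up on two toy tokens. The sketch below assumes the conventional definitions (per-token negative log-likelihood for a loss-based signal, predictive entropy for an entropy-based one); the exact formulations of the LossCE and EntCE baselines are not given in this summary, and the distributions are made up.

```python
import math


def token_loss(p_with, target):
    """Loss-based signal: negative log-likelihood of the target token."""
    return -math.log(p_with[target])


def entropy(p_with):
    """Entropy-based signal: uncertainty of the next-token distribution."""
    return -sum(p * math.log(p) for p in p_with if p > 0)


def pir(p_with, p_without, target):
    """Persona-based signal: log-probability shift caused by the persona."""
    return math.log(p_with[target]) - math.log(p_without[target])


# Token A: hard to predict, but the persona does not change the distribution.
a_with, a_without, a_target = [0.2, 0.4, 0.4], [0.2, 0.4, 0.4], 0
# Token B: easier to predict, and only because the persona is present.
b_with, b_without, b_target = [0.7, 0.2, 0.1], [0.1, 0.6, 0.3], 0

scores = {
    "A": {
        "loss": token_loss(a_with, a_target),
        "entropy": entropy(a_with),
        "pir": pir(a_with, a_without, a_target),
    },
    "B": {
        "loss": token_loss(b_with, b_target),
        "entropy": entropy(b_with),
        "pir": pir(b_with, b_without, b_target),
    },
}
```

Loss and entropy both rank token A above token B, whereas PIR is zero for A and positive for B, so only the persona-based signal singles out the token that depends on the user.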

Limitations

Requires an additional forward pass during training (though on shorter context)
Abstract generation task showed smaller gains compared to open-ended writing tasks due to stricter constraints
Relies on the availability of user history/persona in the context window

Reproducibility

Prompt templates and hyperparameters are provided in the paper's appendix. Code availability is not stated in the text summarized here.

📊 Experiments & Results

Evaluation Setup

RAG-based personalized generation where user history is retrieved and appended to the prompt.

Benchmarks:

LongLaMP (Personalized Text Generation: Abstracts, Reviews, Topics)
ALOE (Personalized Multi-turn Dialogue)
LaMP (Short-text Personalization)

Metrics:

ROUGE-L
METEOR
LLM-as-a-Judge (1-5 scale for ALOE)
Statistical methodology: Not explicitly reported in the paper
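For readers unfamiliar with ROUGE-L, the metric scores a candidate by its longest common subsequence (LCS) with the reference. The sketch below is a minimal version under stated assumptions: whitespace tokenization, lowercasing, no stemming, and the balanced F1 variant. The summary does not say which ROUGE implementation or settings the paper used, so scores from this sketch are not comparable to the reported numbers.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1 from LCS-based precision and recall (ASSUMPTION: beta = 1)."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


score = rouge_l_f1("the cat sat on the mat", "the cat is on the mat")
```

Because LCS rewards in-order overlap without requiring contiguity, the metric tolerates inserted or substituted words but still penalizes reordering.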

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
LongLaMP (PAG)	ROUGE-L	0.2261	0.3728	+0.1467
LongLaMP (PRW)	Score (likely ROUGE-L/METEOR)	0.1898	0.2211	+0.0313
LongLaMP (PTW)	Score (likely ROUGE-L/METEOR)	0.1665	0.1955	+0.0290

Robustness analysis shows PerCE maintains high performance even when learning rate is increased, unlike Standard CE which collapses.
Cross-task transfer experiments demonstrate that PerCE learns generalized personalization capabilities that transfer to unseen tasks better than baselines trained on the target task.
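The Δ column in the Key Results table is an absolute score difference, while the headline figures elsewhere in this summary (for example +68.04%) are relative improvements on a particular metric, so the two are not directly comparable. The short sketch below derives both quantities from the three table rows; the values are copied from the table, and the helper name `gains` is illustrative.

```python
# (benchmark, baseline score, PerCE score) copied from the Key Results table
rows = [
    ("LongLaMP (PAG)", 0.2261, 0.3728),
    ("LongLaMP (PRW)", 0.1898, 0.2211),
    ("LongLaMP (PTW)", 0.1665, 0.1955),
]


def gains(baseline: float, ours: float) -> tuple[float, float]:
    """Return (absolute gain, relative gain as a fraction of the baseline)."""
    absolute = ours - baseline
    return absolute, absolute / baseline


report = {name: gains(baseline, ours) for name, baseline, ours in rows}

for name, (abs_gain, rel_gain) in report.items():
    print(f"{name}: {abs_gain:+.4f} absolute, {rel_gain:+.1%} relative")
```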

Experiment Figures

Illustration of token-level personalization degrees across different tasks.

Main Takeaways

PerCE achieves substantial gains (up to 68.04%) on open-ended personalization tasks like Review Writing, confirming that weighting personal tokens is highly effective.
The method demonstrates superior stability across hyperparameters; while CE collapses at higher learning rates, PerCE remains robust.
PerCE enables strong cross-task transfer, with out-of-domain PerCE models often outperforming in-domain Standard CE models, suggesting it captures fundamental personalization patterns rather than just dataset statistics.

📚 Prerequisite Knowledge

Prerequisites

Cross-Entropy Loss
Causal Inference (Intervention and Counterfactuals)
Expectation-Maximization (EM) Algorithm
Retrieval-Augmented Generation (RAG)

Key Terms

PIR: Personal Influence Ratio—a metric measuring the difference in log-probability of a token when generated with the user persona versus without it

PerCE: Personalized Cross-Entropy—a loss function that upweights tokens with high PIR scores during training

PerContrast: The self-contrast method used to calculate PIR by performing causal intervention (masking the persona) on the input context

LongLaMP: A benchmark dataset for personalized text generation containing tasks like abstract generation, review writing, and topic writing

ALOE: A benchmark for assessing alignment to user-specific preferences in multi-turn dialogue

Contriever: A dense retrieval model used to fetch relevant user history for the prompt

Causal Effect: The difference in an outcome (token probability) caused by changing a treatment variable (presence/absence of persona)