Do LLMs Recognize Your Preferences? Evaluating Personalized Preference Following in LLMs

📝 Paper Summary

Benchmark datasets Conversational personalization

PrefEval is a benchmark of 3,000 preference-query pairs revealing that current LLMs fail to proactively follow user preferences in long conversations (accuracy <10%), though fine-tuning can mitigate this.

Core Problem

State-of-the-art LLMs struggle to proactively infer, memorize, and adhere to user preferences scattered across long conversational histories, often defaulting to generic responses.

Why it matters:

Scalability: It is more efficient to have one adaptable model for millions of users than separate fine-tuned models for each
User Satisfaction: Repeatedly stating preferences or receiving irrelevant recommendations (e.g., meat dishes for a vegetarian) degrades the user experience
Evaluation Gap: Existing benchmarks test retrieval or general reasoning but fail to measure 'proactive personalization' where preferences are implicit or distant in context

Concrete Example: If a user explicitly says 'I don't like jazz' early in a conversation, and 50 turns later asks for travel recommendations in New Orleans, the chatbot should proactively filter out jazz clubs. Current models fail to make this connection and suggest popular jazz venues.

Key Novelty

PrefEval Benchmark & Evaluation Protocol

Constructs 3,000 preference-query pairs across 20 topics, buried within realistic multi-session conversation 'noise' (up to 100k tokens)
Evaluates three levels of preference complexity: Explicit statements, Implicit choice-based dialogue, and Implicit persona-driven dialogue
Defines specific failure modes for personalization: Preference-Unaware, Hallucination, Inconsistent, and Unhelpful violations

Architecture

The conceptual framework of the PrefEval benchmark.

Evaluation Highlights

Preference following accuracy falls below 10% for most models in zero-shot settings with only 10 turns (~3k tokens) of context
Fine-tuning on the PrefEval dataset significantly improves preference following capabilities compared to zero-shot baselines
Counter-intuitively, multiple stated preferences (even conflicting ones) in a conversation lead to improved adherence, likely due to reinforced attention

Breakthrough Assessment

8/10

Identifies a critical failure mode in 'solved' long-context LLMs (personalization) and provides a comprehensive benchmark (PrefEval) to measure it. The <10% baseline performance highlights a significant gap.

⚙️ Technical Details

Problem Definition

Setting: Multi-session conversational preference following

Inputs: Conversation history C containing user preference p, unrelated distractors, and a final query q

Outputs: Response b_m that adheres to preference p

Pipeline Flow

Input Construction (History + Distractors + Query)
LLM Inference (Response Generation)
Evaluation (LLM-as-a-judge or Classification)

System Modules

Input Construction

Combines a specific preference-query pair with multi-session distractors from LMSYS-Chat-1M

Model or implementation: N/A (Data assembly)

LLM Inference

Generates the response to the final query given the long context

Model or implementation: Various (e.g., GPT-4, Claude 3, Llama-3-70B)

Evaluator

Determines if the response followed the hidden preference

Model or implementation: Claude 3 Sonnet

Novel Architectural Elements

Evaluation framework specifically designing 'distractor' sessions to test robustness of preference retention over long contexts
Taxonomy of 3 preference forms: Explicit, Implicit Choice-Based, Implicit Persona-Driven

Modeling

Base Model: Evaluated: Claude 3 (Haiku/Sonnet), Mistral 7B/8x7B, Llama 3 (8B/70B), GPT-4

Training Method: Supervised Fine-Tuning (SFT) on PrefEval dataset

Adaptation: Fine-tuning (implied, specific parameter details not in snippet)

Training Data:

PrefEval dataset (3,000 pairs)
Split into train/test sets

Compute: Not reported in the paper

Comparison to Prior Work

vs. RULER/InfiniteBench: PrefEval focuses on 'preference adherence' (personalization) rather than just factual retrieval; retrieval is necessary but not sufficient
vs. Role-playing benchmarks [not cited in paper]: Focuses on adhering to *user* constraints rather than adopting a specific *agent* persona

Limitations

Evaluation relies on LLM-as-a-judge (Claude 3 Sonnet), which incurs computational cost and potential bias
Performance drops significantly with context length, indicating current architectures are not solved for this task
Requires explicit prompting or fine-tuning to achieve acceptable performance, as zero-shot capabilities are very low

Reproducibility

Code: https://github.com/amazon-science/PrefEval

publicly available (https://prefeval.github.io/ and https://github.com/amazon-science/PrefEval). Includes code and dataset. Evaluation methodology using Claude 3 Sonnet as a judge is described with a 5% error rate validation.

📊 Experiments & Results

Evaluation Setup

Long-context conversational generation and classification

Benchmarks:

PrefEval (Personalized Response Generation & Selection) [New]

Metrics:

Preference Following Accuracy (Generation)
Selection Accuracy (Classification)
Statistical methodology: Validated LLM-based evaluation with human agreement (5% error rate on 200 samples)

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
PrefEval (Generation)	Accuracy (Zero-shot)	100	10	-90
PrefEval (Evaluation Protocol)	Human-Model Disagreement Rate	0	5	+5

Experiment Figures

Topic distribution of the 3,000 preference-query pairs.

Main Takeaways

State-of-the-art LLMs generally lack the ability to proactively recall and apply user preferences in zero-shot settings (<10% accuracy at 10 turns).
Fine-tuning on the PrefEval dataset is an effective method to improve preference following, generalizing well to longer contexts.
Implicit preferences (revealed through dialogue choices or persona) are significantly harder for models to track than explicit statements.
Counter-intuitively, conflicting or multiple preferences in history can improve performance, possibly by acting as reinforced attention mechanisms.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Retrieval-Augmented Generation (RAG)
Familiarity with Long-Context LLM capabilities
Basic knowledge of 'LLM-as-a-judge' evaluation

Key Terms

PrefEval: The benchmark introduced in this paper, consisting of 3,000 curated preference-query pairs for evaluating personalization

Proactive Personalization: The ability of an LLM to voluntarily recall and apply past user preferences without being explicitly reminded in the current prompt

Implicit Preference: User preferences that are not stated as facts (e.g., 'I like X') but inferred from dialogue choices or persona behavior

LLM-as-a-judge: Using a strong LLM (like Claude 3 Sonnet) to evaluate the output of other models based on specific criteria

CoT: Chain-of-Thought—a prompting technique where the model explains its reasoning steps before giving a final answer

RAG: Retrieval-Augmented Generation—fetching relevant context from history to aid generation

Zero-shot: Asking the model to perform the task without providing any examples or special reminders in the prompt