Pearl: Personalizing Large Language Model Writing Assistants with Generation-Calibrated Retrievers

📝 Paper Summary

Conversational personalization
RAG-based personalization

Pearl personalizes writing assistants by training a retriever using a difference-of-likelihoods metric to select only those historical documents that empirically improve the generation quality of specific user requests.

Core Problem

LLM writing assistants generate generic text because they lack access to user-specific style and values; standard retrieval methods fail because they assume all user history is relevant, even when requests diverge from past behavior.

Why it matters:

Personalized fine-tuning is difficult to scale and serve for millions of individual users
Users have limited historical data compared to generic retrieval corpora, making high-precision retrieval critical
Standard retrieval models (bi-encoders) are often trained on relevance, not on whether the document actually helps the LLM generate better text

Concrete Example: A user asks an assistant to draft a work email. A standard retriever might fetch a past email simply because it shares keywords, even if its tone is wrong. Pearl's retriever, trained on generation utility, would score that email low because it is unlikely to increase the likelihood of the desired target text. Conversely, it could select a document with lower keyword overlap that provides the right stylistic template.

Key Novelty

Generation-Calibrated Retrieval for Personalization

Selects training data by using an auxiliary model to find specific request-document pairs where the document significantly increases the likelihood of the ground-truth target text compared to the request alone
Uses a scale-calibrating loss function with an 'anchor' value to ensure the retriever's scores are proportional to the actual downstream generation quality, preventing score skew common in cross-encoders

Breakthrough Assessment

7/10

Proposes a logical, theoretically grounded method for aligning retrieval with generation utility (calibration). While the 'generation-aware' retrieval concept exists, applying it to personalization with specific calibration objectives is a strong contribution.

⚙️ Technical Details

Problem Definition

Setting: Request-conditional personalized text generation using historic user-authored documents

Inputs: User request q_u and a set of historical documents D_u

Outputs: Personalized target text t_u

Pipeline Flow

Retriever (Scores historical documents against request)
Prompt Construction (Combines request + top-k documents)
Generator (Produces personalized text)
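
The three stages above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the function names, the prompt template, and the word-overlap scorer are all hypothetical stand-ins (the retriever described here is a trained cross-encoder), and the generator call is omitted because it depends on the API in use.

```python
from typing import Callable, List, Tuple


def retrieve_top_k(request: str, history: List[str],
                   score_fn: Callable[[str, str], float],
                   k: int = 2) -> List[Tuple[float, str]]:
    """Score every historical document against the request and keep the k best."""
    scored = [(score_fn(request, doc), doc) for doc in history]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def build_prompt(request: str, docs: List[str]) -> str:
    """Hypothetical template: retrieved documents first, then the request."""
    examples = "\n\n".join(
        f"Past document {i + 1}:\n{doc}" for i, doc in enumerate(docs)
    )
    return f"{examples}\n\nRequest:\n{request}\n\nResponse:"


def toy_overlap_score(request: str, doc: str) -> float:
    """Stand-in scorer (word overlap); NOT the trained retriever from the paper."""
    r, d = set(request.lower().split()), set(doc.lower().split())
    return len(r & d) / max(len(r), 1)


history = [
    "notes on the quarterly budget review",
    "weekend hiking trip plan",
    "email about budget approval for the team",
]
request = "draft an email about the budget"

top = retrieve_top_k(request, history, toy_overlap_score, k=2)
prompt = build_prompt(request, [doc for _, doc in top])
# The prompt would then be sent to the generator (e.g. an LLM API call).
```

Passing the scorer in as `score_fn` keeps the pipeline unchanged when the toy scorer is swapped for a trained model.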

System Modules

Retriever

Score user's historical documents based on their utility for the current request

Model or implementation: MPNet (110M parameters, Cross-Encoder)

Generator

Generate the final personalized text

Model or implementation: davinci-003 or gpt-3.5-turbo

Modeling

Base Model: MPNet (Retriever backbone), FlanT5-XL (Auxiliary scorer)

Training Method: Knowledge Distillation from Auxiliary Model to Retriever

Objective Functions:

Purpose: Select training data that benefits from personalization.

Formally: Score = P_aux(t|q,d) - P_aux(t|q). Keep pairs with high positive scores.

Purpose: Train retriever to match auxiliary scores while maintaining calibration.

Formally: Scale-calibrated KL divergence loss using an anchor value y0 (median positive score) added to both target and predicted logits.
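
A small sketch may help show why an anchor matters. It assumes one possible reading of the description above: the anchor y0 is appended as an extra logit to both the target and predicted score lists before the softmax. The helper names are hypothetical and the scores are made-up numbers. A plain softmax is invariant to shifting all scores by a constant, so a plain KL loss cannot penalize a retriever whose scores are on the wrong scale; with a fixed anchor in both lists, it can.

```python
import math
import statistics
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p: List[float], q: List[float]) -> float:
    """KL(p || q) for two discrete distributions with full support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def plain_kl(target_scores: List[float], predicted_scores: List[float]) -> float:
    """KL between softmax distributions, with no anchor."""
    return kl(softmax(target_scores), softmax(predicted_scores))


def anchored_kl(target_scores: List[float], predicted_scores: List[float],
                y0: float) -> float:
    """Assumption: the anchor y0 is appended as one extra logit to both lists."""
    p = softmax(list(target_scores) + [y0])
    q = softmax(list(predicted_scores) + [y0])
    return kl(p, q)


# Made-up auxiliary-model scores for three candidate documents.
targets = [2.0, 0.5, -1.0]
# Anchor: median of the positive scores, as described above.
y0 = statistics.median([s for s in targets if s > 0])

# A retriever whose scores have the right ordering but the wrong scale.
shifted = [s + 3.0 for s in targets]

loss_exact = anchored_kl(targets, targets, y0)      # predictions match targets
loss_shifted = anchored_kl(targets, shifted, y0)    # anchor exposes the shift
loss_shifted_plain = plain_kl(targets, shifted)     # shift is invisible here
```

In this toy setup the anchored loss is zero only when the predicted scores match the targets in absolute value, not merely in relative order.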

Training Data:

Partition user history into Candidate Set (past) and Target Set (future)
Use FlanT5-XL to score all Candidate-Target pairs
Select top-T requests and top-P documents per request based on likelihood difference
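
The selection steps above can be sketched as follows. The likelihood values are invented for illustration (in the described setup they would come from the auxiliary model), the function names are hypothetical, and ranking requests by their best document's gain is an assumption; the summary does not state how requests are ranked.

```python
from typing import Dict, List, Tuple

# Made-up auxiliary likelihoods: request -> (P(t|q), {doc_id: P(t|q,d)}).
aux_likelihoods: Dict[str, Tuple[float, Dict[str, float]]] = {
    "req_a": (0.10, {"d1": 0.40, "d2": 0.12, "d3": 0.05}),
    "req_b": (0.30, {"d1": 0.31, "d2": 0.29}),
    "req_c": (0.20, {"d4": 0.35, "d5": 0.45}),
}


def likelihood_gains(
    likelihoods: Dict[str, Tuple[float, Dict[str, float]]]
) -> Dict[str, Dict[str, float]]:
    """Difference of likelihoods: P(t|q,d) - P(t|q) for every request-document pair."""
    return {
        req: {doc: p_with - p_without for doc, p_with in docs.items()}
        for req, (p_without, docs) in likelihoods.items()
    }


def select_training_pairs(
    gains: Dict[str, Dict[str, float]], top_t: int, top_p: int
) -> Dict[str, List[Tuple[str, float]]]:
    """Keep the top_t requests, then the top_p documents for each of them.

    Assumption: a request is ranked by the gain of its best document.
    Every request is expected to have at least one candidate document.
    """
    ranked = sorted(gains, key=lambda r: max(gains[r].values()), reverse=True)
    selected = {}
    for req in ranked[:top_t]:
        docs = sorted(gains[req].items(), key=lambda kv: kv[1], reverse=True)
        selected[req] = docs[:top_p]
    return selected


gains = likelihood_gains(aux_likelihoods)
selected = select_training_pairs(gains, top_t=2, top_p=2)
```

Here `req_b` is dropped because no document changes its target likelihood by much, which mirrors the idea of filtering out requests that do not benefit from retrieval.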

Key Hyperparameters:

y0_anchor: Median value of scores for positive candidate documents
auxiliary_model: FlanT5-XL (3B parameters)

Compute: Not reported in the paper

Comparison to Prior Work

vs. Rubin et al.: Pearl explicitly selects *which* requests to train on (filtering out those that don't benefit from retrieval), whereas prior work assumes all training examples are valid.
vs. Standard Cross-Encoders: Pearl uses a scale-calibrating objective to prevent score skewing at the extremes, ensuring scores map linearly to generation quality.
vs. UPR [not cited in paper]: UPR reranks documents using the LLM's likelihood directly at inference time; Pearl distills this into a retriever for efficiency.

Limitations

Relies on an auxiliary model (FlanT5) being a good proxy for the larger generator (GPT-3.5/Davinci)
Requires historical data for every user; cold-start problem for new users is likely
Cross-encoder retrieval is computationally more expensive at inference time than bi-encoder retrieval (though performed over small user history)

Reproducibility

Not provided. The paper mentions using privacy-compliant enterprise API endpoints and private workplace datasets, suggesting code/data are proprietary. Public Reddit data is mentioned but no URL is provided in the snippet.

📊 Experiments & Results

Evaluation Setup

Personalized long-form text generation

Benchmarks:

Workplace Communications (Private dataset of emails/comms)
Reddit (Public social media comments)

Metrics:

LLM-as-a-judge (Personalized)
Intrinsic metrics (implied, e.g., Likelihood/Perplexity)
Extrinsic metrics (implied)
Statistical methodology: Not explicitly reported in the paper

Main Takeaways

The paper asserts that Pearl consistently matches or outperforms strong baselines on both private workplace and public social media datasets (specific numbers not in text).
The 'Difference of Likelihoods' data selection strategy effectively filters out requests where retrieval would be noisy or unhelpful.
Scale calibration allows the retriever's scores to be used as a predictor of performance, enabling the system to detect low-quality generations or retrieval failures.
The approach validates that personalizing 'black-box' LLMs (via API) is viable through calibrated retrieval rather than fine-tuning.
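
If retriever scores are calibrated to generation quality, one simple way to act on them is a threshold check before generation. The sketch below is purely illustrative: the function name, the threshold value, and the idea of flagging the request are assumptions, not details reported in the paper.

```python
from typing import List


def flag_weak_retrieval(scores: List[float], threshold: float) -> bool:
    """Return True when no candidate document's score reaches the threshold."""
    return not scores or max(scores) < threshold


# Hypothetical calibrated scores for two requests, plus a user with no history.
weak = flag_weak_retrieval([0.1, 0.2], threshold=0.5)
strong = flag_weak_retrieval([0.1, 0.8], threshold=0.5)
empty = flag_weak_retrieval([], threshold=0.5)
```

A flagged request could, for example, be routed to a non-personalized prompt; a threshold like this is only meaningful if the scores are on a consistent scale across requests.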

📚 Prerequisite Knowledge

Prerequisites

Retrieval-Augmented Generation (RAG)
Language Modeling (likelihoods)
Knowledge Distillation
Cross-Encoder vs. Bi-Encoder architectures

Key Terms

generation-calibrated: A property of a retriever where the score assigned to a document is proportional to how much that document improves the quality of the downstream LLM generation

cross-encoder: A retrieval architecture that processes the query and document simultaneously (concatenated) to output a relevance score, typically more accurate but computationally heavier than bi-encoders

KL divergence: Kullback-Leibler divergence, an asymmetric measure (not a true distance metric) of how one probability distribution differs from a second, reference probability distribution

MPNet: Masked and Permuted Pre-training for Language Understanding—a transformer-based model used here as the backbone for the retriever

FlanT5: A T5 (Text-to-Text Transfer Transformer) model fine-tuned on instructions, used here as the auxiliary model to estimate generation likelihoods

LLM: Large Language Model, a neural network with a very large number of parameters, trained on large text corpora to generate human-like text

RAG: Retrieval-Augmented Generation—a technique where an LLM is provided with external documents to improve its responses

bi-encoder: A retrieval architecture where queries and documents are encoded separately into vectors, allowing fast similarity search but often with lower accuracy than cross-encoders