Few-shot Personalization of LLMs with Mis-aligned Responses

📝 Paper Summary

User-profile based personalization
Prompt Optimization

Fermi personalizes LLMs by iteratively optimizing prompts using feedback from the model's own errors (mis-aligned responses) and dynamically selecting the best prompt for each query based on context.

Core Problem

Existing LLM personalization methods either rely on manual, suboptimal prompt engineering or require fine-tuning on shared user data, which raises privacy concerns.

Why it matters:

LLMs often exhibit biases toward certain groups and fail to adapt to diverse individual needs without specific steering
Manual prompt engineering is costly and fails to explore the search space effectively for each unique user
Learning-based approaches (fine-tuning) often assume access to other users' data, violating privacy constraints

Concrete Example: When a user asks a subjective question, a standard LLM might provide a generic or biased answer. Simple instructions like 'act as X' often fail. Fermi identifies these 'mis-aligned' responses (errors) on past questions and uses them to write a prompt that explicitly corrects the model's behavior for that user.

Key Novelty

Few-shot Personalization with Mis-aligned Responses (Fermi)

Optimizes prompts by feeding the optimizer LLM not just scores, but also specific examples of 'mis-aligned responses' (where the model answered incorrectly), allowing it to diagnose and fix specific failure modes
Uses 'Retrieval-or-Prompt' during inference: instead of using one fixed prompt, it retrieves relevant past user opinions for the current query and selects the prompt that performed best on those specific similar examples

Architecture

Overview of the Fermi framework, illustrating both the iterative optimization process (left) and the Retrieval-or-Prompt inference method (right).

Evaluation Highlights

+6.8% average accuracy improvement on the first multiple-choice QA dataset compared to state-of-the-art prompt optimization baselines
+4.1% average accuracy improvement on the second multiple-choice QA dataset compared to baselines
Demonstrates that prompts personalized via one LLM transfer effectively to other LLMs (both API-based and open-source)

Breakthrough Assessment

7/10

Introduces a clever error-driven feedback loop for prompt optimization and a dynamic inference strategy. The gains are significant, though the core mechanism is an evolution of OPRO-style optimization.

⚙️ Technical Details

Problem Definition

Setting: Few-shot personalization where the goal is to predict a user's answer to a question given their profile and a small set of past opinions

Inputs: Test question q, User Profile U_pro, User Opinions U_opi (set of N QA pairs)

Outputs: Predicted answer a

Pipeline Flow

Retrieval: Find relevant past opinions -> Selection: Choose best prompt -> Generation: Predict answer
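The three stages above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the bag-of-words `embed` function is a toy stand-in for a real sentence encoder, the `correct` table is assumed to be cached from the offline optimization stage, and `llm` is a hypothetical callable that takes the full prompt text and returns an answer.

```python
from collections import Counter
from math import sqrt


def embed(text):
    """Toy bag-of-words embedding (assumption: stands in for a sentence encoder)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, opinions, k):
    """Return indices of the k past questions most similar to the query."""
    q_vec = embed(query)
    ranked = sorted(
        range(len(opinions)),
        key=lambda i: cosine(q_vec, embed(opinions[i][0])),
        reverse=True,
    )
    return ranked[:k]


def select_prompt(prompts, correct, retrieved):
    """Pick the prompt with the most correct answers on the retrieved opinions.

    correct[p][i] is True if prompt p reproduced the user's answer to past
    opinion i (hypothetical cache, not a structure named in the summary).
    """
    return max(range(len(prompts)), key=lambda p: sum(correct[p][i] for i in retrieved))


def answer(query, profile, prompts, opinions, correct, llm, k=2):
    """Retrieval -> Selection -> Generation for a single test question."""
    retrieved = retrieve(query, opinions, k)
    best = select_prompt(prompts, correct, retrieved)
    return llm(f"{prompts[best]}\nProfile: {profile}\nQuestion: {query}")
```

The point of the design is that the prompt is chosen per query: a prompt that is weaker on the full history can still win if it did well on the few past questions closest to the current one.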

System Modules

Retriever (Retrieval & Selection)

Identify past user questions most relevant to the current test query to contextualize prompt selection

Model or implementation: Sentence Encoder (e.g., SBERT)

Prompt Selector (Retrieval & Selection)

Select the specific prompt from the pre-optimized set that maximizes performance on the retrieved subset of opinions

Model or implementation: Scoring Function (Rule-based)

Generator

Generate the personalized answer using the selected prompt and user profile

Model or implementation: Target LLM (e.g., Llama-2, GPT-3.5)

Novel Architectural Elements

Dynamic inference selection (Retrieval-or-Prompt) that couples retrieval of history with prompt selection, rather than using a single static prompt for all queries
Optimization pipeline (offline) that explicitly formats 'mis-aligned responses' (errors) into the context window of the optimizer LLM
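To make the second element concrete, the sketch below shows one way a candidate prompt could be scored on the user's past opinions while its mis-aligned responses are collected and formatted for the optimizer LLM. The function names, the text layout, and the `llm(prompt, question)` callable are all assumptions for illustration; the paper's actual templates are in its appendix.

```python
def evaluate_prompt(prompt, opinions, llm):
    """Score a prompt on past opinions and collect its mis-aligned responses.

    opinions: non-empty list of (question, user_answer) pairs (U_opi).
    llm: hypothetical callable (prompt, question) -> predicted answer.
    """
    if not opinions:
        raise ValueError("at least one past opinion is required")
    misaligned = []
    for question, user_answer in opinions:
        predicted = llm(prompt, question)
        if predicted != user_answer:
            # Keep the full context of the error, not just the fact that it happened.
            misaligned.append((question, predicted, user_answer))
    score = 1.0 - len(misaligned) / len(opinions)
    return score, misaligned


def format_for_optimizer(prompt, score, misaligned, max_examples=3):
    """Render prompt, score, and a few errors as text (layout is an assumption)."""
    lines = [f"Prompt: {prompt}", f"Score: {score:.2f}", "Mis-aligned responses:"]
    for question, predicted, user_answer in misaligned[:max_examples]:
        lines.append(f"- Q: {question} | model: {predicted} | user: {user_answer}")
    return "\n".join(lines)
```

An OPRO-style optimizer would see only the first two lines of this text; the extra lines listing which questions were answered wrong, and how, are the additional signal this element describes.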

Modeling

Base Model: Evaluated on multiple models including Llama-2-chat (7B/13B), GPT-3.5-turbo, GPT-4

Compute: Not reported in the paper

Comparison to Prior Work

vs. OPRO: Fermi includes the context of 'mis-aligned responses' (errors) in the optimizer's input, whereas OPRO only uses prompt strings and scores
vs. PE2: Fermi optimizes the instruction prompt itself rather than just selecting few-shot examples
vs. Vanilla Prompting: Fermi automates the search for user-specific prompts instead of using fixed templates
vs. Gradient-based Tuning (Soft Prompts): Fermi is gradient-free and works on black-box API models

Limitations

Relies on the availability of a powerful Optimizer LLM (e.g., GPT-4) to generate high-quality prompts
Inference cost is slightly increased due to the retrieval and selection step (calculating scores on subsets of history)
Requires a minimum number of user historical opinions (N) to drive the optimization process effectively

Reproducibility

Code: https://github.com/bbuing9/Fermi

The paper describes the prompt templates (Appendix B) and the algorithm (Algorithm 1) in detail.

📊 Experiments & Results

Evaluation Setup

Personalized Question Answering (QA) where models must predict a user's answer selection

Benchmarks:

Multiple-choice QA datasets (Personalized QA)

Metrics:

Accuracy
Statistical methodology: Not explicitly reported in the paper
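For reference, accuracy in this setting is presumably the standard fraction of test questions on which the predicted answer matches the user's actual answer. The notation below (test set $D$, prediction $\hat{a}_i$, user answer $a_i$) is assumed for illustration and does not come from the summary.

```latex
\mathrm{Acc} = \frac{1}{|D|} \sum_{i=1}^{|D|} \mathbb{1}\left[\hat{a}_i = a_i\right]
```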

Experiment Figures

An example of the input constructed for the Optimizer LLM, showing how prompts, scores, and mis-aligned contexts are formatted.

Main Takeaways

Fermi outperforms heuristic baselines and previous prompt optimization methods (OPRO) on the two multiple-choice QA datasets (+6.8% and +4.1% average accuracy), which supports the value of error-driven feedback.
The 'Retrieval-or-Prompt' inference strategy is crucial; selecting prompts dynamically based on query context yields better personalization than a single static optimized prompt.
Mis-aligned responses (errors) carry signal that scores alone cannot capture, such as the types and patterns of wrong predictions, which helps the optimizer LLM navigate the prompt space more effectively.
Personalized prompts generated by Fermi transfer well across different models, allowing prompts optimized on a stronger model to be used on smaller/open-source models.

📚 Prerequisite Knowledge

Prerequisites

Prompt Engineering / Prompt Optimization
In-context Learning
Retrieval-Augmented Generation (RAG)

Key Terms

mis-aligned responses: Model outputs that contradict or fail to match the user's provided ground-truth opinions/answers

OPRO: Optimization by PROmpting—a method where an LLM is used as an optimizer to generate better prompts based on past performance scores

U_pro: User profile information (e.g., demographics, ideology)

U_opi: Set of few-shot previous opinions (QA pairs) provided by the user

Retrieval-or-Prompt: The inference strategy proposed by Fermi that selects a prompt based on the similarity of the test query to past examples

optimization memory: A buffer storing tuples of (prompt, score, context) used by the optimizer LLM to generate improved prompts
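As a rough illustration of the optimization memory, the sketch below keeps a bounded buffer of (prompt, score, context) entries and renders it as text for the optimizer LLM. The class names, the capacity limit, and the ascending-score ordering in `render` are assumptions made for this example, not details taken from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryEntry:
    prompt: str
    score: float
    context: str  # e.g. formatted mis-aligned responses for this prompt


@dataclass
class OptimizationMemory:
    capacity: int = 5  # hypothetical limit on stored entries
    entries: list = field(default_factory=list)

    def add(self, prompt, score, context):
        """Insert an entry, then keep only the highest-scoring ones."""
        self.entries.append(MemoryEntry(prompt, score, context))
        self.entries.sort(key=lambda e: e.score, reverse=True)
        del self.entries[self.capacity:]

    def render(self):
        """Render entries in ascending score order, so the best prompt comes last."""
        ordered = sorted(self.entries, key=lambda e: e.score)
        return "\n\n".join(
            f"Prompt: {e.prompt}\nScore: {e.score:.2f}\nContext: {e.context}"
            for e in ordered
        )
```

In use, each optimization round would add the newly evaluated prompts and pass the `render()` output to the optimizer LLM as part of its input.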