User-profile based personalization
Prompt Optimization
Fermi personalizes LLMs by iteratively optimizing prompts using feedback from the model's own errors (mis-aligned responses) and dynamically selecting the best prompt for each query based on context.
Core Problem
Existing LLM personalization methods either rely on manual, suboptimal prompt engineering or require fine-tuning on shared user data, which raises privacy concerns.
Why it matters:
LLMs often exhibit biases toward certain groups and fail to adapt to diverse individual needs without specific steering
Manual prompt engineering is costly and fails to explore the search space effectively for each unique user
Learning-based approaches (fine-tuning) often assume access to other users' data, violating privacy constraints
Concrete Example: When a user asks a subjective question, a standard LLM might give a generic or biased answer, and simple instructions like 'act as X' often fail to correct it. Fermi identifies these 'mis-aligned' responses (errors) on the user's past questions and uses them to write a prompt that explicitly corrects the model's behavior for that user.
Key Novelty
Few-shot Personalization with Mis-aligned Responses (Fermi)
Optimizes prompts by feeding the optimizer LLM not just scores but also specific examples of 'mis-aligned responses' (cases where the model's answer did not match the user's), allowing it to diagnose and fix specific failure modes
Uses 'Retrieval-or-Prompt' during inference: instead of using one fixed prompt, it retrieves relevant past user opinions for the current query and selects the prompt that performed best on those specific similar examples
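The error-driven loop described above can be sketched roughly as follows. This is a minimal illustration, not the paper's Algorithm 1: `answer_fn` (the target LLM) and `propose_fn` (the optimizer LLM) are hypothetical callables supplied by the caller, and the exact-match scoring and fixed iteration count are assumptions.

```python
from typing import Callable, List, Tuple

QA = Tuple[str, str]                 # (question, user's answer)
Misaligned = Tuple[str, str, str]    # (question, model answer, user's answer)
Entry = Tuple[str, float, List[Misaligned]]  # (prompt, score, mis-aligned responses)


def evaluate(prompt: str, opinions: List[QA],
             answer_fn: Callable[[str, str], str]) -> Tuple[float, List[Misaligned]]:
    """Score a prompt on the user's opinions and collect its mis-aligned responses."""
    misaligned: List[Misaligned] = []
    for question, gold in opinions:
        predicted = answer_fn(prompt, question)
        if predicted != gold:
            misaligned.append((question, predicted, gold))
    score = 1.0 - len(misaligned) / len(opinions)
    return score, misaligned


def optimize_prompt(initial_prompt: str, opinions: List[QA],
                    answer_fn: Callable[[str, str], str],
                    propose_fn: Callable[[List[Entry]], str],
                    num_iterations: int = 3) -> Entry:
    """Iteratively propose prompts; the proposer sees scores AND errors (assumed loop shape)."""
    score, misaligned = evaluate(initial_prompt, opinions, answer_fn)
    memory: List[Entry] = [(initial_prompt, score, misaligned)]
    for _ in range(num_iterations):
        candidate = propose_fn(memory)  # hypothetical optimizer-LLM call
        score, misaligned = evaluate(candidate, opinions, answer_fn)
        memory.append((candidate, score, misaligned))
    # Return the highest-scoring entry seen so far.
    return max(memory, key=lambda entry: entry[1])
```

In practice `answer_fn` and `propose_fn` would wrap LLM calls; keeping them as plain callables makes the loop easy to exercise with stubs.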
Architecture
Overview of the Fermi framework, illustrating both the iterative optimization process (left) and the Retrieval-or-Prompt inference method (right).
Evaluation Highlights
+6.8% average accuracy improvement on the first multiple-choice QA dataset compared to state-of-the-art prompt optimization baselines
+4.1% average accuracy improvement on the second multiple-choice QA dataset compared to baselines
Demonstrates that prompts personalized via one LLM transfer effectively to other LLMs (both API-based and open-source)
Breakthrough Assessment
7/10
Introduces a clever error-driven feedback loop for prompt optimization and a dynamic inference strategy. The gains are significant, though the core mechanism is an evolution of OPRO-style optimization.
⚙️ Technical Details
Problem Definition
Setting: Few-shot personalization where the goal is to predict a user's answer to a question given their profile and a small set of past opinions
Inputs: Test question q, User Profile U_pro, User Opinions U_opi (set of N QA pairs)
Outputs: Predicted answer a
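To make the inputs and outputs concrete, here is one possible way to represent them. The class names, field names, and the text layout produced by `build_input` are illustrative assumptions, not the paper's data format or prompt template.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Opinion:
    """One element of U_opi: a question the user already answered."""
    question: str
    choices: List[str]
    answer: str  # the option the user chose


@dataclass
class User:
    profile: Dict[str, str]                                 # U_pro
    opinions: List[Opinion] = field(default_factory=list)   # U_opi (N QA pairs)


def build_input(prompt: str, user: User, question: str, choices: List[str]) -> str:
    """Render prompt + profile + past opinions + test question q (hypothetical layout)."""
    lines = [prompt, "", "Profile:"]
    lines += [f"- {key}: {value}" for key, value in user.profile.items()]
    lines += ["", "Past opinions:"]
    for op in user.opinions:
        lines += [f"Q: {op.question}",
                  f"Options: {', '.join(op.choices)}",
                  f"Answer: {op.answer}"]
    # The model is expected to complete the final "Answer:" line with a.
    lines += ["", f"Q: {question}", f"Options: {', '.join(choices)}", "Answer:"]
    return "\n".join(lines)
```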
Pipeline Flow
Retrieval: Find relevant past opinions -> Selection: Choose best prompt -> Generation: Predict answer
System Modules
Retriever (Retrieval & Selection)
Identify past user questions most relevant to the current test query to contextualize prompt selection
Model or implementation: Sentence Encoder (e.g., SBERT)
Prompt Selector (Retrieval & Selection)
Select the specific prompt from the pre-optimized set that maximizes performance on the retrieved subset of opinions
Model or implementation: Scoring Function (Rule-based)
Generator
Generate the personalized answer using the selected prompt and user profile
Model or implementation: Target LLM (e.g., Llama-2, GPT-3.5)
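The Retriever and Prompt Selector modules can be sketched together as below. To stay self-contained, this uses a toy bag-of-words cosine similarity as a stand-in for a sentence encoder such as SBERT; the function names, the `correct` table (which prompt answered which past question correctly), and the tie-breaking behavior are all assumptions for illustration.

```python
import math
from collections import Counter
from typing import Dict, List


def embed(text: str) -> Counter:
    """Toy stand-in for a sentence encoder: lowercase bag-of-words counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a)  # Counter yields 0 for missing tokens
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, past_questions: List[str], k: int) -> List[int]:
    """Return indices of the k past questions most similar to the query."""
    q = embed(query)
    ranked = sorted(range(len(past_questions)),
                    key=lambda i: cosine(q, embed(past_questions[i])),
                    reverse=True)
    return ranked[:k]


def select_prompt(query: str, past_questions: List[str],
                  correct: Dict[str, List[bool]], k: int = 2) -> str:
    """Pick the prompt with the most correct answers on the retrieved subset.

    correct[prompt][i] is True if that prompt answered past question i as the user did.
    Ties go to the prompt listed first (an arbitrary choice in this sketch).
    """
    idx = retrieve(query, past_questions, k)
    return max(correct, key=lambda prompt: sum(correct[prompt][i] for i in idx))
```

The selected prompt, together with the user profile, is then passed to the target LLM to generate the answer.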
Novel Architectural Elements
Dynamic inference selection (Retrieval-or-Prompt) that couples retrieval of history with prompt selection, rather than using a single static prompt for all queries
Optimization pipeline (offline) that explicitly formats 'mis-aligned responses' (errors) into the context window of the optimizer LLM
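As a rough illustration of placing mis-aligned responses into the optimizer's context, the sketch below renders each past prompt with its score and a few of its errors. The wording, the ascending-score ordering, and the `max_errors` cap are assumptions made here; the paper's actual templates are in its Appendix B.

```python
from typing import List, Tuple

Misaligned = Tuple[str, str, str]  # (question, model answer, user's answer)
Entry = Tuple[str, float, List[Misaligned]]


def format_optimizer_input(history: List[Entry], max_errors: int = 2) -> str:
    """Render (prompt, score, mis-aligned responses) entries as optimizer context."""
    blocks = []
    # Lowest score first, so the best prompts sit closest to the final instruction.
    for prompt, score, errors in sorted(history, key=lambda entry: entry[1]):
        lines = [f"Prompt: {prompt}", f"Score: {score:.2f}"]
        for question, predicted, gold in errors[:max_errors]:
            lines.append(f"Mis-aligned: Q: {question} | model: {predicted} | user: {gold}")
        blocks.append("\n".join(lines))
    # Hypothetical closing instruction for the optimizer LLM.
    blocks.append("Write a new prompt that fixes the mis-aligned responses above.")
    return "\n\n".join(blocks)
```

An OPRO-style input would contain only the `Prompt:` and `Score:` lines; the `Mis-aligned:` lines are the part that carries the extra error context.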
Modeling
Base Model: Evaluated on multiple models including Llama-2-chat (7B/13B), GPT-3.5-turbo, GPT-4
Compute: Not reported in the paper
Comparison to Prior Work
vs. OPRO: Fermi includes the context of 'mis-aligned responses' (errors) in the optimizer's input, whereas OPRO only uses prompt strings and scores
vs. PE2: Fermi optimizes the instruction prompt itself rather than just selecting few-shot examples
vs. Vanilla Prompting: Fermi automates the search for user-specific prompts instead of using fixed templates
Code is publicly available at https://github.com/bbuing9/Fermi. The paper describes the prompt templates (Appendix B) and the algorithm (Algorithm 1) in detail.
📊 Experiments & Results
Evaluation Setup
Personalized Question Answering (QA) where models must predict a user's answer selection
Benchmarks:
Multiple-choice QA datasets (Personalized QA)
Metrics:
Accuracy
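For reference, accuracy on multiple-choice predictions can be computed as below. Averaging per-user accuracy across users is an assumption about how the "average accuracy" figures are aggregated; the summary does not spell this out.

```python
from typing import Dict, List, Tuple


def accuracy(predictions: List[str], answers: List[str]) -> float:
    """Fraction of predictions that exactly match the user's chosen answers."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("at least one example is required")
    return sum(p == a for p, a in zip(predictions, answers)) / len(answers)


def mean_user_accuracy(per_user: Dict[str, Tuple[List[str], List[str]]]) -> float:
    """Average of per-user accuracies (assumed aggregation, one weight per user)."""
    scores = [accuracy(preds, golds) for preds, golds in per_user.values()]
    return sum(scores) / len(scores)
```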
Statistical methodology: Not explicitly reported in the paper
Experiment Figures
An example of the input constructed for the Optimizer LLM, showing how prompts, scores, and mis-aligned contexts are formatted.
Main Takeaways
Fermi outperforms heuristic baselines and previous prompt optimization methods such as OPRO across QA benchmarks (+6.8% and +4.1% accuracy), which supports the value of error-driven feedback.
The 'Retrieval-or-Prompt' inference strategy is crucial; selecting prompts dynamically based on query context yields better personalization than a single static optimized prompt.
Mis-aligned responses (errors) carry signal that scores alone cannot capture, such as the types and patterns of wrong predictions, which helps the optimizer LLM navigate the prompt space more effectively.
Personalized prompts generated by Fermi transfer well across different models, allowing prompts optimized on a stronger model to be used on smaller/open-source models.
📚 Prerequisite Knowledge
Prerequisites
Prompt Engineering / Prompt Optimization
In-context Learning
Retrieval-Augmented Generation (RAG)
Key Terms
mis-aligned responses: Model outputs that contradict or fail to match the user's provided ground-truth opinions/answers
OPRO: Optimization by PROmpting, a method in which an LLM acts as an optimizer and generates better prompts from past prompts and their performance scores
U_pro: User profile information (e.g., demographics, ideology)
U_opi: Set of few-shot previous opinions (QA pairs) provided by the user
Retrieval-or-Prompt: The inference strategy proposed by Fermi that selects a prompt based on the similarity of the test query to past examples
optimization memory: A buffer storing tuples of (prompt, score, context) used by the optimizer LLM to generate improved prompts
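The optimization memory defined above could be represented along these lines. The bounded capacity and the keep-the-highest-scores eviction rule are assumptions for this sketch; the summary only states that the buffer stores (prompt, score, context) tuples.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class MemoryEntry:
    prompt: str
    score: float
    context: List[Tuple[str, str, str]]  # mis-aligned (question, model answer, user answer)


class OptimizationMemory:
    """Bounded buffer of (prompt, score, context) entries, best score first."""

    def __init__(self, capacity: int = 3) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.entries: List[MemoryEntry] = []

    def add(self, entry: MemoryEntry) -> None:
        self.entries.append(entry)
        # Stable sort: among equal scores, earlier entries stay ahead.
        self.entries.sort(key=lambda e: e.score, reverse=True)
        del self.entries[self.capacity:]  # evict the lowest-scoring overflow

    def best(self) -> MemoryEntry:
        return self.entries[0]
```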