FSPO: Few-Shot Preference Optimization—the proposed framework treating personalization as meta-learning on preference sequences.
DPO: Direct Preference Optimization, an algorithm that fine-tunes a policy directly on preference pairs, without training a separate reward model or running a reinforcement learning loop.
IPO: Identity Preference Optimization, a preference optimization objective that replaces DPO's logistic loss with a squared loss on the policy-to-reference log-ratio margin, keeping the policy close to the reference and limiting overfitting to the preference data.
CoT: Chain-of-Thought, a prompting strategy where the model generates intermediate reasoning steps (here, a user description) before the final answer.
Meta-learning: A learning paradigm where the model learns to adapt to new tasks (here, new users) using a small set of examples (preferences).
ELIX: Explain Like I'm X—one of the paper's domains where the model must adapt explanations to the user's education level.
SFT: Supervised Fine-Tuning—the initial training phase on high-quality demonstrations before preference optimization.
AlpacaEval: An automatic evaluator for instruction-following models that uses a strong LLM as a judge, typically reporting a win rate against reference responses.
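
To make the difference between the DPO and IPO entries concrete, the following is a minimal sketch of the two per-example losses as they are commonly written, not the paper's implementation. The function names, the default values of `beta` and `tau`, and the log-probabilities in the usage lines are illustrative assumptions. Both losses act on the same quantity: the gap between the policy-to-reference log-ratios of the chosen and rejected responses.

```python
import math

def log_ratio_margin(pol_chosen, pol_rejected, ref_chosen, ref_rejected):
    """Margin between chosen and rejected policy-to-reference log-ratios.

    All arguments are sequence log-probabilities (hypothetical inputs).
    """
    return (pol_chosen - ref_chosen) - (pol_rejected - ref_rejected)

def dpo_loss(margin, beta=0.1):
    """DPO: negative log-sigmoid of the scaled margin (logistic loss)."""
    x = beta * margin
    # -log(sigmoid(x)) == log(1 + exp(-x)), computed in a numerically stable way
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))

def ipo_loss(margin, tau=0.1):
    """IPO: squared distance between the margin and the target 1 / (2 * tau)."""
    return (margin - 1.0 / (2.0 * tau)) ** 2

# Illustrative (made-up) log-probabilities for one preference pair.
m = log_ratio_margin(pol_chosen=-1.0, pol_rejected=-2.0,
                     ref_chosen=-1.5, ref_rejected=-1.5)
print(m, dpo_loss(m), ipo_loss(m))
```

The DPO loss keeps decreasing as the margin grows, whereas the IPO loss is minimized at a finite margin of 1 / (2 * tau), which is the mechanism behind its regularizing effect.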