
Test-Time Preference Optimization: On-the-Fly Alignment via Iterative Textual Feedback

Yafu Li, Xuyang Hu, Xiaoye Qu, Linjie Li, Yu Cheng
International Conference on Machine Learning (2025)

📝 Paper Summary

Tags: Inference-time alignment · Test-time compute
TPO aligns LLMs during inference by converting numerical reward signals into textual critiques and instructions (textual gradients) to iteratively refine responses without updating model parameters.
Core Problem
Training-time alignment (e.g., RLHF) is computationally expensive and slow to adapt, while existing inference-time methods often rely on opaque numerical scores that LLMs struggle to interpret for self-correction.
Why it matters:
  • Retraining models for every new preference distribution is impractical for rapidly evolving requirements
  • Purely numerical feedback (Best-of-N) does not explicitly guide the model on *how* to improve, limiting the efficiency of test-time scaling
  • Current methods fail to leverage the inherent reasoning and instruction-following capabilities of LLMs for their own alignment
Concrete Example: In a Best-of-N scenario, an LLM might generate N responses and a reward model picks the best one, but the model never learns *why* one was better. With TPO, the model explicitly analyzes the 'chosen' vs. 'rejected' response to generate a critique (e.g., 'Response A is safer because it avoids X'), then uses that critique to generate a superior response.
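The Best-of-N baseline described above can be sketched in a few lines. This is a minimal illustration with stand-in generator and reward functions (all names and canned responses here are illustrative, not from the paper's implementation):

```python
def generate(prompt: str, seed: int) -> str:
    """Stand-in for an LLM call: returns a canned response variant."""
    variants = [
        "Refuse and explain the risk.",
        "Comply without caveats.",
        "Comply but add a safety warning.",
    ]
    return variants[seed % len(variants)]

def reward(prompt: str, response: str) -> float:
    """Stand-in reward model: prefers responses mentioning safety."""
    return 1.0 if ("safety" in response or "risk" in response) else 0.0

def best_of_n(prompt: str, n: int = 3) -> str:
    candidates = [generate(prompt, i) for i in range(n)]
    # The reward model picks a winner, but no signal about *why* it won
    # ever flows back into generation -- the gap TPO addresses.
    return max(candidates, key=lambda r: reward(prompt, r))

print(best_of_n("How do I handle a dangerous request?"))
```

Note that the reward score is consumed only by `max`; the generator never sees it, which is exactly the interpretability gap TPO targets.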
Key Novelty
Textual Gradient Descent
  • Translates the mathematical concept of gradient descent into natural language: 'Loss' becomes a textual critique, and 'Gradient' becomes textual instructions for improvement
  • Uses the LLM's own reasoning to interpret the gap between a high-scoring and low-scoring response, explicitly formulating a strategy to improve the output
  • Performs optimization in the space of 'context' (the prompt/history) rather than the space of model parameters (weights)
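One TPO iteration following the analogy above can be sketched as follows. This is a hedged sketch with stand-in scoring, critique, and refinement functions; none of these names or heuristics come from the paper's code, and a real deployment would route `critique`, `textual_gradient`, and `refine` through LLM calls:

```python
def score(response: str) -> float:
    """Stand-in reward model: longer, caveated answers score higher."""
    return len(response) + (10.0 if "caveat" in response else 0.0)

def critique(chosen: str, rejected: str) -> str:
    """'Textual loss': explain why the chosen response beats the rejected one."""
    return (f"The preferred answer ('{chosen}') succeeds where "
            f"'{rejected}' falls short; keep its strengths.")

def textual_gradient(loss_text: str) -> str:
    """'Textual gradient': turn the critique into a concrete instruction."""
    return "Revise the answer following this critique: " + loss_text

def refine(prompt: str, response: str, instruction: str) -> str:
    """Stand-in for regenerating conditioned on the instruction; the
    update happens in the context, not in the model weights."""
    return response + " [revised per critique]"

def tpo_step(prompt: str, candidates: list[str]) -> str:
    ranked = sorted(candidates, key=score, reverse=True)
    chosen, rejected = ranked[0], ranked[-1]
    loss_text = critique(chosen, rejected)       # textual loss
    instruction = textual_gradient(loss_text)    # textual gradient
    return refine(prompt, chosen, instruction)   # update in context space

print(tpo_step("Explain X safely.",
               ["Short answer.", "Detailed answer with caveat noted."]))
```

Iterating `tpo_step` on its own output gives the test-time optimization loop: each pass spends more inference compute to move the response along the direction the critique describes.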
Architecture
Figure 2: The iterative pipeline of Test-Time Preference Optimization (TPO), drawing an analogy to gradient descent.
Evaluation Highlights
  • An unaligned Llama-3.1-70B-SFT model using TPO surpasses the performance of its standard RLHF-aligned counterpart (Llama-3.1-70B-Instruct) across nearly all tested benchmarks
  • Applying TPO to an aligned 22B-parameter model achieves a 53.4% LC (length-controlled) win rate on AlpacaEval 2, outperforming well-established leaderboard entries
  • The same TPO-augmented 22B model reaches a 72.2% win rate (WR) on Arena-Hard
Breakthrough Assessment
8/10
Offers a compelling, interpretable alternative to training-time alignment (RLHF/DPO) and demonstrates that unaligned models can beat aligned ones purely through inference-time compute. High practical value for adaptability.