
Temporal Self-Rewarding Language Models: Decoupling Chosen-Rejected via Past-Future

Y Wang, X Wang, C Wang, J Fang, Q Wang, J Chu…
arXiv, August 2025
RL Reasoning Benchmark

📝 Paper Summary

Self-Improvement · Direct Preference Optimization (DPO) · Reward Modeling
Temporal Self-Rewarding prevents the collapse of learning signals in iterative self-improvement by anchoring rejected responses to past models and guiding chosen responses with future model predictions.
Core Problem
In standard Self-Rewarding loops, the representations of chosen and rejected responses become increasingly similar over iterations, causing the DPO gradient to vanish and learning to stall.
Why it matters:
  • Self-improvement paradigms are crucial for scaling LLMs beyond limited human-annotated data
  • Current self-rewarding methods suffer from diminishing returns because the model's ability to distinguish good from bad deteriorates as it improves
  • Gradient collapse wastes computational resources and limits the ceiling of autonomous model refinement
Concrete Example: As shown in Figure 1, in standard Self-Rewarding, the score gap between chosen and rejected responses shrinks by 9x over iterations. This means the model can no longer distinguish 'better' from 'worse', effectively killing the optimization signal.
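
The collapse described above can be illustrated with a toy calculation. Under DPO, the parameter gradient is proportional to σ(β(r_l − r_w)) · (∇r_w − ∇r_l), so when the chosen and rejected responses occupy nearly identical representations, the two gradient terms cancel. The sketch below uses a hypothetical linear reward model (an assumption for illustration, not the paper's setup) to show the gradient norm shrinking to zero as the chosen features are interpolated toward the rejected ones:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dpo_grad_norm(phi_w, phi_l, w, beta=0.1):
    """Norm of the DPO gradient under a toy linear reward r(x) = w @ phi(x).

    grad_w L = -beta * sigmoid(beta * (r_l - r_w)) * (phi_w - phi_l)
    """
    r_w, r_l = w @ phi_w, w @ phi_l
    grad = -beta * sigmoid(beta * (r_l - r_w)) * (phi_w - phi_l)
    return np.linalg.norm(grad)

w     = np.array([1.0, -0.5, 0.25])   # arbitrary reward weights
phi_l = np.array([0.2, 0.1, -0.3])    # rejected-response features
delta = np.array([0.6, -0.4, 0.5])    # chosen-vs-rejected feature gap

# As iterations proceed, chosen drifts toward rejected (alpha -> 0)
norms = [dpo_grad_norm(phi_l + a * delta, phi_l, w)
         for a in (1.0, 0.5, 0.1, 0.0)]
# The gradient norm shrinks monotonically and is exactly zero at alpha = 0
```

Once the feature gap closes, no choice of β rescues the signal: the update direction itself, not just its weighting, has vanished.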
Key Novelty
Temporal Decoupling of Preference Pairs
  • Anchored Rejection: Instead of drawing negative examples from the current model, whose rejected outputs improve alongside its chosen ones, the system always uses outputs from the initial (past) model as a stable 'bad' baseline.
  • Future-Guided Chosen: The system trains a temporary 'future' model on the anchored data to generate superior positive examples, which are then used to teach the current model.
  • This push-pull dynamic (pulling away from the past, pushing toward the future) maintains a large quality gap, preserving strong gradients.
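
The three bullets above amount to a simple loop. The sketch below is a toy rendering of that loop, not the paper's code: `generate` and `train_dpo` are placeholder stubs in which a model is just a scalar "skill" level, so the temporal structure (frozen past anchor, temporary future model, push-pull update) is visible without any real training:

```python
def generate(model, prompt):
    """Stub generator: output quality simply mirrors the model's skill."""
    return {"prompt": prompt, "quality": model["skill"]}

def train_dpo(model, chosen, rejected):
    """Stub DPO step: skill moves toward the chosen outputs, plus a small gain."""
    avg_chosen = sum(c["quality"] for c in chosen) / len(chosen)
    return {"skill": max(model["skill"], avg_chosen) + 0.1}

prompts = ["p1", "p2"]
past    = {"skill": 1.0}   # M_0, frozen: the stable source of rejected responses
current = {"skill": 1.0}

for _ in range(3):
    rejected    = [generate(past, p) for p in prompts]     # anchored rejection
    self_chosen = [generate(current, p) for p in prompts]
    future  = train_dpo(current, self_chosen, rejected)    # temporary "future" model
    chosen  = [generate(future, p) for p in prompts]       # future-guided chosen
    current = train_dpo(current, chosen, rejected)         # push-pull update

# The chosen-rejected quality gap (current vs. past) widens each round,
# instead of collapsing as in standard Self-Rewarding.
```

Because the rejected side is pinned to the frozen past model while the chosen side is generated by a model one step ahead, the preference margin grows with each iteration rather than converging to zero.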
Evaluation Highlights
  • +9.75% win rate improvement on AlpacaEval 2.0 with Llama3.1-8B (29.44% vs. 19.69% for standard Self-Rewarding)
  • +12.9 score improvement on Arena-Hard-v0.1 with Qwen2.5-7B (34.4 vs. 21.5 for standard Self-Rewarding)
  • Strong generalization to out-of-distribution tasks: +2.66% accuracy on TruthfulQA compared to the best Self-Rewarding baseline
Breakthrough Assessment
8/10
Identifies a fundamental theoretical flaw in self-rewarding loops (gradient collapse) and provides a highly effective, compute-neutral solution that significantly boosts performance across multiple benchmarks.