
GIFT: Reconciling Post-Training Objectives via Finite-Temperature Gibbs Initialization

Zhengyang Zhao, Lu Ma, Yizhen Jiang, Xiaochen Ma, Zimo Meng, Chengyu Shen, Lexiang Tang, Haoze Sun, Peng Pei, Wentao Zhang
Peking University, Meituan
arXiv (2026)
RL Reasoning Benchmark

📝 Paper Summary

Post-training optimization, Large Reasoning Models (LRMs)
GIFT reformulates supervised fine-tuning as a finite-temperature Gibbs distribution to preserve base model capabilities and prevent the exploration collapse typical of standard zero-temperature training.
Core Problem
Standard Supervised Fine-Tuning (SFT) forces models to imitate expert data deterministically (zero-temperature limit), causing the probability distribution to collapse and destroying the exploration space needed for subsequent Reinforcement Learning (RL).
Why it matters:
  • Distributional collapse erodes structural priors from pre-training, making the model rigid and unable to explore diverse reasoning paths during RL.
  • The mismatch between SFT's imitation objective and RL's exploration objective creates a bottleneck where the model forgets general knowledge and overfits to specific templates.
Concrete Example: In standard SFT, if an expert solves a math problem using Method A, the model is pushed to drive the probability of an equally valid Method B toward zero. When RL starts, the model rarely samples Method B, so it cannot discover whether Method B yields higher rewards and remains stuck in a local optimum.
Key Novelty
Gibbs Initialization with Finite Temperature (GIFT)
  • Theoretical Reinterpretation: Frames standard SFT as a degenerate 'zero-temperature' case that destroys information, whereas the ideal initialization for RL is a 'finite-temperature' Gibbs distribution.
  • Practical Implementation: Instead of forcing the model to strictly copy expert tokens, GIFT incorporates supervision as a reward-weighted scaling of the base model's distribution. This boosts expert solutions while keeping the base model's alternative paths viable for future exploration.
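The contrast between the two regimes can be illustrated on a toy distribution. The sketch below assumes the textbook Gibbs form, in which the target is proportional to the base probability times exp(reward / temperature); the summary does not give the paper's exact objective or reward definition, so the function name `gibbs_target`, the three candidate "methods", and all numbers are hypothetical.

```python
import math


def gibbs_target(base_probs, rewards, temperature):
    """Reweight a base distribution by exp(reward / temperature), then renormalize.

    Assumes every base probability is strictly positive and temperature > 0.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Work in log space and subtract the max so the exponentials cannot overflow.
    logits = [math.log(p) + r / temperature for p, r in zip(base_probs, rewards)]
    max_logit = max(logits)
    weights = [math.exp(logit - max_logit) for logit in logits]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical base model over three solution strategies:
# index 0 = Method A (the expert's), index 1 = Method B (valid alternative),
# index 2 = some other path. Only the expert demonstration carries reward here.
base = [0.5, 0.3, 0.2]
rewards = [1.0, 0.0, 0.0]

finite = gibbs_target(base, rewards, temperature=1.0)      # finite temperature
near_zero = gibbs_target(base, rewards, temperature=0.01)  # approaches the SFT-like limit

print("finite temperature:", [round(p, 3) for p in finite])
print("near-zero temperature:", [round(p, 6) for p in near_zero])
```

At temperature 1.0 the expert's method gains mass while Method B keeps a non-trivial share, and the relative ordering of the unrewarded paths is inherited from the base model. As the temperature shrinks, nearly all mass moves to the rewarded method, which mirrors the collapse attributed to standard SFT.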
Evaluation Highlights
  • +10 percentage points on the challenging AIME benchmark (13.33% → 23.33%) using Qwen2.5-7B, compared to standard SFT.
  • Achieves 59.55% average pass@1 on Qwen2.5-7B across 6 reasoning benchmarks, outperforming strong baselines like PSFT (56.33%) and LUFFY (56.69%).
  • Better scaling with sample count: GIFT achieves a +3.8% lead over standard SFT at pass@8 on Qwen2.5-7B, suggesting that it preserves a more diverse and effective search space for RL.
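For readers unfamiliar with the metric, pass@k is the probability that at least one of k sampled solutions is correct. The summary does not say how the pass@1 and pass@8 figures were computed, so the sketch below is only an assumption about the protocol: it uses the widely used unbiased estimator 1 - C(n - c, k) / C(n, k), where n is the number of samples drawn per problem and c the number that are correct. The function name `pass_at_k` and the sample counts are illustrative.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples of which c are correct."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # If fewer than k samples are incorrect, every size-k subset contains a correct one.
    if n - c < k:
        return 1.0
    # Otherwise: 1 minus the fraction of size-k subsets made up only of incorrect samples.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical problem: 8 samples drawn, 2 of them correct.
p1 = pass_at_k(n=8, c=2, k=1)
p8 = pass_at_k(n=8, c=2, k=8)
print(p1, p8)
```

A model whose correct answers are spread across more problems gains more from larger k, which is why a pass@8 gap is commonly read as a signal of sampling diversity.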
Breakthrough Assessment
8/10
Offers a mathematically principled correction to the standard SFT-then-RL pipeline. The theoretical insight (SFT as the zero-temperature limit) is elegant, and the empirical gains on hard reasoning tasks are substantial.