SFT: Supervised Fine-Tuning—training a model to mimic expert demonstrations using cross-entropy loss
RLVR: Reinforcement Learning with Verifiable Rewards—an RL phase using ground-truth verifiers (e.g., math answers) to optimize reasoning
Gibbs distribution: A probability distribution in which the probability of a state is proportional to the exponential of its negative energy (equivalently, its reward) divided by temperature, i.e., p(x) ∝ exp(−E(x)/T) = exp(β·r(x))
GRPO: Group Relative Policy Optimization—an RL algorithm that normalizes advantages within a group of sampled outputs for the same prompt
OOD: Out-of-Distribution—tasks or data significantly different from the training set, used to test generalization
KL divergence: Kullback-Leibler divergence—an asymmetric measure of how one probability distribution differs from a second, reference probability distribution
Inverse temperature (β): A parameter controlling the sharpness of a distribution; high β (low temperature) makes the distribution peaky (deterministic), while low β flattens it
Pass@k: An evaluation metric measuring the probability that at least one of the k generated solutions is correct
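The Gibbs distribution and inverse temperature entries can be made concrete with a small sketch: sampling weights proportional to exp(β·r) and observing how β controls sharpness. The function name `gibbs` is illustrative, not from the source.

```python
import math

def gibbs(rewards, beta):
    """Gibbs (softmax) distribution over states: p_i ∝ exp(beta * r_i)."""
    weights = [math.exp(beta * r) for r in rewards]
    z = sum(weights)  # partition function (normalizer)
    return [w / z for w in weights]

rewards = [1.0, 2.0, 3.0]
# Low beta (high temperature): nearly uniform distribution.
print(gibbs(rewards, beta=0.1))
# High beta (low temperature): almost all mass on the highest-reward state.
print(gibbs(rewards, beta=10.0))
```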
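GRPO's group-relative normalization can be sketched as follows: rewards for several sampled outputs of the same prompt are standardized against the group's own mean and standard deviation. This is a minimal illustration of the advantage computation only, not a full GRPO implementation; the function name is hypothetical.

```python
def group_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: A_i = (r_i - mean(r)) / (std(r) + eps),
    computed within one group of outputs sampled for the same prompt."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Two correct (reward 1) and two incorrect (reward 0) samples:
print(group_advantages([1.0, 0.0, 1.0, 0.0]))
```

Correct samples get positive advantages and incorrect ones negative, without training a separate value function.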
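The KL divergence entry corresponds to the standard discrete formula D_KL(P‖Q) = Σ p(x)·log(p(x)/q(x)), sketched here for finite distributions to show its asymmetry:

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) over a finite support; terms with p(x) = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]
q = [0.9, 0.1]
# KL is asymmetric: D_KL(P||Q) differs from D_KL(Q||P).
print(kl_divergence(p, q))
print(kl_divergence(q, p))
```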
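Pass@k is commonly computed with the unbiased estimator 1 − C(n−c, k)/C(n, k), where n samples are drawn and c of them are correct (naively taking only k samples gives a high-variance estimate). A minimal sketch:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: probability that at least one of k
    solutions drawn from n samples (c of them correct) is correct."""
    if n - c < k:
        return 1.0  # fewer than k incorrect samples: some draw must succeed
    return 1.0 - comb(n - c, k) / comb(n, k)

# 1 correct out of 2 samples, drawing 1:
print(pass_at_k(2, 1, 1))
```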