Offline Exploration-Aware Fine-Tuning for Long-Chain Mathematical Reasoning

📝 Paper Summary

Mathematical Reasoning | Large Language Model Fine-tuning | Reinforcement Learning from Verifiable Rewards (RLVR)

OXA fine-tunes language models on offline reasoning trajectories by promoting low-confidence correct answers and suppressing high-confidence errors, creating a high-entropy initialization that enhances subsequent reinforcement learning for math reasoning.

Core Problem

Standard Supervised Fine-Tuning (SFT) initializes models with low policy entropy, causing them to memorize specific paths and prematurely converge during subsequent Reinforcement Learning (RL), limiting exploration.

Why it matters:

SFT is the critical starting point for RLVR; a poor initialization constrains the model's ability to discover new reasoning paths later
Existing methods focus on fixing exploration *during* RLVR, neglecting the potential to bake exploration capabilities into the SFT stage itself
RLVR excels at optimizing known paths but struggles to expand the fundamental reasoning space, which SFT is better suited to do

Concrete Example: When trained to convergence via standard SFT, a model's entropy collapses, making the predictive distribution sharp (peaked). Consequently, during RLVR, the model repeatedly samples the same high-probability reasoning path for a math problem, failing to explore alternative valid derivations that might lead to better generalization.

Key Novelty

Offline eXploration-Aware (OXA) Fine-Tuning

Counteracts entropy collapse by flattening the predictive distribution: it boosts the probability of valid reasoning paths that the model is currently unsure about (low-confidence truths)
Simultaneously reduces the probability of incorrect paths that the model is overly sure about (high-confidence errors), redistributing mass to potentially correct alternatives
Uses a Gaussian-guided sampling algorithm to select training data based on perplexity, ensuring a mix of difficulty levels rather than just the easiest or hardest samples

Architecture

Conceptual illustration of entropy dynamics. Upper: Standard SFT leads to entropy collapse (sharp distribution). Lower: OXA maintains high entropy by promoting probability at troughs (low-confidence truths) and suppressing peaks (high-confidence errors).

Evaluation Highlights

Achieves an average gain of +6.6 Pass@1 and +5.5 Pass@k points over conventional SFT on Qwen2.5-1.5B-Math across 6 benchmarks
Maintains significantly higher initial policy entropy compared to standard SFT, which persists throughout subsequent RLVR training
Gains are additive when combined with RLVR-enhancement methods, indicating that OXA is orthogonal to existing RL techniques

Breakthrough Assessment

8/10

Offers a distinct perspective by shifting the exploration problem from the RL stage to the SFT initialization stage. The consistent, significant gains across multiple models and benchmarks demonstrate strong practical value.

⚙️ Technical Details

Problem Definition

Setting: Two-stage training pipeline for long-chain mathematical reasoning: Supervised Fine-Tuning (SFT) followed by Reinforcement Learning from Verifiable Rewards (RLVR).

Inputs: Mathematical query q

Outputs: Reasoning chain and final answer

Pipeline Flow

Teacher Distillation & Verification
PPL Calculation & Data Selection
OXA Fine-Tuning (SFT Phase)
RLVR Training (RL Phase)

System Modules

Data Selector

Selects trajectories for promotion or suppression based on correctness and perplexity

Model or implementation: None (Algorithm)
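
The selection rule can be illustrated with a minimal sketch. It assumes each trajectory already carries a verification result and a perplexity score, and it uses two fixed thresholds as a simplification; the `Trajectory` class, `split_for_oxa`, and both threshold parameters are hypothetical names introduced here, not taken from the paper, which describes Gaussian-guided sampling rather than hard cutoffs.

```python
from dataclasses import dataclass


@dataclass
class Trajectory:
    text: str
    is_correct: bool  # outcome of answer verification
    ppl: float        # perplexity under the model being fine-tuned


def split_for_oxa(trajs, promote_min_ppl, suppress_max_ppl):
    """Partition trajectories into (promote, suppress) pools.

    promote:  correct trajectories the model finds surprising (high PPL).
    suppress: incorrect trajectories the model finds likely (low PPL).
    Both thresholds are illustrative assumptions.
    """
    promote = [t for t in trajs if t.is_correct and t.ppl >= promote_min_ppl]
    suppress = [t for t in trajs if not t.is_correct and t.ppl <= suppress_max_ppl]
    return promote, suppress


trajs = [
    Trajectory("a", True, 8.0),   # correct, low confidence  -> promote
    Trajectory("b", True, 1.2),   # correct, high confidence -> neither pool
    Trajectory("c", False, 1.1),  # wrong, high confidence   -> suppress
    Trajectory("d", False, 9.0),  # wrong, low confidence    -> neither pool
]
promote, suppress = split_for_oxa(trajs, promote_min_ppl=5.0, suppress_max_ppl=2.0)
```

Correct trajectories the model is already confident about, and incorrect ones it already considers unlikely, fall into neither pool in this sketch.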

OXA Fine-Tuner

Updates model weights to increase probability of low-confidence truths and decrease probability of high-confidence errors

Model or implementation: Qwen2.5-Math (1.5B / 7B)
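
The update direction of the fine-tuner can be sketched on plain token probabilities. This is a minimal illustration, assuming the standard forms of the cross-entropy and token-level unlikelihood losses; the function names, the per-trajectory mean, and the `eps` clamp are assumptions made here, and a real implementation would operate on model logits in a training framework.

```python
import math


def mle_loss(token_probs):
    """Mean negative log-likelihood of the target tokens (promotion term)."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


def unlikelihood_loss(token_probs, eps=1e-8):
    """Mean of -log(1 - p) over the tokens of a trajectory to suppress.

    The loss grows as the model assigns more probability to those tokens.
    eps guards against log(0) when p is numerically 1.
    """
    return -sum(math.log(max(1.0 - p, eps)) for p in token_probs) / len(token_probs)


def oxa_loss(promote_probs, suppress_probs, alpha):
    """Combined objective: L_MLE + alpha * L_UL."""
    return mle_loss(promote_probs) + alpha * unlikelihood_loss(suppress_probs)


# Probabilities the model assigns to each token of two example trajectories.
low_conf_correct = [0.2, 0.1, 0.3]   # verified-correct, but unlikely under the model
high_conf_wrong = [0.9, 0.95, 0.8]   # incorrect, but likely under the model
loss = oxa_loss(low_conf_correct, high_conf_wrong, alpha=0.1)
```

Minimizing the first term raises the probability of the low-confidence correct tokens; minimizing the second lowers the probability of the high-confidence incorrect ones.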

Novel Architectural Elements

Dual-objective SFT loss integration: Combining promotion of low-confidence correct data (via MLE) with suppression of high-confidence incorrect data (via Unlikelihood) specifically for SFT initialization
Gaussian-guided PPL sampling strategy for curriculum-like data selection in SFT
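
One way to picture Gaussian-guided sampling is to weight each candidate by a Gaussian density over its perplexity and draw without replacement. The sketch below is an assumption-laden illustration, not the paper's algorithm: `gaussian_weights`, `gaussian_guided_sample`, the small weight floor, and the fixed seed are all hypothetical choices.

```python
import math
import random


def gaussian_weights(ppls, mu, sigma):
    """Unnormalized Gaussian weight for each perplexity value."""
    return [math.exp(-((p - mu) ** 2) / (2 * sigma ** 2)) for p in ppls]


def gaussian_guided_sample(ppls, mu, sigma, k, seed=0):
    """Draw up to k distinct indices, favoring PPL values near mu.

    A larger sigma spreads the selection over a wider range of difficulty.
    """
    rng = random.Random(seed)
    # A tiny floor keeps the total weight positive even if every
    # Gaussian weight underflows to zero.
    weights = [w + 1e-12 for w in gaussian_weights(ppls, mu, sigma)]
    indices = list(range(len(ppls)))
    chosen = []
    for _ in range(min(k, len(indices))):
        pick = rng.choices(range(len(indices)), weights=weights, k=1)[0]
        chosen.append(indices.pop(pick))
        weights.pop(pick)
    return chosen


ppls = [1.0, 2.0, 3.0, 4.0, 50.0]
selected = gaussian_guided_sample(ppls, mu=2.0, sigma=1.0, k=3)
```

Compared with taking only the top-k highest or lowest PPL samples, this kind of weighting keeps a mix of difficulty levels centered on `mu`.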

Modeling

Base Model: Qwen2.5-1.5B-Math and Qwen2.5-7B-Math

Training Method: Offline Exploration-Aware (OXA) Fine-Tuning followed by RLVR (GRPO)

Objective Functions:

Purpose: Promote low-confidence correct reasoning paths.
Formally: Minimize the cross-entropy loss L_MLE on selected low-confidence verified trajectories.

Purpose: Suppress high-confidence incorrect reasoning paths.
Formally: Minimize the token-level unlikelihood loss L_UL on selected high-confidence incorrect trajectories.

Purpose: Combined global objective.
Formally: L_OXA = L_MLE + alpha * L_UL
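
For reference, the three objectives can be written out using the standard definitions of cross-entropy and token-level unlikelihood. The notation is an assumption introduced here: q is the query, y a trajectory with tokens y_t, D^+ the selected low-confidence correct set, and D^- the selected high-confidence incorrect set. The paper's exact normalization may differ.

```latex
\begin{align}
\mathcal{L}_{\mathrm{MLE}} &= -\,\mathbb{E}_{(q,y)\in\mathcal{D}^{+}}
  \left[\frac{1}{|y|}\sum_{t=1}^{|y|}\log p_\theta\!\left(y_t \mid q, y_{<t}\right)\right] \\
\mathcal{L}_{\mathrm{UL}} &= -\,\mathbb{E}_{(q,y)\in\mathcal{D}^{-}}
  \left[\frac{1}{|y|}\sum_{t=1}^{|y|}\log\!\left(1 - p_\theta\!\left(y_t \mid q, y_{<t}\right)\right)\right] \\
\mathcal{L}_{\mathrm{OXA}} &= \mathcal{L}_{\mathrm{MLE}} + \alpha\,\mathcal{L}_{\mathrm{UL}}
\end{align}
```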

Adaptation: Full fine-tuning

Training Data:

Utilizes NuminaMath-CoT dataset
Teacher models (DeepSeek-V2-Coder-Instruct) generate responses for offline selection

Key Hyperparameters:

alpha (unlikelihood weight): Small, to prevent gradient instability (exact value not given in the available text)
mu (Gaussian mean for PPL): Controls center of difficulty selection
sigma (Gaussian std for PPL): Controls diversity of difficulty selection

Compute: Not reported in the paper

Comparison to Prior Work

vs. Standard SFT: OXA selectively targets low-confidence data and explicitly suppresses errors to maintain entropy, whereas SFT collapses entropy.
vs. RFT: OXA uses teacher-distilled data for capability expansion (learning new paths) rather than just self-generated paths.
vs. Entropy-regularized RL (e.g., MaxEnt RL) [not cited in paper]: OXA addresses entropy at the initialization (SFT) stage, whereas MaxEnt RL addresses it during the RL loop.

Limitations

Relies on offline data curation which requires a teacher model and verification pipeline
Unlikelihood training can be unstable if the weighting parameter alpha is too large
Requires verifiable rewards (as in math problems) and may not generalize easily to open-ended tasks without clear correctness checks

Reproducibility

Code: https://github.com/takagi97/OXA-Fine-tuning

The Gaussian sampling hyperparameters and the alpha weight are described as critical, but their exact values are in the Appendix (not provided in the text).

📊 Experiments & Results

Evaluation Setup

Evaluation of mathematical reasoning on 6 benchmarks using SFT-then-RLVR paradigm.

Benchmarks:

GSM8K (Grade school math)
MATH (Challenging competition math)
AIME (High-school math competition)
AMC (American Math Competitions)
OlympiadBench (Olympiad-level math)
GaoKao (Chinese college entrance exam math)

Metrics:

Pass@1
Pass@k
Policy Entropy

Statistical methodology: Not explicitly reported in the paper

Key Results

Main results on Qwen2.5-1.5B-Math show OXA significantly outperforms conventional SFT across averaged benchmarks.
Benchmark	Metric	Baseline	This Paper	Δ
Average (6 benchmarks)	Pass@1	Not reported in the paper	Not reported in the paper	+6.6
Average (6 benchmarks)	Pass@k	Not reported in the paper	Not reported in the paper	+5.5

Main Takeaways

OXA consistently improves mathematical reasoning performance across diverse benchmarks (GSM8K, MATH, AIME, etc.) compared to standard SFT.
The method successfully mitigates entropy collapse: OXA-trained models exhibit higher policy entropy than SFT models, indicating a broader exploration space.
Performance gains from OXA persist throughout the subsequent extended RLVR training phase.
The approach is effective across different model scales (tested on 1.5B and 7B parameters).

📚 Prerequisite Knowledge

Prerequisites

Supervised Fine-Tuning (SFT)
Reinforcement Learning from Verifiable Rewards (RLVR)
Perplexity (PPL) as a measure of model confidence
Policy Entropy
Maximum Likelihood Estimation (MLE)
Unlikelihood Training

Key Terms

RLVR: Reinforcement Learning from Verifiable Rewards, which uses outcome-based rewards (like correct final answers in math) to train models via RL

Policy Entropy: A measure of the randomness or spread of the model's token predictions; higher entropy means the model is more likely to explore diverse paths

Perplexity (PPL): A metric used here to quantify the model's confidence in a reasoning trajectory; high PPL means low confidence

Unlikelihood Loss: A loss function that decreases the probability of specified tokens (used here to suppress high-confidence errors)

Pass@k: An evaluation metric measuring the probability that at least one of k generated solutions is correct

SFT-then-RLVR: A standard training paradigm where a model is first fine-tuned on labeled data (SFT) and then further optimized using reinforcement learning (RLVR)
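
Three of the terms above (Policy Entropy, Perplexity, Pass@k) have simple closed forms, sketched below in plain Python. The function names are illustrative. The Pass@k function uses a widely used unbiased estimator computed from n samples with c correct; the summary does not state which estimator the paper uses, so treat that choice as an assumption.

```python
import math
from math import comb


def policy_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def perplexity(token_logprobs):
    """exp of the mean negative log-probability of a trajectory's tokens.

    High perplexity means the model has low confidence in the trajectory.
    """
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def pass_at_k(n, c, k):
    """Estimate Pass@k from n sampled solutions of which c are correct."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


sharp = [0.97, 0.01, 0.01, 0.01]  # peaked distribution, as after entropy collapse
flat = [0.25, 0.25, 0.25, 0.25]   # uniform distribution, maximal entropy
entropy_gap = policy_entropy(flat) - policy_entropy(sharp)
```

The peaked distribution has much lower entropy than the uniform one, which is the sense in which a collapsed policy keeps sampling the same reasoning path.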