SFT: Supervised Fine-Tuning—further training a pre-trained model on labeled instruction-response pairs to align it with human intent
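A minimal sketch of the SFT objective, assuming per-token probabilities are already computed; the function name, the toy values, and the convention of excluding prompt tokens from the loss are all illustrative:

```python
import math

def sft_loss(target_probs, response_mask):
    # Cross-entropy over the labeled pair, averaged over response tokens only;
    # prompt tokens are commonly excluded from the loss (an assumed convention).
    terms = [-math.log(p) for p, m in zip(target_probs, response_mask) if m]
    return sum(terms) / len(terms)

# Probabilities the model assigns to each gold token; 0 marks prompt positions.
loss = sft_loss([0.9, 0.5, 0.5], [0, 1, 1])
```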
One-to-many nature: The linguistic property where a single intent or meaning can be validly expressed by multiple different token sequences
Logits: The raw, unnormalized output scores from the model's last layer, before being converted to probabilities via softmax
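For instance, the softmax conversion can be sketched in a few lines (the logit values here are made up):

```python
import math

def softmax(logits):
    # Subtract the max logit for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Raw last-layer scores for a toy 3-token vocabulary (illustrative values).
probs = softmax([2.0, 1.0, 0.1])
```

The resulting probabilities sum to 1 and preserve the ordering of the logits.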
Stop-gradient: An operator in computational graphs that prevents error gradients from flowing back through a specific variable during training updates
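The effect can be seen in a toy scalar autodiff sketch (a minimal illustration, not a real framework): detaching one factor of a product changes the gradient from 2x to x.

```python
class Value:
    """Toy scalar autodiff node; a minimal sketch, not a real framework."""
    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents
        self._backward = lambda: None

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # Visit nodes in reverse topological order so each node's grad is
        # complete before it is propagated to its parents.
        topo, seen = [], set()
        def build(v):
            if v not in seen:
                seen.add(v)
                for p in v._parents:
                    build(p)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()

def stop_gradient(v):
    # Same value, detached: no parents and no backward rule, so gradients
    # cannot flow back through it.
    return Value(v.data)

x = Value(3.0)
(x * x).backward()                    # d(x*x)/dx = 2x = 6
x2 = Value(3.0)
(stop_gradient(x2) * x2).backward()   # sg(x) treated as a constant: grad = 3
```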
Jacobian: A matrix of all first-order partial derivatives of a vector-valued function; here representing how model outputs change with respect to parameters
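Numerically, a Jacobian can be approximated entry-by-entry with finite differences; `jacobian_fd` below is a hypothetical helper for illustration:

```python
def jacobian_fd(f, x, eps=1e-6):
    # J[i][j] approximates d f_i / d x_j via forward differences.
    fx = f(x)
    J = [[0.0] * len(x) for _ in fx]
    for j in range(len(x)):
        xp = list(x)
        xp[j] += eps
        fxp = f(xp)
        for i in range(len(fx)):
            J[i][j] = (fxp[i] - fx[i]) / eps
    return J

# f(x, y) = (x*y, x + y) has exact Jacobian [[y, x], [1, 1]].
J = jacobian_fd(lambda v: [v[0] * v[1], v[0] + v[1]], [2.0, 3.0])
```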
LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning method that updates only small low-rank matrices added to the original weights
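A numeric sketch of the low-rank update W' = W + BA; the toy dimensions and the zero initialization of B (so fine-tuning starts exactly from the base model) are illustrative assumptions:

```python
import random

def matmul(A, B):
    # Plain-Python matrix product for the toy example.
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

d, r = 8, 2                                                        # toy sizes: d x d weight, rank-r adapter
W = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]     # frozen pre-trained weight
B = [[0.0] * r for _ in range(d)]                                  # trainable d x r, zero-initialized
A = [[random.gauss(0, 0.01) for _ in range(d)] for _ in range(r)]  # trainable r x d

delta = matmul(B, A)  # rank <= r update; only B and A receive gradients
W_adapted = [[w + dw for w, dw in zip(wr, dr)] for wr, dr in zip(W, delta)]
# Trainable parameters: 2*d*r = 32 here, versus d*d = 64 for a full update.
```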
Hard masking: A binary filtering technique where data points (tokens) are either fully included or fully excluded from the loss calculation based on a threshold
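A sketch of threshold-based hard masking over per-token probabilities; the 0.5 threshold and the choice to drop tokens below it are illustrative, not taken from the source:

```python
import math

def hard_masked_loss(token_probs, threshold=0.5):
    # Binary include/exclude: tokens failing the threshold contribute nothing
    # to the loss, not even a down-weighted term.
    kept = [p for p in token_probs if p >= threshold]
    if not kept:
        return 0.0, 0
    return sum(-math.log(p) for p in kept) / len(kept), len(kept)

loss, n_kept = hard_masked_loss([0.9, 0.2, 0.8])  # the 0.2 token is dropped
```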
DFT: Dynamic Fine-Tuning—a baseline method that reweights token losses based on confidence rather than masking them entirely
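By contrast with hard masking, confidence-based reweighting can be sketched as scaling each token's loss by its probability; this is an illustrative form, and the exact weighting DFT uses may differ:

```python
import math

def reweighted_loss(token_probs):
    # Every token stays in the loss; low-confidence tokens are scaled down
    # smoothly instead of being cut off at a threshold. In a real training
    # loop the weight p would be gradient-detached (stop-gradient); here the
    # probabilities are plain floats, so that is implicit.
    return sum(p * -math.log(p) for p in token_probs) / len(token_probs)

loss = reweighted_loss([0.9, 0.2, 0.8])  # the 0.2 token is kept, but down-weighted
```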