The First Few Tokens Are All You Need: An Efficient and Effective Unsupervised Prefix Fine-Tuning Method for Reasoning Models

Ke Ji, Jiahao Xu, Tian Liang, Qiuzhi Liu, Zhiwei He, Xingyu Chen, Xiaoyuan Liu, Zhijie Wang, Junying Chen, Benyou Wang, Zhaopeng Tu, Haitao Mi, Dong Yu
Tencent, The Chinese University of Hong Kong, Shenzhen, Harbin Institute of Technology, Institute of Automation, Chinese Academy of Sciences
arXiv.org (2025)
Reasoning RL Benchmark

📝 Paper Summary

LLM Reasoning Unsupervised Fine-Tuning
UPFT improves LLM reasoning by fine-tuning on just the first few tokens (prefixes) of model-generated solutions, leveraging the consistency of early reasoning steps without needing ground-truth labels.
Core Problem
Improving LLM reasoning typically requires expensive supervised fine-tuning on labeled data or computationally heavy rejection sampling (generating many solutions and filtering for correctness), both of which are infeasible when ground-truth answers are unavailable.
Why it matters:
  • Reasoning tasks like math often rely on scarce human-annotated data or expensive verification pipelines.
  • Existing self-improvement methods (RFT, STaR) require generating many candidate solutions and filtering them against known answers, consuming massive compute resources.
  • Unsupervised methods are needed for domains where reliable ground-truth labels or verifiers do not exist.
Concrete Example: In math problems, incorrect solutions often start with valid reasoning steps but diverge later. Standard rejection sampling discards these trajectories entirely if the final answer is wrong, wasting the valid initial logic. UPFT learns from the shared initial steps (prefixes) of all generated traces, regardless of the final answer's correctness.
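The prefix self-consistency observation above can be illustrated with a small sketch: given several sampled reasoning traces (as token-ID lists), measure how many initial tokens all of them share. The token IDs and trace contents here are hypothetical, purely for illustration.

```python
def common_prefix_len(traces):
    """Length of the longest token prefix shared by all traces."""
    if not traces:
        return 0
    n = 0
    for tokens in zip(*traces):  # walk the traces position by position
        if len(set(tokens)) > 1:  # traces diverge at this position
            break
        n += 1
    return n

# Three sampled reasoning traces (hypothetical token IDs): they agree on
# the first four tokens, then diverge; only one reaches the right answer.
traces = [
    [5, 9, 2, 7, 11, 3],  # correct solution
    [5, 9, 2, 7, 8, 1],   # wrong final answer, but valid initial steps
    [5, 9, 2, 7, 8, 4],   # wrong final answer, but valid initial steps
]
print(common_prefix_len(traces))  # → 4
```

UPFT's insight is that training on this shared prefix salvages the valid initial logic that rejection sampling would discard along with the incorrect traces.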
Key Novelty
Unsupervised Prefix Fine-Tuning (UPFT)
  • Leverages 'Prefix Self-Consistency': observation that correct and incorrect reasoning paths often share identical initial steps (prefixes).
  • Fine-tunes the model only on these short initial prefixes (e.g., first 64 tokens) of generated solutions without checking correctness, assuming early steps are generally valid.
  • Prevents degradation of general capabilities by mixing in a small amount of full-sequence unsupervised fine-tuning.
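The data-construction recipe described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `prefix_len=64` matches the example prefix length in the summary, while `full_seq_ratio` and the function name are assumptions standing in for however the authors mix in full-sequence examples.

```python
import random

def build_upft_dataset(samples, prefix_len=64, full_seq_ratio=0.1, seed=0):
    """Construct UPFT training examples from unverified model generations.

    samples: list of (prompt_tokens, solution_tokens) pairs, one sampled
    solution per question; no correctness check is performed.
    Most examples keep only the first `prefix_len` solution tokens; a small
    fraction keeps the full sequence to preserve general capabilities.
    """
    rng = random.Random(seed)
    dataset = []
    for prompt, solution in samples:
        if rng.random() < full_seq_ratio:
            target = solution               # full-sequence example
        else:
            target = solution[:prefix_len]  # prefix-only example
        dataset.append({"input": prompt, "target": target})
    return dataset
```

Standard next-token fine-tuning on these truncated targets then requires only one sampled solution per question, which is where the sampling-cost savings over rejection sampling come from.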
Evaluation Highlights
  • Matches performance of supervised Rejection Sampling Fine-Tuning (RFT) while reducing training time by 75% and sampling cost by 99%.
  • Significantly outperforms vanilla unsupervised fine-tuning (SFT on full unverified traces): +5.5% on GSM8K and +2.8% on MATH with Llama-3-8B-Instruct.
  • Achieves 48.4% on MATH using Qwen-Math-7B-Instruct, comparable to RFT (48.8%) but using only 1 sample per question instead of 64.
Breakthrough Assessment
8/10
Highly efficient method that challenges the assumption that full-trace verification is needed for reasoning improvement. Drastic reduction in compute/data costs while matching supervised baselines.