
Optimizing Language Model's Reasoning Abilities with Weak Supervision

Y Tong, S Wang, D Li, Y Wang, S Han, Z Lin, C Huang…
University of California, San Diego, University of Southern California, University of Pennsylvania, Yale University, Washington University in St. Louis
arXiv, May 2024
Reasoning · RL · Benchmark

📝 Paper Summary

Weak-to-strong generalization · Self-improvement · Reasoning · Benchmarks
Self-Reinforcement iteratively improves LLM reasoning: a model is first fine-tuned on a small annotated seed set, then trained on unlabeled data using the quality gap between the fine-tuned model's responses (preferred) and the weaker base model's responses (dispreferred).
Core Problem
Enhancing LLM reasoning typically relies on large-scale datasets fully annotated by human experts, which does not scale as models and data requirements grow.
Why it matters:
  • Scaling laws indicate increasing demand for updated annotated questions, creating a bottleneck of human effort and time
  • Humans may struggle to provide confident answers for extremely hard questions, limiting supervision for superalignment
  • Existing benchmarks often lack unannotated questions needed to explore semi-supervised or weak-to-strong learning
Concrete Example: Current methods like PPO often require a large corpus of human-annotated gold references to distinguish correct reasoning. If a model generates a valid but novel solution to a complex brainteaser absent from the dataset, standard supervised methods may penalize it or fail to learn from it for lack of ground truth.
Key Novelty
Self-Reinforcement with Weak Supervision
  • Iterative improvement cycle: fine-tune a base model on small seed data (SFT); then, on unlabeled data, treat the SFT model's outputs as 'strong' (chosen) and the base model's outputs as 'weak' (rejected).
  • Uses Direct Preference Optimization (DPO) to learn from the relative quality difference between the SFT model and the base model, rather than relying solely on absolute ground truth.
  • Self-filtering mechanism where the model evaluates its own generated pairs (SFT vs. Base) to retain only instances where the SFT response is clearly superior.
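The cycle above can be sketched in a few lines. This is an illustrative sketch, not the paper's implementation: `sft_generate`, `base_generate`, and `self_score` are hypothetical stand-ins for the SFT model, the base model, and the model-as-judge, and log-probabilities are toy scalars.

```python
import math

def dpo_loss(policy_lp_chosen, policy_lp_rejected,
             ref_lp_chosen, ref_lp_rejected, beta=0.1):
    """DPO loss for one preference pair:
    -log sigmoid(beta * ((pi_c - ref_c) - (pi_r - ref_r))),
    where pi_* are policy log-probs and ref_* are frozen-reference log-probs."""
    margin = beta * ((policy_lp_chosen - ref_lp_chosen)
                     - (policy_lp_rejected - ref_lp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

def build_preference_pairs(prompts, sft_generate, base_generate,
                           self_score, threshold=0.5):
    """Self-Reinforcement data construction on unlabeled prompts:
    the SFT model's answer is 'chosen', the base model's is 'rejected'.
    Self-filtering keeps only pairs where the model's own judge rates
    the SFT answer clearly higher (margin > threshold)."""
    pairs = []
    for p in prompts:
        chosen, rejected = sft_generate(p), base_generate(p)
        if self_score(p, chosen) - self_score(p, rejected) > threshold:
            pairs.append((p, chosen, rejected))
    return pairs
```

With no preference margin the loss sits at log 2 and decreases as the policy assigns relatively more probability to the chosen (SFT) response, so no absolute ground-truth label is ever needed, only the relative ordering of the two models' outputs.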
Breakthrough Assessment
6/10
Proposes a logical weak-to-strong pipeline and a new diverse benchmark (PuzzleBen). However, the paper lacks concrete experimental results (tables/numbers) to validate the method's effectiveness.