
Optimizing Language Model's Reasoning Abilities with Weak Supervision

Y Tong, S Wang, D Li, Y Wang, S Han, Z Lin, C Huang…
University of California, San Diego, University of Southern California, University of Pennsylvania, Yale University, Washington University in St. Louis
arXiv, May 2024
Reasoning · RL · Benchmark

📝 Paper Summary

Weak-to-strong generalization · Self-improvement · Reasoning · Benchmarks
Self-Reinforcement iteratively improves LLM reasoning: a model is first fine-tuned on a small annotated seed set, then trained on unlabeled data using the quality gap between the fine-tuned model's responses (preferred) and the weaker base model's responses (dispreferred).
Core Problem
Enhancing LLM reasoning typically relies on large-scale datasets fully annotated by human experts, which does not scale as models and data requirements grow.
Why it matters:
  • Scaling laws indicate increasing demand for updated annotated questions, creating a bottleneck of human effort and time
  • Humans may struggle to provide confident answers for extremely hard questions, limiting supervision for superalignment
  • Existing benchmarks often lack unannotated questions needed to explore semi-supervised or weak-to-strong learning
Concrete Example: Current methods like PPO often require a large corpus of human-annotated gold references to distinguish correct reasoning. If a model generates a valid but novel solution to a complex brainteaser absent from the dataset, standard supervised methods may penalize it or fail to learn from it for lack of ground truth.
Key Novelty
Self-Reinforcement with Weak Supervision
  • Iterative improvement cycle: fine-tune a base model on small seed data (SFT); then, on unlabeled data, treat the SFT model's outputs as 'strong' (chosen) and the base model's outputs as 'weak' (rejected).
  • Uses Direct Preference Optimization (DPO) to learn from the relative quality difference between the SFT model and the base model, rather than relying solely on absolute ground truth.
  • Self-filtering mechanism where the model evaluates its own generated pairs (SFT vs. Base) to retain only instances where the SFT response is clearly superior.
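The cycle above can be sketched in a few lines. This is an illustrative sketch, not the paper's implementation: `sft_generate`, `base_generate`, and `self_score` are hypothetical stand-ins for the SFT model, the base model, and the model-as-judge, and log-probabilities are toy scalars.

```python
import math

def dpo_loss(policy_lp_chosen, policy_lp_rejected,
             ref_lp_chosen, ref_lp_rejected, beta=0.1):
    """DPO loss for one preference pair:
    -log sigmoid(beta * ((pi_c - ref_c) - (pi_r - ref_r))),
    where pi_* are policy log-probs and ref_* are frozen-reference log-probs."""
    margin = beta * ((policy_lp_chosen - ref_lp_chosen)
                     - (policy_lp_rejected - ref_lp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

def build_preference_pairs(prompts, sft_generate, base_generate,
                           self_score, threshold=0.5):
    """Self-Reinforcement data construction on unlabeled prompts:
    the SFT model's answer is 'chosen', the base model's is 'rejected'.
    Self-filtering keeps only pairs where the model's own judge rates
    the SFT answer clearly higher (margin > threshold)."""
    pairs = []
    for p in prompts:
        chosen, rejected = sft_generate(p), base_generate(p)
        if self_score(p, chosen) - self_score(p, rejected) > threshold:
            pairs.append((p, chosen, rejected))
    return pairs
```

With no preference margin the loss sits at log 2 and decreases as the policy assigns relatively more probability to the chosen (SFT) response, so no absolute ground-truth label is ever needed, only the relative ordering of the two models' outputs.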
Breakthrough Assessment
6/10
Proposes a logical weak-to-strong pipeline and a new diverse benchmark (PuzzleBen). However, the paper lacks concrete experimental results (tables/numbers) to validate the method's effectiveness.