Curriculum Learning for Efficient Chain-of-Thought Distillation via Structure-Aware Masking and GRPO

📝 Paper Summary

Knowledge Distillation, Chain-of-Thought (CoT) Reasoning

BRIDGE distills verbose reasoning from large teachers into compact students by establishing structural understanding via reconstruction, then optimizing brevity via reinforcement learning on masked tasks.

Core Problem

Distilling verbose Chain-of-Thought reasoning into small models fails because small models lack the capacity to memorize long sequences, leading to truncated or incoherent outputs.

Why it matters:

Deploying powerful reasoning capabilities on edge devices requires compact models (e.g., 3B parameters) that remain explicit and verifiable
Directly fine-tuning small models on teacher outputs results in repetition loops or superficial mimicry without genuine understanding due to capacity mismatch
Existing compression methods either sacrifice interpretability (implicit reasoning) or logical integrity (heuristic pruning)

Concrete Example: When a 3B model tries to learn from a 14B teacher's lengthy solution, it often produces truncated outputs or repetition loops because it cannot sustain the long-term dependencies required for the full chain.

Key Novelty

BRIDGE: Structure-Aware Curriculum for CoT Compression

Teaches structural logic first by forcing the student to reconstruct shuffled and masked teacher reasoning chains, ensuring it learns dependencies rather than just copying
Optimizes the accuracy-brevity trade-off using Group Relative Policy Optimization (GRPO) with a hierarchical reward that prioritizes correctness before efficiency
Handles difficult failure cases by providing teacher scaffolds and asking the student to rewrite them concisely, enabling internalization of complex logic

Architecture

The three-stage BRIDGE curriculum framework

Evaluation Highlights

+11.29% accuracy improvement on GSM8K using Qwen2.5-3B-Base compared to standard distillation baselines
27.4% reduction in output length while maintaining or improving reasoning accuracy
Compresses 96.83% of teacher solutions for difficult cases when provided with structural scaffolds

Breakthrough Assessment

8/10

Significantly improves small model reasoning while reducing computational cost. The curriculum approach effectively solves the capacity mismatch problem that plagues standard distillation.

⚙️ Technical Details

Problem Definition

Setting: Distilling a teacher policy into a student policy with limited capacity

Inputs: Question q sampled from dataset

Outputs: Correct but concise reasoning chain r

Pipeline Flow

Stage 1: Structure-Aware Reconstruction (SFT on shuffled/masked chains)
Stage 2: GRPO-Based Compression (RL on masked completion)
Stage 3: Teacher-Guided Internalization (RL on rewriting hard cases)

System Modules

Reconstruction Module

Learn reasoning structure by reconstructing shuffled and masked teacher outputs

Model or implementation: Qwen2.5-3B-Base

Compression Optimizer

Balance accuracy and brevity via self-exploration

Model or implementation: Student Policy (from Stage 1)

Internalization Module

Internalize logic for hard cases using teacher scaffolds

Model or implementation: Student Policy (from Stage 2)

Novel Architectural Elements

Three-stage curriculum explicitly separating structural learning, compression, and internalization
Use of shuffling and masking in SFT to force dependency learning rather than verbatim memorization

Modeling

Base Model: Qwen2.5-3B-Base

Training Method: Curriculum Learning combining SFT and GRPO

Objective Functions:

Purpose: Reconstruction loss to learn structure.
Formally: Cross-entropy on restoring ordered tokens from shuffled/masked inputs.

Purpose: Optimization objective for accuracy and brevity.
Formally: Maximize the expected reward minus a KL-divergence penalty (weighted by beta_KL), optimized via GRPO.

Purpose: Hierarchical reward to prevent reward hacking.
Formally: R = R_base + I[Correct] * gamma * (1 - |r| / |r_baseline|), where I[Correct] is 1 for a correct answer and 0 otherwise, so efficiency is rewarded only when the answer is correct.
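The hierarchical reward can be illustrated with a short Python sketch. This is a sketch under stated assumptions, not the paper's implementation: the summary does not define R_base or report gamma, so the values used here (1.0 for a correct answer, 0.0 otherwise, gamma = 0.5) and all function and parameter names are hypothetical. Lengths are assumed to be token counts.

```python
def hierarchical_reward(is_correct: bool, length: int, baseline_length: int,
                        gamma: float = 0.5,
                        r_base_correct: float = 1.0,
                        r_base_wrong: float = 0.0) -> float:
    """R = R_base + I[Correct] * gamma * (1 - |r| / |r_baseline|).

    gamma and the two R_base values are illustrative assumptions.
    """
    if baseline_length <= 0:
        raise ValueError("baseline_length must be positive")
    if not is_correct:
        # The indicator is 0, so brevity earns nothing for a wrong answer.
        return r_base_wrong
    # Correct answers get a bonus that grows as the output gets shorter.
    return r_base_correct + gamma * (1.0 - length / baseline_length)


print(hierarchical_reward(True, 50, 100))   # correct, half the baseline length
print(hierarchical_reward(False, 10, 100))  # wrong, very short
```

As written, the formula gives a negative brevity term when a correct answer is longer than the baseline; the summary does not say whether the paper clips this term.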

Training Data:

GSM8K dataset
Failure cases identified dynamically after Stage 2

Key Hyperparameters:

masking_rate: 15% of steps (Stage 1)
shuffling_rate: 100% of samples (Stage 1)
sample_mask_probability: 0.7 (Stage 1)
GRPO_group_size: Not explicitly reported in the paper
beta_KL: Controls strength of KL regularization
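One possible reading of the Stage 1 hyperparameters is sketched below in Python. It assumes that sample_mask_probability is the chance that a given sample is masked at all, that masking_rate is the fraction of reasoning steps replaced within a masked sample, and that every sample is shuffled. The `[MASK]` placeholder and the function name are hypothetical; the summary does not specify them.

```python
import random

MASK_TOKEN = "[MASK]"  # hypothetical placeholder; not named in the summary


def corrupt_chain(steps, masking_rate=0.15, sample_mask_probability=0.7, rng=None):
    """Return (corrupted_steps, target_steps) for one teacher reasoning chain."""
    rng = rng or random.Random()
    corrupted = list(steps)
    # Assumption: a sample is masked with probability sample_mask_probability.
    if corrupted and rng.random() < sample_mask_probability:
        n_mask = max(1, round(masking_rate * len(corrupted)))
        for i in rng.sample(range(len(corrupted)), n_mask):
            corrupted[i] = MASK_TOKEN
    # shuffling_rate is 100%, so every sample is shuffled.
    rng.shuffle(corrupted)
    # The SFT target is the original, ordered, unmasked chain.
    return corrupted, list(steps)


steps = ["Tom has 3 apples.", "He buys 2 more.", "3 + 2 = 5.", "Answer: 5"]
corrupted, target = corrupt_chain(steps, rng=random.Random(0))
print(corrupted)
print(target)
```

The student would then be trained with cross-entropy to emit `target` given the question and `corrupted`, which is what pushes it to recover step order and missing steps rather than copy the input.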

Comparison to Prior Work

vs. Standard SFT: Uses curriculum with reconstruction tasks instead of direct memorization
vs. Step-BERT: Generative reconstruction instead of discriminative re-ordering [not cited in paper]
vs. Implicit Reasoning: Maintains explicit, verifiable text output

Limitations

Relies on the availability of a high-quality teacher model (e.g., DeepSeek-R1-14B)
Curriculum adds complexity compared to single-stage distillation
Experiments focus primarily on GSM8K; generalization to other domains is not covered in this summary

Reproducibility

Methodology is described in detail, including reward formulation and curriculum stages. Code is not provided. Hyperparameters for masking are given.

📊 Experiments & Results

Evaluation Setup

Distilling reasoning capabilities for arithmetic problems

Benchmarks:

GSM8K (Arithmetic Reasoning)

Metrics:

Accuracy
Output Length (Tokens)
Statistical methodology: Not explicitly reported in the paper
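The two metrics can be computed as in the Python sketch below. The summary does not state how the paper matches answers or averages lengths, so exact string match on the final answer and a reduction measured over total token counts are assumptions, as are the function names.

```python
def accuracy(predictions, references):
    """Fraction of final answers that exactly match the reference answer."""
    if len(predictions) != len(references) or not references:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(p == r for p, r in zip(predictions, references)) / len(references)


def length_reduction(baseline_lengths, compressed_lengths):
    """Relative reduction in total output tokens versus a baseline."""
    total_baseline = sum(baseline_lengths)
    if total_baseline <= 0:
        raise ValueError("baseline lengths must sum to a positive value")
    return 1.0 - sum(compressed_lengths) / total_baseline


print(accuracy(["18", "5"], ["18", "6"]))
print(length_reduction([100, 100], [80, 65]))
```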

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
GSM8K	Accuracy	Not explicitly reported in the paper	Not explicitly reported in the paper	+11.29%
GSM8K	Output Length (Tokens)	Not explicitly reported in the paper	Not explicitly reported in the paper	-27.4%

Main Takeaways

Direct SFT on verbose CoT harms small models due to capacity mismatch.
Structural understanding (Stage 1) is a prerequisite for effective compression.
Teacher-guided rewriting (Stage 3) effectively recovers performance on hard cases where the student initially fails.
Hierarchical rewards prevent reward hacking where models generate short but incorrect answers.

📚 Prerequisite Knowledge

Prerequisites

Chain-of-Thought (CoT) reasoning
Knowledge Distillation
Reinforcement Learning (RLHF/PPO concepts)

Key Terms

CoT: Chain-of-Thought—a prompting technique where models generate intermediate reasoning steps before the final answer

GRPO: Group Relative Policy Optimization—a reinforcement learning algorithm that estimates baselines from group averages of outputs for the same input, eliminating the need for a separate critic model
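The group-relative baseline can be shown in a few lines of Python. This is a minimal sketch of the usual GRPO advantage normalization (reward minus group mean, divided by group standard deviation), not code from the paper; the use of the population standard deviation and the `eps` stabilizer are assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards of outputs sampled for the same question.

    The group mean plays the role that a learned critic plays in PPO.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one question: two correct, two wrong.
print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))
```

Outputs that score above their group's average get a positive advantage and are reinforced; if every output in the group receives the same reward, all advantages are zero and that question contributes no gradient signal.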

SFT: Supervised Fine-Tuning—training a model on labeled examples using standard cross-entropy loss

Curriculum Learning: A training strategy where the model is presented with increasingly difficult or complex tasks/objectives over time

Capacity Mismatch: The gap in representational power between a large teacher model and a small student model, making direct imitation difficult

Structure-Aware Reconstruction: The Stage 1 supervised fine-tuning task in which the model must reorder and fill in blanks in a corrupted (shuffled and masked) teacher reasoning chain to learn logical dependencies

Teacher Scaffolding: Providing the teacher's full solution as part of the prompt to guide the student during the rewriting phase
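A scaffolded rewriting prompt might be assembled as in the Python sketch below. The template wording and the function name are hypothetical, written only to illustrate the idea of placing the teacher's full solution in the prompt; the summary does not give the paper's actual prompt.

```python
def build_rewrite_prompt(question: str, teacher_solution: str) -> str:
    """Build a Stage 3 style prompt (hypothetical template, not from the paper)."""
    return (
        "Question:\n"
        f"{question.strip()}\n\n"
        "Reference solution from the teacher:\n"
        f"{teacher_solution.strip()}\n\n"
        "Rewrite the reference solution as concisely as possible while "
        "keeping every step needed to reach the correct final answer."
    )


prompt = build_rewrite_prompt(
    "Tom has 3 apples and buys 2 more. How many apples does he have?",
    "Tom starts with 3 apples. He buys 2 more, so 3 + 2 = 5. The answer is 5.",
)
print(prompt)
```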