CoT: Chain-of-Thought—a prompting technique where the model generates intermediate reasoning steps before the final answer
LLM-as-judge: Using a Large Language Model to evaluate the quality or correctness of outputs from another model
Gumbel-Softmax: A method to approximate sampling from a categorical distribution in a differentiable way, allowing backpropagation through discrete choices
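A minimal sketch of the Gumbel-Softmax relaxation in numpy (function name and logits are illustrative, not from the source): Gumbel(0, 1) noise is added to the logits, and a temperature-scaled softmax turns the perturbed logits into a "soft" sample that approaches a one-hot vector as the temperature goes to zero.

```python
import numpy as np

def gumbel_softmax(logits, tau=1.0, rng=None):
    """Differentiable 'soft' sample from the categorical distribution
    defined by `logits` (Gumbel-Softmax / Concrete relaxation)."""
    if rng is None:
        rng = np.random.default_rng()
    # Gumbel(0, 1) noise: g = -log(-log(U)), U ~ Uniform(0, 1).
    u = rng.uniform(low=1e-12, high=1.0, size=np.shape(logits))
    g = -np.log(-np.log(u))
    # Perturb the logits, then apply a temperature-scaled softmax.
    y = (np.asarray(logits) + g) / tau
    y = np.exp(y - y.max())
    return y / y.sum()

logits = np.array([2.0, 0.5, 0.1])     # hypothetical routing logits
sample = gumbel_softmax(logits, tau=0.5, rng=np.random.default_rng(0))
# `sample` is a valid probability vector; lowering `tau` pushes it
# toward one-hot, so the discrete choice stays differentiable.
```

Because the softmax is smooth in the logits, gradients flow through `sample` during backpropagation, which is exactly what a hard `argmax` would prevent.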
QLoRA: Quantized Low-Rank Adaptation—a memory-efficient fine-tuning technique for large language models
SFT: Supervised Fine-Tuning—training a model on labeled examples
ToT: Tree-of-Thought—a prompting method that explores multiple reasoning paths as a tree, evaluating and backtracking among branches before committing to an answer
RoBERTa: A robustly optimized BERT pretraining approach, used here as an encoder for routing
SKD: Symbolic Knowledge Distillation—a baseline method training on teacher-generated CoTs
Entropy regularization: Adding a term to the loss function that encourages the probability distribution (here, routing decisions) to be more spread out, preventing collapse to a single option
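A minimal sketch of the entropy term (names and coefficient are illustrative, not from the source): the Shannon entropy of the routing distribution is subtracted from the task loss, so spread-out distributions are rewarded and collapse onto a single option is penalized.

```python
import numpy as np

def entropy(probs, eps=1e-12):
    """Shannon entropy of a probability vector (e.g. routing weights)."""
    p = np.clip(probs, eps, 1.0)
    return -np.sum(p * np.log(p))

uniform = np.ones(4) / 4                      # maximally spread routing
peaked = np.array([0.97, 0.01, 0.01, 0.01])   # near-collapsed routing

# Regularized objective: loss = task_loss - coeff * entropy(routing).
# Uniform routing has entropy log(4); the peaked distribution's entropy
# is much lower, so it incurs a larger regularized loss.
h_uniform = entropy(uniform)
h_peaked = entropy(peaked)
```

The coefficient on the entropy term trades off task performance against routing diversity; too large a value forces near-uniform routing, too small a value lets the router collapse anyway.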