Teaching Small Language Models to Reason for Knowledge-Intensive Multi-Hop Question Answering

📝 Paper Summary

Knowledge Distillation · Small Language Models (SLMs) · Multi-Hop Question Answering

D&R Distillation splits reasoning into two interacting small student models—a Decomposer that asks sub-questions and a Responser that answers them using retrieval—allowing small models to solve complex multi-hop tasks efficiently.

Core Problem

Chain-of-Thought Distillation (CoTD) fails for small language models on knowledge-intensive multi-hop tasks because SLMs lack the capacity to memorize vast knowledge and struggle to learn integrated decomposition and reasoning simultaneously.

Why it matters:

Existing CoT distillation methods work well for arithmetic but fail when heavy external knowledge is required
Small models cannot effectively utilize one-step retrieval because relevance often depends on intermediate reasoning steps
Learning all sub-tasks (decomposition, retrieval, reasoning) in a single model is inefficient and requires massive training data

Concrete Example: For the question 'What is the elevation range for the area that the eastern sector of the Colorado orogeny extends into?', a standard CoT-distilled small model fails to retrieve the necessary intermediate fact ('extends into the High Plains') and cannot answer. The proposed method decomposes this into 'What does the eastern sector... extend into?' first, retrieves 'High Plains', then asks about the elevation of High Plains.

Key Novelty

Decompose-and-Response (D&R) Distillation

Decouples the reasoning process into two separate student models: a Decomposer (asks sub-questions) and a Responser (answers sub-questions using retrieval)
Transforms complex multi-hop reasoning into an interactive dialogue where the Responser provides external knowledge at every step, reducing the cognitive load on each individual small model

Architecture

The three-stage pipeline: (1) Generating data with LLM self-ask prompting, (2) Distilling the Decomposer and Responser separately, (3) Interactive inference where Decomposer asks and Responser answers.
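Stage (2) can be illustrated by turning one teacher trace into training pairs for each student. The sketch below is a minimal illustration, not the authors' code: the `Trace` type, the "Follow up:" / "Intermediate answer:" / "Final answer:" markers, and the per-step splitting of the Decomposer targets are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Trace:
    """One teacher-generated self-ask trace (hypothetical container)."""
    question: str
    steps: List[Tuple[str, str]]  # (sub-question, intermediate answer)
    final_answer: str


def serialize_history(question: str, steps: List[Tuple[str, str]]) -> str:
    """Flatten the question and the dialogue so far into one input string."""
    parts = [f"Question: {question}"]
    for sub_q, sub_a in steps:
        parts.append(f"Follow up: {sub_q}")
        parts.append(f"Intermediate answer: {sub_a}")
    return "\n".join(parts)


def decomposer_examples(trace: Trace) -> List[Tuple[str, str]]:
    """(input, target) pairs: history so far -> next sub-question or final answer."""
    examples = []
    for i, (sub_q, _) in enumerate(trace.steps):
        source = serialize_history(trace.question, trace.steps[:i])
        examples.append((source, f"Follow up: {sub_q}"))
    source = serialize_history(trace.question, trace.steps)
    examples.append((source, f"Final answer: {trace.final_answer}"))
    return examples


def responser_examples(
    trace: Trace, passages_per_step: List[List[str]]
) -> List[Tuple[str, str]]:
    """(input, target) pairs: sub-question plus retrieved passages -> intermediate answer."""
    examples = []
    for (sub_q, sub_a), passages in zip(trace.steps, passages_per_step):
        source = f"Question: {sub_q}\nContext: " + " ".join(passages)
        examples.append((source, sub_a))
    return examples
```

A trace with T sub-questions yields T + 1 Decomposer pairs and T Responser pairs under this scheme, so each student sees only the sub-task it is responsible for.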

Evaluation Highlights

Outperforms 11B FLAN-T5-XXL (few-shot) using only two 220M T5-Base models on HotpotQA and 2WikiMultiHopQA
+8.2% Answer F1 improvement over Fine-tuning baseline on 2WikiMultiHopQA with T5-Base
Achieves superior performance using only 1/10th of the training data compared to standard Chain-of-Thought Distillation (CoTD) baselines

Breakthrough Assessment

7/10

Strong practical contribution for deploying efficient small models. It narrows the 'knowledge gap' in distillation through architectural decomposition, though the core components (T5, BM25) are standard.

⚙️ Technical Details

Problem Definition

Setting: Knowledge-intensive multi-hop question answering where external knowledge is required to answer complex queries

Inputs: Natural language question q

Outputs: Final answer o (derived through a sequence of sub-questions and intermediate answers)

Pipeline Flow

Decomposer (generates sub-question or final answer)
Retriever (fetches documents for sub-question)
Responser (answers sub-question using documents)
Loop back to Decomposer with interaction history
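The loop above can be sketched as a short driver function. This is a minimal sketch under stated assumptions: the three components are passed in as plain callables, the Decomposer is assumed to signal termination with a "Final answer:" prefix, and `max_hops` and `top_k` are hypothetical parameters that the summary does not specify.

```python
from typing import Callable, List, Optional, Tuple

History = List[Tuple[str, str]]  # (sub-question, intermediate answer) pairs

FINAL_PREFIX = "Final answer:"  # assumed termination marker, not from the paper


def dr_inference(
    question: str,
    decomposer: Callable[[str, History], str],
    retriever: Callable[[str, int], List[str]],
    responser: Callable[[str, List[str]], str],
    max_hops: int = 5,
    top_k: int = 3,
) -> Tuple[Optional[str], History]:
    """Alternate between Decomposer and Responser until a final answer appears."""
    history: History = []
    for _ in range(max_hops):
        output = decomposer(question, history).strip()
        if output.startswith(FINAL_PREFIX):
            return output[len(FINAL_PREFIX):].strip(), history
        sub_question = output
        passages = retriever(sub_question, top_k)      # retrieval per sub-question
        sub_answer = responser(sub_question, passages)
        history.append((sub_question, sub_answer))     # fed back to the Decomposer
    # Hop budget exhausted without a final answer.
    return None, history
```

Capping the number of hops is a design choice of this sketch; it keeps the loop from running indefinitely if the Decomposer never emits a final answer.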

System Modules

Decomposer

Decides whether to ask a sub-question or predict the final answer based on history

Model or implementation: T5-Small/Base/Large (fine-tuned)

Retriever

Retrieves relevant background knowledge for the specific sub-question

Model or implementation: BM25 (Sparse Retriever)

Responser

Generates an intermediate answer to the sub-question using retrieved context

Model or implementation: T5-Small/Base/Large (fine-tuned)
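For readers unfamiliar with the Retriever module, the following is a minimal, self-contained sketch of Okapi BM25 scoring over pre-tokenized documents. It is illustrative only and is not how the paper's retriever is implemented. The `k1` and `b` values are common textbook defaults chosen for the example, and the IDF uses the non-negative `log(1 + ...)` variant.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(
    query: List[str],
    docs: List[List[str]],
    k1: float = 1.2,   # term-frequency saturation (assumed default)
    b: float = 0.75,   # length normalization strength (assumed default)
) -> List[float]:
    """Return one BM25 score per document for a tokenized query."""
    if not docs:
        return []
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

Because scoring depends only on term overlap, a sub-question that names the intermediate entity explicitly gives the retriever terms that the original multi-hop question does not contain.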

Novel Architectural Elements

Dual-student interaction framework: Instead of one student learning CoT, two distinct students (Decomposer/Responser) interact iteratively
Dynamic multi-step retrieval: Retrieval is triggered per sub-question rather than once for the global question

Modeling

Base Model: T5-Small (60M), T5-Base (220M), T5-Large (770M)

Training Method: Supervised Fine-Tuning (Distillation)

Objective Functions:

Purpose: Train Decomposer to generate sub-questions or final answers.

Formally: Minimize negative log-likelihood of sequence of sub-questions and final answer conditioned on history.

Purpose: Train Responser to answer sub-questions given retrieved context.

Formally: Minimize negative log-likelihood of intermediate answer sequence conditioned on sub-question and retrieved passages.
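One way to write the two objectives is given below. The notation is an assumption introduced here and may differ from the paper's: q is the question, o the final answer, s_t and a_t the t-th sub-question and intermediate answer, D_t the passages retrieved for s_t, h_{t-1} the interaction history before step t, and θ_D, θ_R the Decomposer and Responser parameters.

```latex
% Decomposer: predict the next sub-question, or the final answer at step T+1
\mathcal{L}_{\mathrm{Dec}}(\theta_D)
  = -\sum_{t=1}^{T+1} \log p_{\theta_D}\!\left(y_t \mid q,\, h_{t-1}\right),
\qquad
y_t =
\begin{cases}
  s_t, & t \le T \\
  o,   & t = T+1
\end{cases},
\qquad
h_{t-1} = (s_1, a_1, \dots, s_{t-1}, a_{t-1})

% Responser: predict the intermediate answer from the sub-question and passages
\mathcal{L}_{\mathrm{Res}}(\theta_R)
  = -\sum_{t=1}^{T} \log p_{\theta_R}\!\left(a_t \mid s_t,\, D_t\right)
```

Each log-probability factorizes over output tokens in the usual autoregressive way, so both losses reduce to standard sequence-to-sequence cross-entropy.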

Training Data:

Teacher (GPT-3.5) generates samples via Self-Ask-Self-Ans prompting
Kept only samples whose final answer matches the ground truth (F1 > 0.7); all other samples were discarded
Used 1/10th of full training data for HotpotQA/2Wiki, 1/2 for StrategyQA
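The filtering step can be sketched with a token-level F1 check. This is a hedged illustration: the normalization follows the common SQuAD-style recipe (lowercasing, stripping punctuation and articles), which is an assumption, since the summary does not say how the paper normalizes answers. The function names are hypothetical.

```python
import re
import string
from collections import Counter
from typing import List


def normalize(text: str) -> List[str]:
    """Lowercase, drop punctuation and English articles, split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def answer_f1(prediction: str, gold: str) -> float:
    """Token-overlap F1 between a predicted answer and the gold answer."""
    pred_tokens, gold_tokens = normalize(prediction), normalize(gold)
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def keep_sample(teacher_answer: str, gold: str, threshold: float = 0.7) -> bool:
    """Keep a teacher trace only if its final answer is close enough to gold."""
    return answer_f1(teacher_answer, gold) > threshold
```

A threshold below 1.0 tolerates small surface differences in otherwise correct teacher answers (for example, an extra modifier word) while still rejecting answers that share only one token with the gold answer.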

Key Hyperparameters:

learning_rate: 3e-4
batch_size: 16
epochs: 20
optimizer: AdamW
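The listed settings can be collected in a small configuration object. The class and the `total_steps` helper are hypothetical conveniences for this summary, and the step count assumes one optimizer update per batch with no gradient accumulation, which the summary does not confirm.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class FinetuneConfig:
    """Fine-tuning settings as listed above (container name is illustrative)."""
    learning_rate: float = 3e-4
    batch_size: int = 16
    epochs: int = 20
    optimizer: str = "AdamW"

    def total_steps(self, num_examples: int) -> int:
        """Optimizer updates for a dataset, assuming one update per batch."""
        return math.ceil(num_examples / self.batch_size) * self.epochs
```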

Compute: Run on 2 NVIDIA RTX 3090 GPUs

Comparison to Prior Work

vs. CoTD: D&R uses two models and accesses external knowledge interactively, whereas CoTD relies on internal parameter knowledge
vs. RA-CoTD: D&R retrieves per sub-question (multi-step) vs. RA-CoTD's single-step retrieval; D&R decomposes task into two specialized models
vs. Least-to-Most Prompting [not cited in paper]: Similar decomposition logic, but D&R focuses on distillation into small models via interaction rather than just prompting LLMs

Limitations

Designed specifically for knowledge-intensive reasoning; applicability to other reasoning types (e.g., pure symbolic) not tested
Experiments limited to models < 1B parameters due to resource constraints
Relies on the quality of the Teacher LLM (GPT-3.5) and the retriever (BM25)

Reproducibility

Code: https://github.com/Xiang-Li-oss/D-R-Distillation

Publicly available code. Uses public datasets (HotpotQA, 2WikiMultiHopQA, StrategyQA). Relies on OpenAI API (GPT-3.5) for data generation. Uses Pyserini for BM25 retrieval.

📊 Experiments & Results

Evaluation Setup

Open-domain QA using Wikipedia dump (KILT version for HotpotQA/2Wiki)

Benchmarks:

HotpotQA (Multi-hop Question Answering)
2WikiMultiHopQA (Multi-hop Question Answering)
StrategyQA (Implicit Reasoning QA)

Metrics:

Answer F1
Answer Exact Match (EM)
Answer Accuracy (for StrategyQA)
Statistical methodology: Not explicitly reported in the paper

Key Results

D&R Distillation significantly outperforms baselines on HotpotQA across varying model sizes, even with reduced training data.
Benchmark	Metric	Baseline	This Paper	Δ
HotpotQA	Answer F1	22.2	27.9	+5.7
HotpotQA	Answer F1	22.1	30.4	+8.3
2WikiMultiHopQA	Answer F1	35.8	39.4	+3.6
StrategyQA	Answer Accuracy	56.6	59.0	+2.4
HotpotQA	Answer F1	49.2	27.9	-21.3
2WikiMultiHopQA	Retrieval Recall	35.0	55.6	+20.6

Experiment Figures

Efficiency analysis comparing D&R Distillation against baselines across model sizes (3a) and training data ratios (3b).

Comparison of retrieval recall between one-step retrieval (OneR) and D&R's iterative retrieval.

Main Takeaways

D&R Distillation outperforms standard CoT and Retrieval-Augmented CoT distillation across all datasets and model sizes tested.
Two 220M T5 models using D&R Distillation can outperform an 11B FLAN-T5-XXL model on HotpotQA and 2WikiMultiHopQA, with roughly 25x fewer total parameters (about 440M versus 11B).
The method requires significantly less training data (1/10th) to achieve these results compared to baselines using full data.
Interactive multi-step retrieval (Decomposer + Responser) yields much higher retrieval recall than one-step retrieval, mitigating hallucinations caused by missing knowledge.
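The recall comparison in the last takeaway can be made concrete with a small sketch. The definition used here (the fraction of gold supporting documents that appear among the retrieved ones, pooled over all steps for the iterative case) is an assumption, since the summary does not state how the paper computes retrieval recall. The function names are hypothetical.

```python
from typing import Iterable, List, Set


def retrieval_recall(retrieved: Iterable[str], gold: Set[str]) -> float:
    """Fraction of gold supporting document IDs found among retrieved IDs."""
    if not gold:
        return 0.0
    return len(set(retrieved) & gold) / len(gold)


def multi_step_recall(per_step_retrieved: List[List[str]], gold: Set[str]) -> float:
    """Recall after pooling the documents retrieved at every sub-question."""
    pooled = [doc_id for step in per_step_retrieved for doc_id in step]
    return retrieval_recall(pooled, gold)
```

Under this definition, one-step retrieval is the special case of a single step whose query is the original question, so the two settings are directly comparable.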

📚 Prerequisite Knowledge

Prerequisites

Chain-of-Thought (CoT) Prompting
Knowledge Distillation
Retrieval-Augmented Generation (RAG)

Key Terms

CoTD: Chain-of-Thought Distillation—teaching small models to reason by fine-tuning them on rationales generated by large models

SLM: Small Language Model—models with significantly fewer parameters (e.g., <1B) compared to LLMs

Self-Ask-Self-Ans: A prompting strategy where an LLM iteratively asks and answers its own sub-questions to solve a complex problem

BM25: Best Matching 25—a probabilistic information retrieval algorithm based on term frequency and inverse document frequency

Decomposer: The student model responsible for breaking down the main question into sub-questions or deciding the final answer

Responser: The student model responsible for answering the sub-questions generated by the Decomposer, using retrieved documents

F1 score: A metric measuring the overlap between the predicted answer and the ground truth answer

EM: Exact Match—a metric checking if the predicted answer is identical to the ground truth

Hallucination: When a model generates plausible-sounding but factually incorrect information