IRCoT: Interleaving Retrieval with Chain-of-Thought Reasoning for Knowledge-Intensive Multi-Step Questions

📝 Paper Summary

Modularized RAG pipeline, Agentic RAG pipeline

IRCoT alternates between generating reasoning steps (Chain-of-Thought) and using those steps as search queries to progressively retrieve missing information for complex questions.

Core Problem

Standard one-step retrieval fails for multi-hop questions because the necessary search terms for later steps only become apparent after partial reasoning on initial evidence.

Why it matters:

LLMs struggle with open-domain questions when the required knowledge is missing from their parameters or is out of date
One-shot retrieval often misses evidence that has no lexical overlap with the original question but is crucial for intermediate reasoning steps
Without retrieving supporting facts first, models hallucinate reasoning steps; without correct reasoning steps, models cannot retrieve the next necessary fact

Concrete Example: Question: 'In what country was Lost Gravity manufactured?' Initial retrieval on 'Lost Gravity' finds a roller coaster description but not the country. The model must first infer the manufacturer ('Mack Rides') from the text, then use that new term to retrieve the manufacturing location ('Germany'). One-step retrieval misses this connection.

Key Novelty

Interleaved Retrieval guided by Chain-of-Thought (IRCoT)

Uses the LLM's generated reasoning sentence as a dynamic search query to find new paragraphs
Uses the newly retrieved paragraphs to inform the generation of the next reasoning sentence
This cycle repeats until the answer is found, allowing retrieval and reasoning to guide each other step-by-step

Architecture

Overview of the IRCoT method compared to standard one-step retrieval

Evaluation Highlights

Improves retrieval recall by 11-21 points over one-step retrieval baselines on datasets like HotpotQA and 2WikiMultihopQA
Boosts downstream few-shot QA performance by up to 15 F1 points using GPT-3 (code-davinci-002)
Reduces factual errors in generated Chain-of-Thought reasoning by up to 50% compared to baselines
Smaller model (Flan-T5-XL, 3B) with IRCoT outperforms a 58x larger GPT-3 model using standard one-step retrieval

Breakthrough Assessment

9/10

Significantly advances few-shot multi-step QA by solving the disconnect between static retrieval and dynamic reasoning. Demonstrated efficacy across model sizes (3B to 175B) and OOD settings without training.

⚙️ Technical Details

Problem Definition

Setting: Few-shot open-domain multi-step Question Answering (QA)

Inputs: Natural language question Q and a large corpus of documents

Outputs: Final answer A and a sequence of reasoning steps (Chain-of-Thought)

Pipeline Flow

Base Retrieval (Initial query using Question)
Interleaved Cycle (Extend CoT → Expand Retrieved Info → Repeat)
Final QA (Read all collected paragraphs → Answer)
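
The three stages above can be sketched as a small loop. This is a minimal illustration, not the authors' implementation: `retrieve` and `next_cot_sentence` are hypothetical callables standing in for BM25 and the LLM, the "answer is" stop marker is an assumption, and the three-paragraph corpus and scripted reasoner are toy data built around the Lost Gravity example.

```python
import re
from typing import Callable, List, Tuple

def ircot(
    question: str,
    retrieve: Callable[[str, int], List[str]],
    next_cot_sentence: Callable[[str, List[str], List[str]], str],
    k: int = 2,
    max_reasoning_steps: int = 8,
    max_collected_paragraphs: int = 15,
) -> Tuple[List[str], List[str]]:
    """Interleave reasoning and retrieval; return (paragraphs, cot_sentences)."""
    # Base retrieval: the question itself is the first query.
    paragraphs: List[str] = []
    for p in retrieve(question, k):
        if p not in paragraphs:
            paragraphs.append(p)
    cot: List[str] = []
    for _ in range(max_reasoning_steps):
        # Reason: extend the chain of thought by one sentence.
        sentence = next_cot_sentence(question, paragraphs, cot)
        cot.append(sentence)
        if "answer is" in sentence.lower():  # assumed termination marker
            break
        # Retrieve: use the newest CoT sentence as the next query.
        for p in retrieve(sentence, k):
            if p not in paragraphs and len(paragraphs) < max_collected_paragraphs:
                paragraphs.append(p)
    return paragraphs, cot

# --- Toy stand-ins (hypothetical, for illustration only) ---
CORPUS = [
    "Lost Gravity is a roller coaster manufactured by Mack Rides.",
    "Mack Rides is a company from Germany.",
    "Gravity is a 2013 science fiction film.",
]

def _tokens(text: str) -> set:
    return set(re.findall(r"\w+", text.lower()))

def toy_retrieve(query: str, k: int) -> List[str]:
    # Rank paragraphs by word overlap with the query (a crude lexical retriever).
    ranked = sorted(CORPUS, key=lambda p: -len(_tokens(p) & _tokens(query)))
    return ranked[:k]

def toy_reasoner(question: str, paragraphs: List[str], cot: List[str]) -> str:
    # Scripted replacement for the LLM: emits one sentence per call.
    if not cot:
        return "Lost Gravity was manufactured by Mack Rides."
    if "Germany" in " ".join(paragraphs):
        return "Mack Rides is a company from Germany. So the answer is: Germany."
    return "I need to find where Mack Rides is based."

question = "In what country was Lost Gravity manufactured?"
paragraphs, cot = ircot(question, toy_retrieve, toy_reasoner)
print(cot[-1])
```

In this toy run, one-step retrieval on the question alone never surfaces the Mack Rides paragraph, because it shares no words with the question; the paragraph is only found once the first reasoning sentence introduces the term "Mack Rides". The Final QA stage (reading all collected paragraphs to produce the answer) is left out of the sketch.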

System Modules

Base Retriever (Retrieval)

Perform initial retrieval using the question Q to get a base set of paragraphs

Model or implementation: BM25 (Elasticsearch)

CoT Reasoner

Generate the next sentence in the chain-of-thought based on current context

Model or implementation: GPT-3 (code-davinci-002) or Flan-T5

CoT-Guided Retriever (Retrieval)

Retrieve additional paragraphs using the last generated CoT sentence as the query

Model or implementation: BM25 (Elasticsearch)

QA Reader

Read all collected paragraphs and generate the final answer

Model or implementation: GPT-3 (code-davinci-002) or Flan-T5

Novel Architectural Elements

Interleaved topology: The output of the Reasoner (CoT sentence) becomes the input for the Retriever, and the output of the Retriever (paragraphs) becomes the input for the Reasoner in the next step

Modeling

Base Model: OpenAI GPT-3 (code-davinci-002) and Flan-T5 (XL/XXL)

Key Hyperparameters:

max_reasoning_steps: 8
max_collected_paragraphs: 15
K (paragraphs per step): {2, 4, 6, 8}
M (distractor paragraphs in prompts): {1, 2, 3}
context_limit_gpt3: 8000 tokens
context_limit_flan_t5: 6000 tokens
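
Assuming K and M are tuned by searching the two sets above as a grid (the selection procedure is not spelled out in this summary), the candidate configurations can be enumerated as in the sketch below. `build_grid`, `pick_best`, and the `dev_score` argument are hypothetical names introduced for illustration.

```python
from itertools import product
from typing import Callable, Dict, List

K_VALUES = (2, 4, 6, 8)  # paragraphs retrieved per step
M_VALUES = (1, 2, 3)     # distractor paragraphs in prompts

def build_grid() -> List[Dict[str, int]]:
    """Enumerate every (K, M) combination from the two sets."""
    return [{"K": k, "M": m} for k, m in product(K_VALUES, M_VALUES)]

def pick_best(
    grid: List[Dict[str, int]],
    dev_score: Callable[[Dict[str, int]], float],
) -> Dict[str, int]:
    """Return the configuration with the highest score under dev_score.

    dev_score is a placeholder for whatever development-set metric is used.
    """
    return max(grid, key=dev_score)
```

With four values of K and three of M, the grid holds 12 configurations.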

Compute: Flan-T5 context length is capped (6000 tokens) so that inference fits on 80 GB A100 GPUs

Comparison to Prior Work

vs. Self-Ask: IRCoT uses CoT sentences directly as queries rather than explicit sub-question decomposition
vs. DecomP: IRCoT focuses specifically on CoT-guided retrieval loops rather than general task decomposition
vs. ReAct: IRCoT works effectively on smaller models (3B) without fine-tuning, whereas ReAct relies on PaLM-540B or fine-tuning
vs. DSP (Demonstrate-Search-Predict) [not cited in paper]: DSP freezes the pipeline structure, whereas IRCoT dynamically unrolls the retrieval/reasoning loop based on the question needs

Limitations

Relies on the availability of a high-quality base retriever (BM25)
Inference cost increases linearly with the number of reasoning steps due to multiple retrieval and generation calls
Performance depends on the quality of few-shot demonstrations provided in the prompt
Current implementation uses exact lexical matching (BM25) which may miss semantic matches compared to dense retrieval
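
The lexical-matching limitation can be seen in a small BM25 scorer. This is a sketch of a common Okapi BM25 formulation, not the Elasticsearch code used in the paper; the `k1` and `b` values are widely used defaults assumed here, and the two documents are toy data.

```python
import math
import re
from collections import Counter
from typing import List

def tokenize(text: str) -> List[str]:
    return re.findall(r"\w+", text.lower())

def bm25_scores(query: str, docs: List[str], k1: float = 1.2, b: float = 0.75) -> List[float]:
    """Score each document against the query with Okapi BM25."""
    if not docs:
        return []
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    query_terms = set(tokenize(query))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue  # only exact token matches contribute
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores

docs = [
    "Mack Rides is a company from Germany.",
    "The firm that built the coaster is German.",
]
scores = bm25_scores("Germany", docs)
```

Here the second document scores zero for the query "Germany" even though it is on topic, because "German" is a different token; a dense retriever could in principle bridge that gap.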

Reproducibility

Code: https://github.com/stonybrooknlp/ircot

Code, data, and prompts publicly available at https://github.com/stonybrooknlp/ircot. Uses public datasets (HotpotQA, 2WikiMultihopQA, MuSiQue, IIRC).

📊 Experiments & Results

Evaluation Setup

Few-shot Open-Domain QA using Wikipedia corpus

Benchmarks:

HotpotQA (multi-hop QA: bridge and comparison questions)
2WikiMultihopQA (multi-hop QA)
MuSiQue (multi-hop QA: connected reasoning)
IIRC (reading comprehension with external retrieval)

Metrics:

Answer F1
Retrieval Recall (of gold paragraphs)
Statistical methodology: Reported mean and standard deviation across 3 demonstration sets
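
As a rough illustration of the two metrics, the sketch below computes a token-level answer F1 and the recall of gold paragraphs. It is a simplified version under stated assumptions: tokens are split on whitespace after lowercasing, whereas official evaluation scripts typically also strip punctuation and articles, and the function names are hypothetical.

```python
from collections import Counter
from typing import Iterable

def answer_f1(prediction: str, gold: str) -> float:
    """Token-level F1 between a predicted answer and the gold answer."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

def retrieval_recall(retrieved: Iterable[str], gold: Iterable[str]) -> float:
    """Fraction of gold paragraph ids that appear among the retrieved ids."""
    gold_set = set(gold)
    if not gold_set:
        return 0.0
    return len(gold_set & set(retrieved)) / len(gold_set)
```

For example, predicting "in Germany" against the gold answer "Germany" gives precision 1/2 and recall 1, so F1 is 2/3; retrieving one of two gold paragraphs gives a recall of 0.5.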

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
IRCoT significantly improves retrieval recall compared to one-step retrieval across all four datasets.
HotpotQA	Recall	40.9	62.3	+21.4
2WikiMultihopQA	Recall	39.4	51.8	+12.4
IRCoT leads to substantial gains in downstream QA F1 scores compared to one-step retrieval baselines.
HotpotQA	F1	46.3	59.2	+12.9
MuSiQue	F1	24.2	39.5	+15.3
Smaller models using IRCoT can outperform much larger models using standard retrieval.
HotpotQA	F1	46.3	49.6	+3.3

Experiment Figures

Recall performance of IRCoT vs One-step retrieval across varying paragraph budgets (K)

Main Takeaways

Interleaving retrieval and reasoning consistently outperforms one-step retrieval for multi-hop questions across diverse datasets
The approach is effective for both very large models (GPT-3) and smaller models (Flan-T5-XL/XXL); a smaller model with IRCoT can outperform a much larger model that uses one-step retrieval
Gains are robust in Out-of-Distribution (OOD) settings where prompts come from a different dataset than the test questions
Qualitative analysis shows IRCoT reduces hallucination errors in reasoning chains by anchoring steps in retrieved facts

📚 Prerequisite Knowledge

Prerequisites

Chain-of-Thought (CoT) prompting
Retrieval-Augmented Generation (RAG)
BM25 retrieval algorithm

Key Terms

CoT: Chain-of-Thought—a prompting technique where the model generates intermediate reasoning steps before the final answer

IRCoT: Interleaved Retrieval guided by Chain-of-Thought—the paper's proposed method of alternating retrieval and reasoning

BM25: A lexical ranking function used by search engines to estimate the relevance of a document to a query, based on term frequency, inverse document frequency, and document-length normalization

OOD: Out-of-Distribution—testing the model on data that is different from the examples provided in the prompt/training

Recall: A metric measuring the proportion of relevant documents (gold paragraphs) successfully found by the retriever

OneR: One-step Retriever—a baseline method that retrieves documents once using only the original question

ODQA: Open-Domain Question Answering—answering questions using a large external collection of documents rather than a specific provided text

F1 score: The harmonic mean of precision and recall, computed here on the text overlap between the predicted answer and the ground-truth answer