REAP: Enhancing RAG with Recursive Evaluation and Adaptive Planning for Multi-Hop Question Answering

📝 Paper Summary

Agentic RAG pipeline

REAP separates reasoning into a Sub-task Planner that dynamically updates a global plan and a Fact Extractor that retrieves and validates evidence, enabling error recovery in complex multi-hop queries.

Core Problem

Existing iterative RAG methods for multi-hop questions often get stuck in local reasoning impasses or fail to exploit latent clues because they lack global planning and dynamic error recovery.

Why it matters:

Incremental decomposition of complex queries is brittle; one failed step can derail the entire reasoning chain without a mechanism to recover
Current models often extract direct answers while ignoring latent clues necessary for subsequent steps, leading to incomplete evidence
Search-based methods like MCTS offer planning but suffer from high computational overhead, making them inefficient for real-time applications

Concrete Example: If a system needs to find 'the director of the movie starring X', it might first search for 'movies starring X'. If the retrieval returns a list but misses the specific movie intended by the context, a standard chain-of-thought method fails. REAP's planner would detect the insufficient fact, diagnose the failure, and trigger a 'Re-Planner' module to reformulate the search query or prune the invalid branch.

Key Novelty

Recursive Evaluation and Adaptive Planning (REAP)

Explicitly decouples 'Planning' (Sub-task Planner) from 'Execution' (Fact Extractor) into two distinct modules that operate in a recursive loop
Introduces a 'Re-Planner' sub-module that activates only when reasoning fails, performing pragmatic sufficiency checks (is this partial info enough?) or scoped plan repair (rewriting queries/pruning branches)
Uses a unified multi-task fine-tuning paradigm to transfer knowledge from data-rich routine planning tasks to data-scarce critical replanning scenarios

Architecture

The REAP framework architecture, illustrating the recursive loop between the Sub-task Planner (SP) and Fact Extractor (FE).

Evaluation Highlights

Outperforms the state-of-the-art method R1-Searcher by 4.6 F1 points on HotpotQA and 10.2 F1 points on 2WikiMultihopQA
Achieves superior generalization on out-of-domain datasets (MuSiQue, Bamboogle) despite being trained only on HotpotQA and 2WikiMultihopQA
Surpasses Fine-Tuned Standard RAG by 6.8 F1 points on HotpotQA, indicating that the gains stem from the iterative architecture rather than from the training data alone

Breakthrough Assessment

8/10

Strong performance gains and a logically sound architecture that addresses the brittleness of static chains-of-thought via explicit replanning. The unified training strategy for scarce failure cases is a smart, practical contribution.

⚙️ Technical Details

Problem Definition

Setting: Iterative Retrieval-Augmented Generation for Multi-hop Question Answering

Inputs: A complex query Q and an external corpus C

Outputs: A factual answer A grounded in retrieved evidence

Pipeline Flow

Initialization: Decomposer creates initial plan P_0
Iterative Loop: SP (Plan Updater/Re-Planner) → Action Selection → FE (Retrieval + Extraction) → Facts Update → SP
Termination: Synthesizer generates final answer
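
The flow above can be sketched as a small control loop. This is a minimal sketch under stated assumptions, not the authors' implementation: the callable names (`decompose`, `plan_step`, `extract`, `synthesize`) and the convention that the planner returns `None` when no action remains are hypothetical. Only the stage order and the iteration cap of 5 come from this summary.

```python
from typing import Callable, List, Optional


def run_reap_loop(
    query: str,
    decompose: Callable[[str], list],
    plan_step: Callable[[list, list], Optional[str]],
    extract: Callable[[str], list],
    synthesize: Callable[[str, list], str],
    max_iterations: int = 5,
) -> str:
    """Hypothetical driver: Decomposer -> (SP -> FE) loop -> Synthesizer."""
    plan = decompose(query)  # initial plan P_0
    facts: List = []         # facts accumulated across iterations
    for _ in range(max_iterations):
        # SP: update or repair the plan, then select the next action.
        action = plan_step(plan, facts)
        if action is None:   # assumed stop signal: nothing left to do
            break
        # FE: retrieve and extract facts for the selected action.
        facts.extend(extract(action))
    return synthesize(query, facts)


# Toy stand-ins for the four fine-tuned modules (illustration only).
def _toy_decompose(q: str) -> list:
    return ["find the movie starring X", "find the director of that movie"]


def _toy_plan_step(plan: list, facts: list) -> Optional[str]:
    return plan[len(facts)] if len(facts) < len(plan) else None


def _toy_extract(action: str) -> list:
    return [f"fact for: {action}"]


def _toy_synthesize(q: str, facts: list) -> str:
    return f"{len(facts)} facts collected"


answer = run_reap_loop(
    "Who directed the movie starring X?",
    _toy_decompose, _toy_plan_step, _toy_extract, _toy_synthesize,
)
capped = run_reap_loop(
    "Who directed the movie starring X?",
    _toy_decompose, _toy_plan_step, _toy_extract, _toy_synthesize,
    max_iterations=1,
)
```

Passing the modules in as callables keeps the loop independent of any particular model backend; in the described system each of them would be a call to the fine-tuned Llama-3.1-8B-Instruct model.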

System Modules

Decomposer (Planning)

Generates the initial structured task plan (list of sub-tasks with dependencies) from the query

Model or implementation: Llama-3.1-8B-Instruct (fine-tuned)

Sub-task Planner (SP) (Planning)

Maintains global state; dispatches to 'Plan Updater' (routine) or 'Re-Planner' (failure) based on the fulfillment level reported with each extracted fact

Model or implementation: Llama-3.1-8B-Instruct (fine-tuned)

Fact Extractor (FE)

Retrieves documents and extracts structured facts (statement, evidence, reasoning chain)

Model or implementation: Llama-3.1-8B-Instruct (fine-tuned)

Synthesizer

Generates the final answer based on the accumulated facts

Model or implementation: Llama-3.1-8B-Instruct (fine-tuned)
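
The plan that the Decomposer produces and the Sub-task Planner maintains is described as a list of sub-tasks with dependencies. One way to represent it is shown below. The field names (`task_id`, `question`, `depends_on`, `done`) and the helper `ready_subtasks` are assumptions made for illustration; the summary does not specify the actual schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SubTask:
    # Hypothetical schema: the paper summary only says "sub-tasks with dependencies".
    task_id: str
    question: str
    depends_on: List[str] = field(default_factory=list)
    done: bool = False


def ready_subtasks(plan: List[SubTask]) -> List[SubTask]:
    """Return pending sub-tasks whose dependencies have all been completed."""
    done_ids = {t.task_id for t in plan if t.done}
    return [
        t for t in plan
        if not t.done and all(dep in done_ids for dep in t.depends_on)
    ]


plan = [
    SubTask("t1", "Which movie stars X?"),
    SubTask("t2", "Who directed the movie found in t1?", depends_on=["t1"]),
]
first = [t.task_id for t in ready_subtasks(plan)]
plan[0].done = True  # pretend the Fact Extractor fulfilled t1
second = [t.task_id for t in ready_subtasks(plan)]
```

Here `t2` only becomes eligible once `t1` is marked done, which is the dependency behavior a planner needs before it can select the next action.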

Novel Architectural Elements

Recursive dual-module loop: Explicitly separating the global planning state (SP) from the local execution state (FE)
Conditional dispatch architecture within SP: Routing flow to lightweight 'Plan Updater' for success cases vs. heavy 'Re-Planner' for failure cases
Structured Fact Object: FE outputs a tuple (statement, evidence, reasoning, fulfillment_level) rather than just text, directly driving the SP's control logic
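
The structured fact object and the conditional dispatch can be illustrated together. The four tuple fields come from the description above; the concrete level names (`FULL`, `PARTIAL`, `NONE`) and the `route` function are hypothetical, since the summary states only that a fulfillment level exists and that failures go to the Re-Planner.

```python
from dataclasses import dataclass
from enum import Enum


class Fulfillment(Enum):
    # Assumed level names; the summary does not enumerate them.
    FULL = "full"
    PARTIAL = "partial"
    NONE = "none"


@dataclass(frozen=True)
class Fact:
    statement: str
    evidence: str
    reasoning: str
    fulfillment_level: Fulfillment


def route(fact: Fact) -> str:
    """Pick the SP sub-module: routine update on success, replanning otherwise."""
    if fact.fulfillment_level is Fulfillment.FULL:
        return "plan_updater"
    # Partial or missing evidence goes to the Re-Planner, which then decides
    # whether the partial information suffices or the plan needs repair.
    return "re_planner"


ok = Fact("Movie M stars X.", "doc 12, para 3", "direct mention", Fulfillment.FULL)
gap = Fact("X appeared in several movies.", "doc 7", "list only", Fulfillment.PARTIAL)
routes = [route(ok), route(gap)]
```

Because the fulfillment level is a typed field rather than free text, the control decision is a plain comparison and does not require parsing the model's prose.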

Modeling

Base Model: Llama-3.1-8B-Instruct

Training Method: Multi-task Supervised Fine-Tuning (SFT)

Objective Functions:

Purpose: Minimize a joint loss across the decomposition, plan-update, and replanning tasks so that knowledge transfers from the data-rich routine tasks to the data-scarce replanning task.

Formally: L_multi = Σ λ_task * L_task, where tasks are decomp, update, replan
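
Written in LaTeX, the same objective reads as follows. The index set symbol $\mathcal{T}$ is introduced here for notation only; the task names are those given in the line above.

```latex
\mathcal{L}_{\text{multi}}
  = \sum_{t \in \mathcal{T}} \lambda_{t}\, \mathcal{L}_{t},
\qquad
\mathcal{T} = \{\text{decomp},\ \text{update},\ \text{replan}\}
```

With every weight set to 1, as in the lambda_task hyperparameter listed in this summary, the objective reduces to the unweighted sum $\mathcal{L}_{\text{decomp}} + \mathcal{L}_{\text{update}} + \mathcal{L}_{\text{replan}}$.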

Training Data:

5,556 total samples (2,988 HotpotQA, 2,568 2WikiMultihopQA)
Data were collected by running the REAP pipeline with GPT-4 to generate silver labels

Key Hyperparameters:

retrieval_top_k: 5
max_iterations: 5
lambda_task: 1

Compute: Inference uses Llama-3.1-8B-Instruct; Training data generation used GPT-4

Comparison to Prior Work

vs. IRCoT: REAP separates planning from execution, allowing explicit plan repair (pruning/rewriting) rather than just linear generation
vs. Search-R1: REAP explicitly models 'facts' as structured objects with fulfillment levels, rather than just context for generation
vs. Plan-and-Solve [not cited in paper]: REAP adds a feedback loop where execution results (FE) actively modify the remaining plan, whereas Plan-and-Solve typically executes a static plan

Limitations

Heavy reliance on the capability of the underlying LLM to strictly follow complex JSON structures for facts and plans
Latency may be higher than single-step RAG due to recursive loops (up to 5 iterations)
Performance depends on the quality of the initial retrieval; if the corpus lacks the answer entirely, replanning loops might waste compute

Reproducibility

Code: https://github.com/Deus-Glen/REAP

Code is publicly available at the repository listed above. The training data generation process (using GPT-4) is described. The experiments use standard corpora (CoRAG/KILT Wikipedia) and retrievers (e5-large-v2).

📊 Experiments & Results

Evaluation Setup

Multi-hop Question Answering on in-domain and out-of-domain datasets

Benchmarks:

HotpotQA (Multi-hop reasoning QA)
2WikiMultihopQA (Multi-hop reasoning QA)
MuSiQue (Multi-hop reasoning QA, out-of-domain)
Bamboogle (Multi-hop reasoning QA, out-of-domain)

Metrics:

F1 score
Cover Exact Match (CEM)
ACC† (LLM-judge accuracy)
Statistical methodology: Not explicitly reported in the paper
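
For readers unfamiliar with the first two metrics, the sketch below shows how token-level F1 and Cover Exact Match are commonly computed for QA. It assumes SQuAD-style answer normalization and takes Cover EM to mean "the normalized gold answer appears inside the normalized prediction"; the paper's exact evaluation code may differ, and the function names are illustrative.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def f1_score(prediction: str, gold: str) -> float:
    """Harmonic mean of token precision and recall against one gold answer."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def cover_exact_match(prediction: str, gold: str) -> bool:
    """True if the gold answer is contained in the prediction (assumed CEM definition).

    Uses plain substring matching, so it can also match inside longer words.
    """
    gold_norm = normalize(gold)
    if not gold_norm:  # an empty gold answer never counts as covered
        return False
    return gold_norm in normalize(prediction)


pred = "The director is Christopher Nolan."
f1 = f1_score(pred, "Christopher Nolan")
cem = cover_exact_match(pred, "Christopher Nolan")
```

In this example the prediction contains two extra tokens, so precision is 0.5 and recall is 1.0, while Cover EM is satisfied because the gold answer appears verbatim. This is why CEM is more forgiving of verbose answers than F1.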

Key Results

REAP significantly outperforms standard and iterative RAG baselines on in-domain datasets.

Benchmark	Metric	Baseline	This Paper	Δ
HotpotQA	F1	63.4	68.0	+4.6
HotpotQA	F1	61.2	68.0	+6.8
2WikiMultihopQA	F1	51.8	62.0	+10.2

REAP demonstrates strong generalization on out-of-domain datasets not seen during training.

Benchmark	Metric	Baseline	This Paper	Δ
MuSiQue	F1	26.3	44.6	+18.3
Bamboogle	ACC†	59.2	73.6	+14.4
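
The Δ column is an absolute difference in score points (This Paper minus Baseline), not a relative percentage. A quick check over the rows above, with the numbers copied from the table and a small tolerance for floating-point rounding:

```python
# (benchmark, metric, baseline, this_paper, reported_delta) copied from the table above.
rows = [
    ("HotpotQA", "F1", 63.4, 68.0, 4.6),
    ("HotpotQA", "F1", 61.2, 68.0, 6.8),
    ("2WikiMultihopQA", "F1", 51.8, 62.0, 10.2),
    ("MuSiQue", "F1", 26.3, 44.6, 18.3),
    ("Bamboogle", "ACC†", 59.2, 73.6, 14.4),
]


def delta_matches(baseline: float, ours: float, reported: float, tol: float = 0.05) -> bool:
    """True if the reported delta equals the absolute point difference."""
    return abs((ours - baseline) - reported) <= tol


all_consistent = all(delta_matches(b, o, d) for _, _, b, o, d in rows)

# For contrast: the relative improvement on MuSiQue is far larger than 18.3.
musique_relative_pct = (44.6 - 26.3) / 26.3 * 100
```

Every row passes the absolute-difference check, whereas the relative gain on MuSiQue would be roughly 70%, so the table's deltas should be read as points.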

Main Takeaways

Iterative interaction with explicit planning (REAP) yields substantial gains over single-step RAG (up to +20 F1 points compared to Standard RAG)
The unified task paradigm effectively transfers planning capabilities: training on HotpotQA/2Wiki enables strong performance on MuSiQue, suggesting the model learns 'how to plan' rather than memorizing dataset patterns
REAP consistently outperforms other iterative methods (IRCoT, Iter-RetGen) and recent RL-based methods (Search-R1), validating the dual-module (Planner + Extractor) design

📚 Prerequisite Knowledge

Prerequisites

Understanding of Retrieval-Augmented Generation (RAG)
Familiarity with Chain-of-Thought (CoT) reasoning
Basic knowledge of multi-task fine-tuning

Key Terms

MHQA: Multi-Hop Question Answering—Tasks requiring the integration of information scattered across multiple documents to answer a single complex query

RAG: Retrieval-Augmented Generation—AI systems that answer questions by first searching for relevant documents

Sub-task Planner (SP): The strategic module in REAP that maintains the global task plan, updates dependencies, and initiates replanning if steps fail

Fact Extractor (FE): The execution module in REAP that retrieves documents and extracts structured facts (statement + evidence + reasoning) from them

Re-Planner: A specialized sub-module of the SP invoked during failures to assess if partial info is sufficient or if the plan needs structural repair

F1 score: A metric balancing precision and recall, measuring the overlap between the predicted answer and the ground truth

FlashRAG: A Python toolkit for efficient RAG research, used here as the evaluation framework

CoRAG: A dataset/corpus based on English Wikipedia used for retrieval in these experiments

MCTS: Monte Carlo Tree Search—a heuristic search algorithm for decision processes, often used in complex reasoning planning

ACC†: Accuracy measured with an LLM serving as the judge