Equipping Retrieval-Augmented Large Language Models with Document Structure Awareness

📝 Paper Summary

Modularized RAG pipeline

RDR2 improves RAG by using an LLM router to actively navigate document headings and sections like a human reader, rather than treating documents as flat lists of isolated chunks.

Core Problem

Standard RAG systems treat retrieved passages as isolated chunks, discarding the original document structure (headings, hierarchy) that helps humans navigate and synthesize complex information.

Why it matters:

Losing structural context forces models to implicitly reconstruct relationships that were explicitly present in the source, harming multi-hop reasoning
Fixed chunking strategies restrict query-adaptive content selection, often missing relevant details buried in related sections
Flat retrieval paradigms struggle with 'factual-inductive' queries that require synthesizing multiple fragments scattered across a document

Concrete Example: When answering a complex question about a specific entity, standard RAG might retrieve three disjoint paragraphs. RDR2 instead locates the relevant heading in the document tree, then decides to 'expand' that section to read adjacent context, effectively re-assembling the complete evidence.

Key Novelty

Retrieve-DocumentRoute-Read (RDR2)

Formulates document reading as a dynamic routing task over a Document Structure Tree (DST), where an agent iteratively decides to Answer, Expand (unfold headings), or Refuse content
Introduces a method to automatically curate training data for this routing policy using only questions and documents (no answer supervision required), enabling the router to learn human-like browsing strategies

Architecture

The 3-stage RDR2 pipeline: Retrieve, Document Route, and Read. In the iterative routing stage, an LLM selects actions ([ANS], [EXP], [REF]) on a tree structure.
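The iterative routing stage can be sketched as a small loop. This is a minimal illustration under stated assumptions, not the paper's implementation: the dictionary-based tree, the `route` function, the `max_iters` budget, and `keyword_policy` (a toy keyword matcher standing in for the fine-tuned router LLM) are all hypothetical. Only the three action tokens come from the source, and their semantics here (keep a node, unfold it, discard it) are assumed.

```python
from typing import Callable, Dict, List

ANS, EXP, REF = "[ANS]", "[EXP]", "[REF]"
Tree = Dict[str, dict]  # node id -> {"kind", "text", "children"}


def _tokens(text: str) -> set:
    """Lowercased words longer than 3 characters, punctuation stripped."""
    words = (w.strip(".,?!").lower() for w in text.split())
    return {w for w in words if len(w) > 3}


def keyword_policy(question: str, node: dict) -> str:
    """Toy stand-in for the router LLM (assumption): act on keyword overlap."""
    if not (_tokens(question) & _tokens(node["text"])):
        return REF
    return EXP if node["kind"] == "structure" else ANS


def route(question: str, tree: Tree, start: List[str],
          policy: Callable[[str, dict], str], max_iters: int = 3) -> List[str]:
    """Refine a frontier of node ids; return the text of nodes marked [ANS]."""
    frontier, selected = list(start), []
    for _ in range(max_iters):
        next_frontier = []
        for node_id in frontier:
            node = tree[node_id]
            action = policy(question, node)
            if action == ANS:
                selected.append(node["text"])           # keep as evidence
            elif action == EXP:
                next_frontier.extend(node["children"])  # unfold the node
            # [REF]: drop the node and everything beneath it
        if not next_frontier:
            break
        frontier = next_frontier
    return selected


TREE: Tree = {
    "root": {"kind": "structure", "text": "Coffee machine manual", "children": ["h1", "h2"]},
    "h1": {"kind": "structure", "text": "Cleaning", "children": ["p1", "p2"]},
    "h2": {"kind": "structure", "text": "Brewing", "children": ["p3"]},
    "p1": {"kind": "content", "text": "Descale the machine every month.", "children": []},
    "p2": {"kind": "content", "text": "Use only soft cloths.", "children": []},
    "p3": {"kind": "content", "text": "Brew at 93 degrees.", "children": []},
}
QUESTION = "How often should I descale my coffee machine during cleaning?"
print(route(QUESTION, TREE, ["root"], keyword_policy))
# ['Descale the machine every month.']
```

With `max_iters=2` the same call returns an empty list, because the loop stops before it reaches the passages. In this sketch, a larger iteration budget is what lets the router reach deeper content.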

Evaluation Highlights

Achieves state-of-the-art results on ASQA (+1.5 EM) and QAMPARI (+3.0 F1-5) using only off-the-shelf retrievers and readers
Outperforms methods built on proprietary models (such as ASC using ChatGPT) while generating answers that are ~50% shorter
Demonstrates effective test-time scaling: increasing expansion iterations consistently improves passage utility and answer quality without retraining

Breakthrough Assessment

8/10

Strong conceptual novelty in treating documents as trees rather than flat chunks. Achieves SOTA on difficult benchmarks with a lightweight, efficiently trained router, showing excellent generalization.

⚙️ Technical Details

Problem Definition

Setting: Open-domain Question Answering (QA) using a datastore of documents with hierarchical structure

Inputs: Natural language question q and a datastore D

Outputs: Answer a generated based on routed passages

Pipeline Flow

Retriever (fetches top-k chunks & source docs)
Document Router (iteratively navigates document trees)
Reader (generates final answer)
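The three stages compose as plain function calls. The sketch below only shows that wiring. The `answer` function, the `Chunk` type, and the three `toy_*` stand-ins are hypothetical names introduced for illustration, and the stand-ins do no real retrieval, routing, or generation.

```python
from typing import Callable, List, Tuple

Chunk = Tuple[str, str]  # (source document id, chunk text)


def answer(question: str,
           retrieve: Callable[[str, int], List[Chunk]],
           route: Callable[[str, List[str], List[Chunk]], List[str]],
           read: Callable[[str, List[str]], str],
           k: int = 5) -> str:
    """Retrieve top-k chunks, route over their source documents, then read."""
    chunks = retrieve(question, k)
    # Deduplicate source documents while keeping retrieval order.
    doc_ids = list(dict.fromkeys(doc_id for doc_id, _ in chunks))
    passages = route(question, doc_ids, chunks)
    return read(question, passages)


def toy_retrieve(question: str, k: int) -> List[Chunk]:
    corpus = [("doc_a", "Alpha section text."),
              ("doc_b", "Beta section text."),
              ("doc_a", "Alpha appendix text.")]
    return corpus[:k]


def toy_route(question: str, doc_ids: List[str], chunks: List[Chunk]) -> List[str]:
    # Keep only chunks from the top-ranked document (illustrative rule).
    return [text for doc_id, text in chunks if doc_id == doc_ids[0]]


def toy_read(question: str, passages: List[str]) -> str:
    return f"{len(passages)} passages used"


print(answer("example question", toy_retrieve, toy_route, toy_read, k=3))
# 2 passages used
```

Keeping the stages behind plain callables reflects the modular design: in this sketch, any one stage can be swapped without touching the other two.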

System Modules

Retriever

Fetch initial relevant chunks and identify their source documents

Model or implementation: Contriever-MS MARCO (off-the-shelf)
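A dense retriever of this kind ranks passages by comparing embedding vectors. The sketch below assumes inner-product scoring over precomputed vectors. The `top_k` helper and the hand-written 2-d vectors are illustrative only; a real system would obtain its embeddings from the encoder model.

```python
from typing import Dict, List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))


def top_k(query_vec: Sequence[float],
          index: Dict[str, Sequence[float]], k: int) -> List[str]:
    """Return the ids of the k chunks with the highest inner-product score."""
    ranked = sorted(index, key=lambda cid: dot(query_vec, index[cid]), reverse=True)
    return ranked[:k]


# Toy index: chunk id -> embedding (made-up numbers).
INDEX = {"chunk_a": [1.0, 0.0], "chunk_b": [0.0, 1.0], "chunk_c": [0.7, 0.7]}
print(top_k([1.0, 0.1], INDEX, k=2))
# ['chunk_a', 'chunk_c']
```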

Document Router

Navigate document structure trees to select optimal passages

Model or implementation: Llama-3.1-8B-Instruct (fine-tuned with LoRA)

Reader

Synthesize the final answer from the routed passages

Model or implementation: Llama-2-13B-Chat or Llama-3.1-8B-Instruct (off-the-shelf)

Novel Architectural Elements

Iterative Document Routing loop: An explicit feedback loop in which an agent navigates a Document Structure Tree (DST) and dynamically updates the retrieved context, the Retrieval SubTree (RST), before reading
Separation of 'Structure Nodes' (headings) and 'Content Nodes' (text) in the retrieval context
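The split between structure nodes and content nodes maps naturally onto a small tree type. The sketch below is one possible representation, not the paper's data format: the `Node` class, the `build_dst` helper, and the `(level, text)` input convention (level 0 for a passage, 1 and up for heading depth) are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    """One tree node: a heading ("structure") or a passage ("content")."""
    kind: str
    text: str
    level: int = 0  # heading depth; left at 0 for content nodes and the root
    children: List["Node"] = field(default_factory=list)
    parent: Optional["Node"] = field(default=None, repr=False, compare=False)


def build_dst(title: str, items: List[Tuple[int, str]]) -> Node:
    """Build a tree from (level, text) pairs in document order."""
    root = Node("structure", title)
    stack = [root]  # path from the root to the most recent heading
    for level, text in items:
        if level == 0:
            node = Node("content", text)
        else:
            # Pop back to the nearest heading shallower than this one.
            while len(stack) > 1 and stack[-1].level >= level:
                stack.pop()
            node = Node("structure", text, level=level)
        node.parent = stack[-1]
        stack[-1].children.append(node)
        if node.kind == "structure":
            stack.append(node)
    return root


ITEMS = [
    (0, "Intro paragraph."),
    (1, "History"),
    (0, "History paragraph."),
    (2, "Early years"),
    (0, "Early years paragraph."),
    (1, "Legacy"),
    (0, "Legacy paragraph."),
]
root = build_dst("Example document", ITEMS)
print([child.text for child in root.children])
# ['Intro paragraph.', 'History', 'Legacy']
```

Keeping a `parent` link on each node makes it cheap to recover the heading path above any retrieved passage.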

Modeling

Base Model: Llama-3.1-8B-Instruct (for the Router)

Training Method: Supervised Fine-Tuning (SFT) on curated routing trajectories

Objective Functions:

Purpose: Learn to predict the next routing action token.

Formally: Standard cross-entropy loss on target tokens.
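As a worked illustration of cross-entropy restricted to target tokens, the sketch below averages the negative log-likelihood over positions flagged by a mask. The function name, the plain-list inputs, and the masking convention (1 for target positions, 0 for prompt positions) are assumptions; the summary does not describe the training code.

```python
import math
from typing import List, Sequence


def masked_cross_entropy(logits: List[Sequence[float]],
                         targets: Sequence[int],
                         mask: Sequence[int]) -> float:
    """Mean negative log-likelihood over positions where mask is 1."""
    total, count = 0.0, 0
    for row, target, keep in zip(logits, targets, mask):
        if not keep:
            continue  # prompt position: contributes nothing to the loss
        # Numerically stable log-sum-exp over the vocabulary.
        m = max(row)
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_z - row[target]
        count += 1
    return total / count if count else 0.0


# Two positions, vocabulary of size 2; only the second is a target.
loss = masked_cross_entropy([[5.0, -5.0], [0.0, 0.0]], targets=[1, 1], mask=[0, 1])
print(round(loss, 4))
# 0.6931
```

The first position would have a large loss (the model strongly prefers the wrong token), but the mask excludes it, so the result is log 2 from the uniform second position alone.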

Adaptation: LoRA (Low-Rank Adaptation)

Training Data:

Curated from ASQA training set questions using Deepseek-v3 to simulate oracle routing paths
23,827 training samples

Key Hyperparameters:

epochs: 3.5
learning_rate: Not reported in the paper
batch_size: Not reported in the paper

Compute: Single NVIDIA A100-PCIE-40GB GPU
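LoRA, the adaptation method listed above, keeps the base weight W frozen and learns a low-rank update, giving an effective weight W + (alpha / r) · B·A. The sketch below shows that arithmetic in plain Python. The rank and alpha values are placeholders, since the summary notes they are not reported, and the 4096 x 4096 layer size is only an example.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python matrix product (rows of a times columns of b)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_weight(w: Matrix, a: Matrix, b: Matrix, alpha: float) -> Matrix:
    """Effective weight W + (alpha / r) * B @ A; A is r x d_in, B is d_out x r."""
    r = len(a)
    delta = matmul(b, a)
    scale = alpha / r
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


def trainable_params(d_out: int, d_in: int, r: int) -> int:
    """Number of parameters in the two low-rank factors."""
    return r * (d_in + d_out)


W = [[1.0, 0.0], [0.0, 1.0]]  # frozen base weight
A = [[1.0, 2.0]]              # rank r = 1 (placeholder value)
B_ZERO = [[0.0], [0.0]]       # B starts at zero, so the update starts at zero
print(lora_weight(W, A, B_ZERO, alpha=2.0) == W)
# True
print(trainable_params(4096, 4096, r=8), "vs", 4096 * 4096)
# 65536 vs 16777216
```

The parameter count shows why a LoRA-trained router can fit on a single GPU: only the small factors receive gradients.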

Comparison to Prior Work

vs. SELF-RAG: RDR2 navigates explicit document structure (headings) rather than filtering flat chunks
vs. GraphRAG: RDR2 operates on the native document tree structure online, rather than pre-computing a separate knowledge graph [not cited in paper]
vs. RAPTOR: RDR2 dynamically routes through the tree at inference time, whereas RAPTOR relies on pre-computed hierarchical embeddings

Limitations

Relies on the availability of document structure (headings); performance on unstructured text is unclear
Limited gains on open-ended tasks like ELI5 compared to factoid tasks
Router training requires a capable teacher model (Deepseek-v3 used) to curate trajectories

Reproducibility

Code: https://github.com/XuLingnan/RDR2

Code and data are publicly available at the repository above. The router was trained on ASQA questions. LoRA hyperparameters (rank, alpha) are not explicitly detailed in the main text.

📊 Experiments & Results

Evaluation Setup

Knowledge-intensive QA across 5 datasets using Wikipedia as the datastore

Benchmarks:

TriviaQA (Single-hop short-form QA)
HotpotQA (Multi-hop reasoning QA)
QAMPARI (List-style QA)
ASQA (Ambiguous long-form QA)
ELI5 (Open-ended long-form QA)

Metrics:

EM (Exact Match)
F1-5
Recall-5
Precision-5
Claim Recall

Statistical methodology: Not explicitly reported in the paper
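The list-style metrics (Precision-5, Recall-5, F1-5) can be sketched as follows. The sketch assumes a common convention for the "-5" variants in which recall saturates once five gold answers are found. That convention, the `list_qa_scores` name, and the lowercase exact-string matching are assumptions for illustration; the summary does not define the metrics at this level of detail.

```python
from typing import List, Tuple


def list_qa_scores(predicted: List[str], gold: List[str]) -> Tuple[float, float, float]:
    """Return (precision, recall-5, F1-5) for one list-style question."""
    pred = {p.strip().lower() for p in predicted if p.strip()}
    gold_set = {g.strip().lower() for g in gold if g.strip()}
    correct = len(pred & gold_set)
    precision = correct / len(pred) if pred else 0.0
    # Assumed convention: five correct answers already count as full recall.
    recall5 = min(5, correct) / min(5, len(gold_set)) if gold_set else 0.0
    denom = precision + recall5
    f1 = 2 * precision * recall5 / denom if denom else 0.0
    return precision, recall5, f1


gold = ["a", "b", "c", "d", "e", "f", "g", "h"]
p, r5, f1 = list_qa_scores(["a", "b", "c", "d", "e", "x"], gold)
print(round(p, 3), r5, round(f1, 3))
# 0.833 1.0 0.909
```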

Key Results

Main comparison against baselines showing RDR2 superiority on complex QA tasks.

Benchmark	Metric	Baseline	This Paper	Δ
ASQA	EM	47.2	48.7	+1.5
QAMPARI	F1-5	31.4	34.4	+3.0
ELI5	Claim Recall	24.6	25.1	+0.5

Ablation studies validating the importance of structural awareness and specific routing actions.

Benchmark	Metric	Baseline	This Paper	Δ
ASQA	EM (Answer)	51.1	55.1	+4.0
ASQA	EM (Passage)	64.0	68.4	+4.4

Experiment Figures

Impact of test-time scaling on performance. Left: scaling retrieval k. Right: scaling expansion iterations.

Main Takeaways

Explicit structural awareness enhances RAG performance significantly more than simple chunk filtering, especially for multi-hop or list-style questions.
The 'Expand' action is critical for uncovering relevant information that initial similarity-based retrieval might miss.
RDR2 generalizes well across different readers and retrievers without requiring retraining of those components.
The framework allows for test-time scaling: running more expansion steps consistently improves passage utility.

📚 Prerequisite Knowledge

Prerequisites

Retrieval-Augmented Generation (RAG) basics
Tree data structures (nodes, children, parents)
Language Model fine-tuning (LoRA)

Key Terms

RAG: Retrieval-Augmented Generation, an approach in which a model retrieves external documents and conditions its generated answer on them

DST: Document Structure Tree—a hierarchical representation of a document where nodes are headings (structure) or text passages (content)

RST: Retrieval SubTree—a dynamic subset of the DST used during inference to maintain focus while exploring

Router: An LLM agent trained to navigate the document tree by selecting, expanding, or rejecting nodes

LoRA: Low-Rank Adaptation—a parameter-efficient technique for fine-tuning large language models

EM: Exact Match—a metric checking if the generated answer is identical to the ground truth

F1-5: A metric for list-style QA that combines precision with Recall-5, where recall counts as complete once five gold answers are found

SFT: Supervised Fine-Tuning—training a model on labeled examples

Dense Retriever: A retrieval system using vector embeddings to find relevant text

Off-the-shelf: Using pre-trained models without further modification or fine-tuning

Greedy decoding: A generation strategy where the model always picks the most likely next token
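Greedy decoding reduces to a repeated argmax. The sketch below uses a hand-written score table as a toy stand-in for a language model. The `greedy_decode` function, the table, and the `<s>` / `</s>` markers are hypothetical; the action tokens are borrowed from the architecture description purely as example vocabulary.

```python
from typing import Callable, Dict, List

# Toy next-token scores keyed by the previous token (made-up numbers).
TABLE: Dict[str, Dict[str, float]] = {
    "<s>": {"[EXP]": 0.6, "[ANS]": 0.3, "[REF]": 0.1},
    "[EXP]": {"</s>": 0.9, "[ANS]": 0.1},
    "[ANS]": {"</s>": 1.0},
    "[REF]": {"</s>": 1.0},
}


def toy_scores(tokens: List[str]) -> Dict[str, float]:
    """Score candidates from the last token only (stand-in for a real model)."""
    return TABLE[tokens[-1]]


def greedy_decode(next_scores: Callable[[List[str]], Dict[str, float]],
                  start: str, eos: str, max_len: int = 10) -> List[str]:
    """Append the highest-scoring token until eos or max_len is reached."""
    tokens = [start]
    for _ in range(max_len):
        scores = next_scores(tokens)
        best = max(scores, key=scores.get)  # argmax over candidate tokens
        tokens.append(best)
        if best == eos:
            break
    return tokens


print(greedy_decode(toy_scores, "<s>", "</s>"))
# ['<s>', '[EXP]', '</s>']
```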