TRACE the Evidence: Constructing Knowledge-Grounded Reasoning Chains for Retrieval-Augmented Generation

📝 Paper Summary

Graph-based RAG pipeline, Modularized RAG pipeline

TRACE improves multi-hop RAG by converting retrieved documents into a knowledge graph and autoregressively constructing reasoning chains of triples to identify supporting evidence.

Core Problem

Retrievers in RAG often return irrelevant documents. The resulting noise degrades performance, especially on multi-hop questions that require combining evidence across several reasoning steps.

Why it matters:

Irrelevant documents in the retrieved set can mislead the reader model, causing hallucinations or incorrect answers
Multi-hop questions require connecting dispersed pieces of evidence, which standard RAG struggles to do when evidence is buried in noisy documents
Simply prepending all retrieved documents to the prompt often results in suboptimal performance due to the 'lost-in-the-middle' phenomenon

Concrete Example: For the question 'When was the father of Albert Einstein born?', a standard RAG pipeline might retrieve documents about Einstein's physics theories. TRACE extracts the triple (Albert Einstein, father, Hermann Einstein), links it to (Hermann Einstein, date of birth, 30 August 1847), and answers from that two-triple chain.

Key Novelty

Knowledge-Grounded Reasoning Chains (TRACE)

Converts unstructured documents into a structured Knowledge Graph (KG) of triples to granularly separate relevant facts from noise
Constructs reasoning chains autoregressively: selects a triple from the KG, then selects the next triple based on the question and previous triples, mimicking human step-by-step reasoning
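The autoregressive step can be illustrated with a minimal greedy sketch. It assumes a toy word-overlap scorer in place of the ranker and LLM selector, and the `tokens`, `score`, and `build_chain` helpers as well as the toy KG are hypothetical names and data invented for illustration, not taken from the paper.

```python
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]  # (head, relation, tail)


def tokens(text: str) -> Set[str]:
    """Lowercased word set; a crude stand-in for a learned relevance model."""
    return set(text.lower().replace("?", " ").split())


def score(question: str, chain: List[Triple], candidate: Triple) -> int:
    """Word overlap between the candidate and the current context.

    The context is the question plus every triple already selected,
    which is what makes the construction autoregressive.
    """
    context = tokens(question)
    for triple in chain:
        context |= tokens(" ".join(triple))
    return len(context & tokens(" ".join(candidate)))


def build_chain(question: str, kg: List[Triple], max_len: int) -> List[Triple]:
    """Greedily append the best-scoring unused triple, up to max_len steps."""
    chain: List[Triple] = []
    remaining = list(kg)
    for _ in range(max_len):
        if not remaining:
            break
        best = max(remaining, key=lambda t: score(question, chain, t))
        if score(question, chain, best) == 0:
            break  # nothing left that relates to the context
        chain.append(best)
        remaining.remove(best)
    return chain


# Hypothetical toy KG; the people and relations are invented for illustration.
kg = [
    ("Alice Moreau", "mother", "Carol Moreau"),
    ("Carol Moreau", "birthplace", "Lyon"),
    ("Lyon", "famous for", "silk weaving"),
    ("Bob Stone", "occupation", "painter"),
]
question = "What is the birthplace of the mother of Alice Moreau?"
chain = build_chain(question, kg, max_len=2)
for triple in chain:
    print(triple)
```

The second triple only becomes the top candidate once the first is in the chain, because "Carol" enters the context at that point. This dependence on earlier selections is the property the autoregressive formulation relies on.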

Architecture

Overview of the TRACE framework, illustrating the pipeline from documents to answer.

Evaluation Highlights

+14.03 points average improvement in Exact Match (EM) over standard RAG (using all retrieved documents) across three multi-hop QA datasets
Using only the reasoning chains (KG triples) as context is often sufficient, outperforming the use of full documents by reducing noise
Adaptive chain termination improves performance over fixed-length chains (L = 2, 3, 4)

Breakthrough Assessment

7/10

Strong empirical gains on multi-hop QA by integrating KG construction into RAG. The approach effectively addresses noise in retrieval, though reliance on LLMs for KG generation might be computationally heavy.

⚙️ Technical Details

Problem Definition

Setting: Multi-hop Question Answering (QA) using a set of retrieved documents

Inputs: Multi-hop question q and a set of retrieved documents D_q = {d_1, ..., d_N}

Outputs: Answer a

Pipeline Flow

KG Generator (documents -> KG triples)
Reasoning Chain Constructor (KG -> reasoning chains)
Answer Generator (chains -> answer)
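The three stages can be wired together as in the sketch below. This is only an illustration of the data flow: `run_pipeline` and the three stub functions are hypothetical, and in the described system each stage would be backed by an LLM or a bi-encoder rather than by these stand-ins.

```python
from typing import Callable, List, Tuple

Triple = Tuple[str, str, str]
Chain = List[Triple]


def run_pipeline(
    question: str,
    documents: List[str],
    kg_generator: Callable[[str], List[Triple]],
    chain_constructor: Callable[[str, List[Triple]], List[Chain]],
    answer_generator: Callable[[str, List[Chain]], str],
) -> str:
    # Stage 1: one KG-generation call per document, independent of the question.
    kg: List[Triple] = []
    for doc in documents:
        kg.extend(kg_generator(doc))
    # Stage 2: question-specific reasoning chains over the merged KG.
    chains = chain_constructor(question, kg)
    # Stage 3: the reader sees the chains rather than the raw documents.
    return answer_generator(question, chains)


# Stub stages (assumptions for illustration only).
def fake_kg_generator(doc: str) -> List[Triple]:
    # Pretend each document is already written as "head|relation|tail".
    head, relation, tail = doc.split("|")
    return [(head, relation, tail)]


def fake_chain_constructor(question: str, kg: List[Triple]) -> List[Chain]:
    return [kg[:2]]  # a single chain made of the first two triples


def fake_answer_generator(question: str, chains: List[Chain]) -> str:
    return chains[0][-1][2]  # tail entity of the last triple


answer = run_pipeline(
    "Where was A's mother born?",
    ["A|mother|B", "B|birthplace|C"],
    fake_kg_generator,
    fake_chain_constructor,
    fake_answer_generator,
)
print(answer)
```

Passing the stages in as callables keeps the sketch modular, so any one of them can be swapped without touching the others.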

System Modules

KG Generator

Transform retrieved documents into a set of knowledge triples

Model or implementation: LLM (e.g., gpt-3.5-turbo-0125)
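A KG Generator of this kind needs a prompt and a parser for the model's reply. The sketch below assumes a hypothetical `<head; relation; tail>` output format and a hypothetical instruction text; the paper's actual prompt and format may differ. No LLM is called here: `sample_output` is a hand-written string.

```python
import re
from typing import List, Tuple

Triple = Tuple[str, str, str]

# Hypothetical instruction; the paper's actual prompt may differ.
PROMPT_TEMPLATE = (
    "Extract knowledge triples from the document below.\n"
    "Write one triple per line as <head; relation; tail>.\n\n"
    "Title: {title}\nText: {text}\nTriples:\n"
)

# Three ';'-separated fields between angle brackets.
TRIPLE_PATTERN = re.compile(r"<([^;<>]+);([^;<>]+);([^;<>]+)>")


def build_prompt(title: str, text: str) -> str:
    return PROMPT_TEMPLATE.format(title=title, text=text)


def parse_triples(llm_output: str) -> List[Triple]:
    """Keep well-formed triples; drop malformed lines and duplicates."""
    seen = set()
    triples: List[Triple] = []
    for head, relation, tail in TRIPLE_PATTERN.findall(llm_output):
        triple = (head.strip(), relation.strip(), tail.strip())
        if all(triple) and triple not in seen:
            seen.add(triple)
            triples.append(triple)
    return triples


# Hand-written stand-in for a model reply (not real model output).
sample_output = (
    "<Carol Moreau; birthplace; Lyon>\n"
    "<Carol Moreau; birthplace; Lyon>\n"
    "this line is not a triple\n"
    "<Lyon; country; France>\n"
)
triples = parse_triples(sample_output)
print(triples)
```

Filtering at parse time matters because extraction errors propagate into every later stage. A stricter parser is one inexpensive place to limit that.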

Triple Ranker (Reasoning Chain Construction)

Select top-K candidate triples relevant to the question and current chain history

Model or implementation: Bi-encoder (sentence-transformers/all-MiniLM-L6-v2)
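The ranking step can be sketched as follows. To stay self-contained, the example uses a toy bag-of-words `embed` function as a stand-in for the sentence encoder listed above. The bi-encoder pattern itself is preserved: query and triples are encoded separately, then compared by cosine similarity. The function names and the toy KG are hypothetical.

```python
import math
from collections import Counter
from typing import List, Tuple

Triple = Tuple[str, str, str]


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a real system would call a sentence encoder."""
    return Counter(text.lower().replace("?", " ").split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def rank_triples(
    question: str, chain: List[Triple], kg: List[Triple], k: int
) -> List[Triple]:
    """Return the top-k unused triples for the question plus chain history."""
    # The query encodes the question together with the triples chosen so far.
    query_vec = embed(" ".join([question] + [" ".join(t) for t in chain]))
    candidates = [t for t in kg if t not in chain]
    scored = [(cosine(query_vec, embed(" ".join(t))), t) for t in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [t for _, t in scored[:k]]


# Hypothetical toy KG, invented for illustration.
kg = [
    ("Alice Moreau", "mother", "Carol Moreau"),
    ("Carol Moreau", "birthplace", "Lyon"),
    ("Lyon", "famous for", "silk weaving"),
    ("Bob Stone", "occupation", "painter"),
]
question = "What is the birthplace of the mother of Alice Moreau?"
top = rank_triples(question, chain=[kg[0]], kg=kg, k=2)
print(top)
```

Because triple embeddings do not depend on the query, they could be computed once per KG and reused at every chain step, which is the usual efficiency argument for a bi-encoder.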

Triple Selector (Reasoning Chain Construction)

Select the single best triple from candidates to extend the chain

Model or implementation: LLM (e.g., gpt-3.5-turbo-0125)

Reader / Answer Generator

Generate the final answer based on the constructed reasoning chains

Model or implementation: LLM (e.g., gpt-3.5-turbo-0125)

Novel Architectural Elements

Autoregressive Reasoning Chain Constructor: Builds reasoning paths step-by-step from a document-derived KG using a ranker-selector loop
Adaptive Chain Termination: A mechanism within the Triple Selector allowing the model to decide when enough evidence has been gathered ('no need for additional triples')
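One plausible way to realize adaptive termination is to present the candidates as lettered options and append a final stop option. The sketch below assumes that multiple-choice framing. The helper names, the option layout, and the choice to treat an unparseable reply as "stop" are assumptions for illustration, not details confirmed by the source.

```python
import string
from typing import List, Optional, Tuple

Triple = Tuple[str, str, str]
STOP_OPTION = "no need for additional triples"


def format_options(candidates: List[Triple]) -> str:
    """Render candidates as lettered options, with the stop option last."""
    lines = [
        f"{string.ascii_uppercase[i]}. <{h}; {r}; {t}>"
        for i, (h, r, t) in enumerate(candidates)
    ]
    lines.append(f"{string.ascii_uppercase[len(candidates)]}. {STOP_OPTION}")
    return "\n".join(lines)


def interpret_choice(choice: str, candidates: List[Triple]) -> Optional[Triple]:
    """Map the selector's letter to a triple, or None when the chain should stop."""
    letter = choice.strip().upper()[:1]
    if not letter or letter not in string.ascii_uppercase:
        return None  # assumption: an unparseable reply is treated as "stop"
    index = string.ascii_uppercase.index(letter)
    if index >= len(candidates):
        return None  # the stop option (or an out-of-range letter)
    return candidates[index]


def extend_chain(
    chain: List[Triple], candidates: List[Triple], choice: str
) -> Tuple[List[Triple], bool]:
    """Return the (possibly extended) chain and whether it is finished."""
    picked = interpret_choice(choice, candidates)
    if picked is None:
        return chain, True
    return chain + [picked], False


# Hypothetical candidates, invented for illustration.
candidates = [
    ("Carol Moreau", "birthplace", "Lyon"),
    ("Lyon", "famous for", "silk weaving"),
]
print(format_options(candidates))
print(extend_chain([], candidates, "A"))
print(extend_chain([candidates[0]], candidates, "C"))
```

With the stop option in the same list as the triples, the chain length becomes a per-question decision rather than a fixed hyperparameter.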

Modeling

Base Model: gpt-3.5-turbo-0125 (for KG Generation, Triple Selection, and Reader)

Training Method: In-context learning / Zero-shot inference

Key Hyperparameters:

beam_size_b: 3
candidate_triples_K: 5
max_chain_length_L: 4
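The sketch below shows how the three hyperparameters could interact in a beam search over triples. It assumes a toy word-overlap scorer in place of the ranker and LLM selector, and sums raw overlap counts as the chain score. The scoring rule, the function names, and the toy KG are all hypothetical, and the stop option is left out for brevity.

```python
from typing import Callable, List, Set, Tuple

Triple = Tuple[str, str, str]
Chain = Tuple[Triple, ...]

BEAM_SIZE = 3         # b: chains kept after each step
NUM_CANDIDATES = 5    # K: triples shortlisted per chain
MAX_CHAIN_LENGTH = 4  # L: maximum number of triples per chain


def tokens(text: str) -> Set[str]:
    return set(text.lower().replace("?", " ").split())


def overlap_score(question: str, chain: Chain, candidate: Triple) -> float:
    """Toy scorer: word overlap between the candidate and question + chain."""
    context = tokens(question)
    for triple in chain:
        context |= tokens(" ".join(triple))
    return float(len(context & tokens(" ".join(candidate))))


def beam_search(
    question: str,
    kg: List[Triple],
    score_fn: Callable[[str, Chain, Triple], float],
    b: int = BEAM_SIZE,
    k: int = NUM_CANDIDATES,
    max_len: int = MAX_CHAIN_LENGTH,
) -> List[Tuple[float, Chain]]:
    beams: List[Tuple[float, Chain]] = [(0.0, ())]
    for _ in range(max_len):
        expanded: List[Tuple[float, Chain]] = []
        for total, chain in beams:
            # Shortlist the K best unused triples for this chain.
            scored = sorted(
                ((score_fn(question, chain, t), t) for t in kg if t not in chain),
                key=lambda pair: pair[0],
                reverse=True,
            )[:k]
            extensions = [(total + s, chain + (t,)) for s, t in scored if s > 0]
            # A chain with no useful extension is carried over unchanged.
            expanded.extend(extensions or [(total, chain)])
        expanded.sort(key=lambda pair: pair[0], reverse=True)
        beams = expanded[:b]  # keep only the b best chains
    return beams


# Hypothetical toy KG, invented for illustration.
kg = [
    ("Alice Moreau", "mother", "Carol Moreau"),
    ("Carol Moreau", "birthplace", "Lyon"),
    ("Lyon", "famous for", "silk weaving"),
    ("Bob Stone", "occupation", "painter"),
]
question = "What is the birthplace of the mother of Alice Moreau?"
beams = beam_search(question, kg, overlap_score)
for total, chain in beams:
    print(total, chain)
```

Because this toy scorer has no stop option, chains keep growing as long as any triple overlaps with the context. That is the kind of over-extension adaptive termination is meant to prevent.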

Comparison to Prior Work

vs. Standard RAG: TRACE constructs a KG and filters noise via reasoning chains instead of using raw documents
vs. Refine/Select-Reader: TRACE operates at the granularity of triples rather than documents or sentences, allowing finer-grained noise reduction
vs. Graph-RAG (Microsoft) [not cited in paper]: TRACE builds query-specific reasoning chains dynamically rather than just summarizing graph communities
vs. Tree of Thoughts (ToT) [not cited in paper]: TRACE applies tree search (beam search) specifically over a constructed KG for retrieval, rather than generic thought generation

Limitations

Heavy reliance on LLM calls for KG generation (one call per document) and chain construction (step-by-step)
Performance depends on the quality of the ad-hoc KG generation; errors in triple extraction propagate
KG generation is independent of the question, potentially missing question-specific nuances during extraction
Evaluated only in a zero-shot setting; fine-tuning performance is unknown

Reproducibility

Code: https://github.com/jyfang6/trace

📊 Experiments & Results

Evaluation Setup

Zero-shot Multi-hop Question Answering

Benchmarks:

HotpotQA (Multi-hop reasoning QA)
2WikiMultihopQA (Multi-hop reasoning QA)
MuSiQue (Multi-hop reasoning QA)

Metrics:

Exact Match (EM)
F1 score
Statistical methodology: Not explicitly reported in the paper
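For reference, EM and token-level F1 are commonly computed as in the sketch below. It assumes the widely used SQuAD-style answer normalization (lowercasing, stripping punctuation and English articles). This summary does not state which normalization the paper uses, so treat that part as an assumption.

```python
import re
import string
from collections import Counter

PUNCTUATION = set(string.punctuation)


def normalize(text: str) -> str:
    """SQuAD-style normalization (assumed, not confirmed by the source)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, truth: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(truth))


def f1_score(prediction: str, truth: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize(prediction).split()
    true_tokens = normalize(truth).split()
    overlap = sum((Counter(pred_tokens) & Counter(true_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(true_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("The Lyon.", "lyon"))
print(f1_score("city of Lyon", "Lyon"))
```

EM gives no partial credit, while F1 rewards a prediction that contains the gold answer along with extra tokens. That is why the two metrics are usually reported together.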

Key Results

Main comparison results demonstrating TRACE variants outperform baselines across three datasets.
Benchmark	Metric	Baseline	This Paper	Δ
HotpotQA	EM	34.83	47.77	+12.94
2WikiMultihopQA	EM	30.43	43.14	+12.71
MuSiQue	EM	15.05	31.48	+16.43
HotpotQA	EM	43.03	47.77	+4.74
HotpotQA	EM	43.51	47.77	+4.26
HotpotQA	EM	43.76	47.77	+4.01
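As a quick arithmetic check, the Δ column and the average gain can be recomputed from the first three rows of the table. The average of those rows matches the +14.03 figure quoted against standard RAG in the Evaluation Highlights, so the gains are absolute differences in EM points. The variable names below are illustrative.

```python
# (benchmark, baseline EM, TRACE EM) taken from the first three table rows.
rows = [
    ("HotpotQA", 34.83, 47.77),
    ("2WikiMultihopQA", 30.43, 43.14),
    ("MuSiQue", 15.05, 31.48),
]

# Absolute EM gain per benchmark, rounded to the table's two decimals.
deltas = {name: round(ours - base, 2) for name, base, ours in rows}

# Unweighted mean of the three gains.
average_gain = round(sum(deltas.values()) / len(deltas), 2)

print(deltas)
print(average_gain)
```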

Experiment Figures

Impact of different beam sizes (b) on Exact Match (EM) performance across the three datasets.

Performance comparison between adaptive chain termination and fixed chain lengths (L=2, 3, 4).

Main Takeaways

TRACE variants (Triple and Doc) consistently outperform Standard RAG and other baselines (Select-Reader, Refine-Reader) across all datasets.
Using reasoning chains (triples) directly as context (TRACE-Triple) performs comparably or better than using full documents (TRACE-Doc), suggesting triples are sufficient and reduce noise.
Adaptive chain termination is crucial; forcing fixed lengths hurts performance because different questions require different reasoning depths.
Beam search helps explore multiple reasoning paths, improving robustness over greedy decoding (b=1).

📚 Prerequisite Knowledge

Prerequisites

Retrieval-Augmented Generation (RAG)
Knowledge Graphs (KG) and triples
In-context learning / Prompting LLMs
Beam search decoding

Key Terms

RAG: Retrieval-Augmented Generation, an approach in which a system first retrieves relevant documents and then conditions a language model's answer on them

Knowledge Graph (KG): A structured representation of knowledge using triples (head entity, relation, tail entity)

KG Generator: A module that prompts an LLM to extract knowledge triples from unstructured text

Reasoning Chain: A sequence of logically connected knowledge triples used to derive an answer

Autoregressive: A process where the current output depends on previous outputs (here, selecting the next triple based on previous ones)

Bi-encoder: A model architecture that encodes the query and candidate items (triples) separately into vector embeddings for efficient similarity comparison

Exact Match (EM): A metric measuring the percentage of predictions that match the ground-truth answer exactly, typically after light normalization such as lowercasing and stripping punctuation

Triple Ranker: A component that scores the relevance of KG triples to the current reasoning context

Triple Selector: A component that chooses the best triple from candidates to extend the reasoning chain

Beam Search: A search algorithm that keeps only the b highest-scoring partial candidates (here, reasoning chains) at each step and expands those, rather than following a single greedy path

Lost-in-the-middle: A phenomenon where LLMs use information less reliably when it sits in the middle of a long input context than when it appears near the beginning or end