
Dr3: Ask Large Language Models Not to Give Off-Topic Answers in Open Domain Multi-Hop Question Answering

Y Gao, Y Zhu, Y Cao, Y Zhou, Z Wu, Y Chen, S Wu…

Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences

arXiv, March 2024

QA RAG Agent Reasoning

📝 Paper Summary

Open Domain Multi-Hop Question Answering (ODMHQA) Hallucination mitigation

Dr3 is a post-hoc mechanism that detects irrelevant answers in multi-hop QA using an LLM-based discriminator and iteratively corrects the reasoning chain via backtracking.

Core Problem

LLMs frequently generate off-topic answers (irrelevant to the question type) in open-domain multi-hop QA due to error propagation in reasoning, planning, and retrieval.

Why it matters:

Off-topic answers account for approximately 1/3 of incorrect answers in ODMHQA tasks, significantly degrading performance
Existing methods like ReAct intertwine reasoning and planning, making it difficult to isolate and fix the specific step (Decomposition, Sub-Question, or Composition) causing the drift

Concrete Example: For the question 'In which year was David Beckham's wife born?', an LLM might answer 'Barack Obama' (a name, not a year). Dr3 detects this type mismatch and backtracks to find the correct year.

Key Novelty

Discriminate-Re-Compose-Re-Solve-Re-Decompose (Dr3)

Discriminator: Uses the LLM itself to judge if a generated answer matches the expected semantic type of the question (e.g., Year vs. Person)
Corrector: A backtracking mechanism that systematically revises the solving history in reverse order (Composition → Sub-Question → Decomposition) until the answer is on-topic

Architecture

Figure 4

The workflow of the Dr3 mechanism, including the Discriminator and the three-stage Corrector (Re-Compose, Re-Solve, Re-Decompose).

Evaluation Highlights

Reduces the off-topic answer rate by nearly 13 percentage points compared to the ReAct baseline (e.g., from 22.2% to 9.6% on HotpotQA)
Improves Exact Match (EM) by about 3 points over the ReAct baseline (+3.0 on HotpotQA, +2.6 on 2WikiMultiHopQA)
Demonstrates that 62% of off-topic errors stem from sub-question steps (planning, passage retrieval, reasoning), which the Re-Solve module specifically targets

Breakthrough Assessment

7/10

Solid engineering contribution addressing a specific, prevalent error type (off-topic). The backtracking mechanism is effective, though it relies on heuristic iterative correction rather than a fundamental architectural shift.

⚙️ Technical Details

Problem Definition

Setting: Open Domain Multi-Hop Question Answering (ODMHQA) requiring retrieval from external knowledge sources

Inputs: Complex natural language question Q

Outputs: Final answer Ans

Pipeline Flow

Initial Solve (ReAct+)
Discriminator (Check if Ans is off-topic)
Corrector (If off-topic: Re-Compose → Re-Solve → Re-Decompose loop)
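
The three steps above can be read as a simple control loop. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function names (`dr3_pipeline`, `toy_solve`, and the other `toy_*` stubs) are hypothetical, the stubs stand in for real LLM and retriever calls, and returning the last candidate when every stage fails is an assumption made for illustration.

```python
from typing import Callable, List, Optional


def dr3_pipeline(
    question: str,
    solve: Callable[[str], str],
    is_off_topic: Callable[[str, str], bool],
    correctors: List[Callable[[str, str], Optional[str]]],
) -> str:
    """Solve once, then apply correction stages in order while the answer is off-topic."""
    answer = solve(question)
    if not is_off_topic(question, answer):
        return answer
    # Stages are expected in the order Re-Compose, Re-Solve, Re-Decompose.
    for correct in correctors:
        candidate = correct(question, answer)
        if candidate is None:  # this stage produced nothing new
            continue
        answer = candidate
        if not is_off_topic(question, answer):
            break
    return answer  # assumption: fall back to the last candidate


# Toy stubs (hypothetical) standing in for LLM-backed components.
def toy_solve(question: str) -> str:
    return "Barack Obama"


def toy_is_off_topic(question: str, answer: str) -> bool:
    return not answer.isdigit()  # a "which year" question expects digits


def toy_recompose(question: str, old_answer: str) -> Optional[str]:
    return None  # pretend Re-Compose could not fix the answer


def toy_resolve(question: str, old_answer: str) -> Optional[str]:
    return "1974"


result = dr3_pipeline(
    "In which year was David Beckham's wife born?",
    toy_solve,
    toy_is_off_topic,
    [toy_recompose, toy_resolve],
)
print(result)
```

Passing the stages as a list keeps the ordering explicit and makes it easy to drop a stage, which is how an ablation over Corrector components could be set up.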

System Modules

ReAct+ Solver

Generate initial reasoning chain and answer using explicit sub-question decomposition

Model or implementation: text-davinci-002

Discriminator

Judge whether the generated answer is valid for the question type by prompting the LLM to conceptualize candidate answers

Model or implementation: text-davinci-002
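
A discriminator of this kind can be sketched as a prompt plus a verdict parser. This is an illustrative sketch only: the prompt wording, the `Verdict:` convention, and the names `build_discriminator_prompt`, `is_off_topic`, and `fake_llm` are assumptions, not the prompts from the paper's appendix. The `llm` argument is any callable that maps a prompt string to a reply string.

```python
from typing import Callable


def build_discriminator_prompt(question: str, answer: str) -> str:
    """Ask the model to name the answer's type and compare it with the question."""
    return (
        f"Question: {question}\n"
        f"Candidate answer: {answer}\n"
        "What kind of thing is the candidate answer (e.g., year, person, place)? "
        "Does that kind match what the question asks for?\n"
        "Finish with 'Verdict: on-topic' or 'Verdict: off-topic'."
    )


def is_off_topic(question: str, answer: str, llm: Callable[[str], str]) -> bool:
    """Return True when the model's reply ends in an off-topic verdict."""
    reply = llm(build_discriminator_prompt(question, answer))
    return "verdict: off-topic" in reply.lower()


# Hypothetical stand-in for a real LLM call, used only to exercise the parser.
def fake_llm(prompt: str) -> str:
    if "Candidate answer: Barack Obama" in prompt:
        return "The answer is a person, the question asks for a year. Verdict: off-topic"
    return "The answer is a year, which matches. Verdict: on-topic"


question = "In which year was David Beckham's wife born?"
print(is_off_topic(question, "Barack Obama", fake_llm))
print(is_off_topic(question, "1974", fake_llm))
```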

Corrector (Re-Compose)

Retry the final composition step with a negative constraint hint ('The answer is not [Ans_old]')

Model or implementation: text-davinci-002
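
The negative constraint hint can be illustrated with a small prompt builder. This is a sketch under assumptions: only the hint sentence 'The answer is not [Ans_old]' comes from the description above, while the surrounding prompt layout and the name `recompose_prompt` are hypothetical.

```python
from typing import List


def recompose_prompt(question: str, conclusions: List[str], old_answer: str) -> str:
    """Rebuild the composition prompt and append the negative constraint hint."""
    lines = [f"Question: {question}"]
    for i, conclusion in enumerate(conclusions, start=1):
        lines.append(f"Conclusion {i}: {conclusion}")
    lines.append(f"The answer is not {old_answer}.")  # hint from the off-topic attempt
    lines.append("Answer:")
    return "\n".join(lines)


prompt = recompose_prompt(
    "In which year was David Beckham's wife born?",
    ["David Beckham's wife is Victoria Beckham.", "Victoria Beckham was born in 1974."],
    "Barack Obama",
)
print(prompt)
```

Only the final composition step is retried here; the sub-question conclusions are reused unchanged, which is what makes this the cheapest correction stage to try first.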

Corrector (Re-Solve)

Backtrack to sub-questions, replacing retrieved passages with lower-probability alternatives and re-solving them

Model or implementation: text-davinci-002
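
One plausible reading of "lower-probability alternatives" is to walk down the retriever's ranked list and skip passages that were already tried. The sketch below assumes exactly that; the name `next_passage` and the passage identifiers are hypothetical, and the paper's actual selection rule may differ.

```python
from typing import List, Optional, Set


def next_passage(ranked_passages: List[str], used: Set[str]) -> Optional[str]:
    """Return the highest-ranked passage not used yet, or None if all were tried."""
    for passage in ranked_passages:  # assumed to be sorted best-first
        if passage not in used:
            return passage
    return None


ranked = ["passage_rank1", "passage_rank2", "passage_rank3"]
used = {"passage_rank1"}  # the passage behind the off-topic answer

replacement = next_passage(ranked, used)
print(replacement)
```

Returning `None` when the list is exhausted gives the caller a clear signal to move on to the next correction stage.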

Corrector (Re-Decompose)

Regenerate the initial decomposition of the complex question if previous steps fail

Model or implementation: text-davinci-002

Novel Architectural Elements

ReAct+: A structured variant of ReAct that decouples Sub-Questions into (Task, Action, Observation, Conclusion) tuples to facilitate targeted backtracking
Hierarchical backtracking mechanism (Re-Compose → Re-Solve → Re-Decompose) specifically designed for QA reasoning chains
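
The (Task, Action, Observation, Conclusion) structure lends itself to a simple record type. The following is a minimal sketch, assuming hypothetical class names (`SubQuestion`, `SolvingHistory`) and illustrative field contents; it is not the data model from the released code.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SubQuestion:
    task: str          # what this step tries to find out
    action: str        # e.g., a search call issued to the retriever
    observation: str   # the passage returned by the action
    conclusion: str    # the intermediate answer drawn from the observation


@dataclass
class SolvingHistory:
    question: str
    sub_questions: List[SubQuestion] = field(default_factory=list)
    answer: str = ""


history = SolvingHistory(question="In which year was David Beckham's wife born?")
history.sub_questions.append(
    SubQuestion(
        task="Who is David Beckham's wife?",
        action="Search[David Beckham]",
        observation="(retrieved passage text)",
        conclusion="Victoria Beckham",
    )
)
history.sub_questions.append(
    SubQuestion(
        task="In which year was Victoria Beckham born?",
        action="Search[Victoria Beckham]",
        observation="(retrieved passage text)",
        conclusion="1974",
    )
)
history.answer = history.sub_questions[-1].conclusion
print(history.answer)
```

Keeping each step as a separate record is what allows a corrector to swap out one observation or one conclusion without regenerating the whole chain.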

Modeling

Base Model: text-davinci-002 (InstructGPT)

Compute: Not reported in the paper

Comparison to Prior Work

vs. ReAct: ReAct+ explicitly structures sub-questions; Dr3 adds a post-hoc discriminator and backtracking correction loop
vs. Self-Consistency: Dr3 relies on semantic type checking (off-topic detection) rather than statistical consensus [not cited in paper]
vs. Reflexion: Dr3 uses specific heuristic backtracking stages (Compose, Solve, Decompose) rather than general textual reflection [not cited in paper]

Limitations

Relies on the intrinsic capability of the LLM itself to discriminate off-topic answers; if the LLM fails to discriminate, the method fails
Inference cost increases due to iterative backtracking (multiple calls to LLM and retriever)
Correction strategies are heuristic (e.g., replacing passages) and might not fix fundamental reasoning errors if valid information is missing

Reproducibility

Code: https://github.com/Gy915/Dr3

Code and data available at https://github.com/Gy915/Dr3. Prompts for ReAct+, Discriminator, and Corrector modules are detailed in Appendices A, B, and C.

📊 Experiments & Results

Evaluation Setup

Open-domain setting (question only, no paired context) for multi-hop QA

Benchmarks:

HotpotQA: 2-hop QA over Wikipedia
2WikiMultiHopQA: multi-hop QA over Wikipedia and Wikidata

Metrics:

Exact Match (EM)
F1 score
Cover Exact Match (Cover EM)

Statistical methodology: Not explicitly reported in the paper
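
For readers unfamiliar with these metrics, the sketch below shows how they are commonly computed in QA evaluation. It assumes SQuAD-style answer normalization (lowercasing, stripping punctuation and English articles) and treats Cover EM as "the normalized gold answer appears inside the normalized prediction"; the paper's exact evaluation script may differ, and all function names here are illustrative.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(pred: str, gold: str) -> bool:
    return normalize(pred) == normalize(gold)


def cover_exact_match(pred: str, gold: str) -> bool:
    # Assumption: the gold answer only needs to be contained in the prediction.
    return normalize(gold) in normalize(pred)


def f1_score(pred: str, gold: str) -> float:
    """Token-level F1 between the normalized prediction and gold answer."""
    pred_tokens = normalize(pred).split()
    gold_tokens = normalize(gold).split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("She was born in 1974.", "1974"))
print(cover_exact_match("She was born in 1974.", "1974"))
print(f1_score("born in 1974", "1974"))
```

Cover EM is more lenient than EM because verbose but correct predictions still count, which matters for free-form LLM output.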

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
HotpotQA	EM	27.4	30.4	+3.0
HotpotQA	F1	34.6	38.5	+3.9
2WikiMultiHopQA	EM	26.8	29.4	+2.6
HotpotQA	Off-Topic Rate	22.2	9.6	-12.6

An ablation study of the Corrector components shows that Re-Solve (fixing sub-questions) is the most critical component.
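
The Δ column is an absolute difference in points, which is easy to confuse with a relative change. The short sketch below recomputes both from the table rows; the numbers are copied from the table, while the function name `summarize` and the relative-change column are additions for illustration.

```python
# Rows copied from the Key Results table: (benchmark, metric, baseline, this paper).
rows = [
    ("HotpotQA", "EM", 27.4, 30.4),
    ("HotpotQA", "F1", 34.6, 38.5),
    ("2WikiMultiHopQA", "EM", 26.8, 29.4),
    ("HotpotQA", "Off-Topic Rate", 22.2, 9.6),
]


def summarize(table):
    """Return (benchmark, metric, absolute delta, relative change in %) per row."""
    summary = []
    for benchmark, metric, baseline, ours in table:
        delta = round(ours - baseline, 1)
        relative = round(100 * (ours - baseline) / baseline, 1)
        summary.append((benchmark, metric, delta, relative))
    return summary


summary = summarize(rows)
for benchmark, metric, delta, relative in summary:
    print(f"{benchmark:16s} {metric:15s} delta={delta:+.1f} points, relative={relative:+.1f}%")
```

Read this way, the 12.6-point drop in off-topic rate on HotpotQA corresponds to removing more than half of the baseline's off-topic answers.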

Experiment Figures

Bar chart showing the prevalence of off-topic answers across different datasets (HotpotQA, 2Wiki) and methods (IO, CoT, ReAct).

Pie chart breaking down the causes of off-topic answers in HotpotQA.

Main Takeaways

Off-topic answers are a major source of error (~1/3 of incorrect answers) in LLM-based multi-hop QA.
Re-Solve (backtracking to sub-questions and replacing retrieved passages) provides the largest gain, addressing the 62% of errors stemming from sub-question steps.
The Discriminator successfully identifies off-topic answers by checking semantic consistency between the question and answer.

📚 Prerequisite Knowledge

Prerequisites

ReAct prompting paradigm (Reasoning + Acting)
Multi-hop Question Answering structure
Basic understanding of LLM hallucination

Key Terms

ODMHQA: Open Domain Multi-Hop Question Answering—answering complex questions by reasoning over multiple retrieved documents

ReAct: Reasoning-Acting—a prompting method where LLMs interleave reasoning traces (thoughts) and actions (like search) to solve problems

ReAct+: A modified version of ReAct proposed in this paper that explicitly decomposes complex questions into structured Sub-Questions (Task, Action, Observation, Conclusion)

Off-topic answer: A generated answer that is semantically irrelevant to the question (e.g., answering a 'When' question with a location)

Sub-Question: An intermediate step in the reasoning chain containing a specific sub-task, action, observation, and intermediate conclusion

ColBERTv2: A retrieval model that encodes queries and documents into token-level vectors for efficient and accurate search
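
To make "token-level vectors" concrete, the sketch below shows the late-interaction (MaxSim) scoring used by the ColBERT family: each query token vector is matched to its most similar document token vector, and the per-token maxima are summed. The two-dimensional vectors are toy values rather than real embeddings, and the function names are illustrative; this is not the ColBERTv2 library API.

```python
from typing import List

Vector = List[float]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_vecs: List[Vector], doc_vecs: List[Vector]) -> float:
    """Sum, over query tokens, of the best dot product with any document token."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


# Toy token embeddings (assumed unit vectors, purely illustrative).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0]]  # covers both query tokens
doc_b = [[1.0, 0.0], [1.0, 0.0]]  # covers only the first query token

score_a = maxsim_score(query, doc_a)
score_b = maxsim_score(query, doc_b)
print(score_a, score_b)
```

Because every query token is scored separately, a document that covers all parts of a sub-question ranks above one that matches a single term strongly.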