
To Retrieve or Not to Retrieve? Uncertainty Detection for Dynamic Retrieval Augmented Generation

KD Dhole
University of Southern California
arXiv, 1/2025
RAG Factuality QA

📝 Paper Summary

Modularized RAG pipeline
This paper evaluates black-box uncertainty detection methods as triggers for retrieval in RAG systems, so that retrieval runs only when the model is unsure, trading off accuracy against retrieval cost.
Core Problem
Most RAG systems retrieve for every query regardless of need, which is inefficient and costly, while existing conditional methods (such as token-probability checks) fail to accurately identify true knowledge gaps.
Why it matters:
  • Constant retrieval is computationally expensive and slow for long-form generation tasks
  • Retrieving irrelevant information when the model already knows the answer can degrade performance
  • Rigid heuristics for triggering retrieval often miss subtle hallucinations or unnecessary calls
Concrete Example: For the question 'Which film has the director who died first?', a model might confidently generate a partial answer but hallucinate the death date. An efficient system should detect this specific uncertainty and trigger retrieval only for the missing date, rather than retrieving for the whole query.
Key Novelty
Dynamic Retrieval via Uncertainty Detection Metrics
  • Instead of always retrieving, the system generates a temporary response and measures its uncertainty using metrics such as Semantic Sets or Eigenvalue Laplacian (based on the eigenvalues of a graph Laplacian built from response similarities)
  • If uncertainty exceeds a threshold, the system triggers retrieval using a generated sub-query; otherwise, it uses the model's internal knowledge
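The two steps above can be sketched as a small gating function. This is a minimal illustration under stated assumptions, not the paper's implementation: every callable (`sample_drafts`, `score_uncertainty`, `make_subquery`, `retrieve`, `generate_with_context`) and the `threshold` value are hypothetical placeholders for whatever model, uncertainty metric, and retriever are actually in use.

```python
from typing import Callable, List, Tuple


def answer_with_dynamic_retrieval(
    question: str,
    sample_drafts: Callable[[str], List[str]],        # hypothetical: draws several temporary answers
    score_uncertainty: Callable[[List[str]], float],  # hypothetical: black-box uncertainty metric
    make_subquery: Callable[[str, str], str],         # hypothetical: builds a sub-query from question + draft
    retrieve: Callable[[str], List[str]],             # hypothetical: returns passages for a query
    generate_with_context: Callable[[str, List[str]], str],  # hypothetical: answers using passages
    threshold: float,                                 # assumed to be tuned on held-out data
) -> Tuple[str, bool]:
    """Return (answer, retrieval_was_used)."""
    drafts = sample_drafts(question)
    if not drafts:
        raise ValueError("sample_drafts must return at least one draft")

    uncertainty = score_uncertainty(drafts)

    # Confident enough: keep the answer from the model's internal knowledge.
    if uncertainty <= threshold:
        return drafts[0], False

    # Uncertain: retrieve with a generated sub-query, then answer again.
    subquery = make_subquery(question, drafts[0])
    passages = retrieve(subquery)
    return generate_with_context(question, passages), True
```

Returning the boolean alongside the answer makes it easy to count retrieval calls when comparing against an "Always Retrieve" baseline.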
Evaluation Highlights
  • Eccentricity-based uncertainty detection achieves an F1 score of 0.605 on 2WikiMultihopQA, outperforming the 'Always Retrieve' baseline (0.552) while using fewer retrieval calls
  • Degree Matrix (Jaccard) reduces retrieval calls significantly while maintaining an F1 score (0.524) comparable to 'Always Retrieve' (0.538-0.552)
  • Semantic Sets clustering performed poorly with an F1 score of 0.411, suggesting that semantic diversity alone is not a reliable trigger for this task
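To illustrate how a graph-based score such as Degree Matrix (Jaccard) can be computed from sampled answers alone, the sketch below uses one common black-box formulation: build a pairwise Jaccard similarity matrix W over m sampled responses, take the degree D_ii as the sum of row i, and report trace(mI - D) / m^2. Whether the paper uses exactly this normalization is an assumption, as are the whitespace tokenization and the function names `jaccard` and `degree_uncertainty`.

```python
from typing import List


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the lowercase word sets of two strings."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 1.0  # two empty responses are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)


def degree_uncertainty(responses: List[str]) -> float:
    """Degree-matrix uncertainty: trace(m*I - D) / m**2.

    D[i][i] is the summed similarity of response i to all m responses
    (itself included). The score is 0.0 when all responses agree and
    grows as the sampled responses diverge.
    """
    m = len(responses)
    if m == 0:
        raise ValueError("need at least one response")
    total = 0.0
    for i in range(m):
        degree = sum(jaccard(responses[i], responses[j]) for j in range(m))
        total += m - degree
    return total / (m * m)
```

With this definition, three identical answers score 0.0, and two answers that share no words score 0.5; a retrieval trigger would compare the score against a tuned threshold.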
Breakthrough Assessment
4/10
Provides a useful comparative analysis of existing uncertainty metrics for RAG, but the proposed combination is an application of known methods rather than a fundamental algorithmic breakthrough. Sample sizes are small.