After Retrieval, Before Generation: Enhancing the Trustworthiness of Large Language Models in RAG

📝 Paper Summary

Modularized RAG pipeline, Hallucination suppression

BRIDGE is a framework that dynamically weights internal versus external knowledge reliance and selects optimal response strategies (including refusal) to handle conflicting or unreliable information in RAG systems.

Core Problem

RAG models struggle to balance internal parametric knowledge with retrieved external knowledge, often failing when sources conflict, contain errors, or when both are unreliable.

Why it matters:

Current systems often either trust retrieval blindly (leaving them vulnerable to poisoning) or cling to outdated internal beliefs, with no way to adapt per question
Existing approaches typically address single scenarios (e.g., always retrieving or always refusing) but lack a unified framework for all real-world conditions
Refusal mechanisms are critical for safety but are frequently overlooked in standard RAG pipelines, leading to hallucinations when no good information exists

Concrete Example: For the question 'What is the CPU of iPhone 16?', a model with a 2023 cutoff has no internal knowledge. If retrieval returns poisoned text saying 'A17', standard RAG generates the wrong answer. Ideally, the model should verify evidence and refuse to answer if neither source is credible.

Key Novelty

Biased Retrieval and Generation Evaluation (BRIDGE)

Introduces 'soft bias': an adaptive weighting mechanism that predicts how much a question relies on retrieval vs. internal knowledge before generation
Uses a multi-granularity scorer to compare internal knowledge, external knowledge, and generated sub-queries to detect consistency
Selects the final strategy (Faithful to All, Faithful to Internal, Faithful to External, or Refuse) via an interpretable decision tree based on trust scores

Architecture

The BRIDGE framework workflow, detailing the two main stages: Bias-Guided Knowledge Collection and Bias-Guided Knowledge Evaluation.

Evaluation Highlights

Outperforms baselines by 5–15% in accuracy on the proposed TRD benchmark while maintaining balanced performance across scenarios
Achieves superior refusal rates when appropriate (Refuse-to-Answer scenario), whereas standard RAG baselines refuse <10% of the time
Demonstrates robustness by improving performance on out-of-domain datasets (RealtimeQA, HotpotQA-Poisoned) using hyperparameters tuned only on TRD

Breakthrough Assessment

8/10

Offers a comprehensive solution to the 'knowledge conflict' problem in RAG by unifying refusal, internal adherence, and external adherence into one framework. The construction of the TRD benchmark is also a significant contribution.

⚙️ Technical Details

Problem Definition

Setting: Retrieval-Augmented Generation where both internal parametric knowledge and retrieved documents may be noisy or conflicting

Inputs: Natural language question q

Outputs: Final response strategy (Faithful to All/Internal/External, or Refuse) and the generated answer

Pipeline Flow

Group: Bias-Guided Knowledge Collection: Allocator → Multi-query Generator/Retriever
Group: Bias-Guided Knowledge Evaluation: Scorer → Decision Tree → Reflection (optional)
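The two stages can be read as a simple composition of four callables. The sketch below is only an illustration of that control flow: the names `SoftBias`, `Knowledge`, and `run_bridge` are hypothetical, the optional Reflection step is omitted, and the real modules are LLM- and retriever-backed rather than plain functions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class SoftBias:
    r_p: float  # retrieval dependency predicted by the Allocator
    g_p: float  # generation dependency predicted by the Allocator


@dataclass
class Knowledge:
    internal: List[str]  # answers produced from the LLM's own parameters
    external: List[str]  # passages returned by the retriever


def run_bridge(
    question: str,
    allocate: Callable[[str], SoftBias],
    collect: Callable[[str, SoftBias], Knowledge],
    score: Callable[[Knowledge], Dict[str, float]],
    decide: Callable[[SoftBias, Dict[str, float]], str],
) -> str:
    # Stage 1: Bias-Guided Knowledge Collection
    bias = allocate(question)
    knowledge = collect(question, bias)
    # Stage 2: Bias-Guided Knowledge Evaluation
    scores = score(knowledge)
    return decide(bias, scores)
```

Passing the stages in as callables keeps the sketch independent of any particular LLM or retriever and makes each stage easy to stub out.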

System Modules

Allocator (Bias-Guided Knowledge Collection)

Predict retrieval dependency (r_p) and generation dependency (g_p) probabilities to guide resource allocation

Model or implementation: Llama-3-8B-Instruct (tuned via GRPO or prompted via ICL)

Multi-query Generator/Retriever (Bias-Guided Knowledge Collection)

Generate sub-queries and execute them via generation (for internal knowledge) or retrieval (for external knowledge) based on soft bias

Model or implementation: LLM for generation, Retriever for search
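One plausible way to turn the soft bias into a sub-query budget is a proportional split. The function below is a minimal sketch under that assumption; `split_budget` and its rounding rule are hypothetical and may not match how the paper allocates sub-queries.

```python
from typing import Tuple


def split_budget(r_p: float, g_p: float, total: int) -> Tuple[int, int]:
    """Return (n_retrieval, n_generation) sub-query counts.

    Assumption: the budget is divided in proportion to the soft bias,
    and any rounding remainder goes to the generation stream.
    """
    if total < 0:
        raise ValueError("total must be non-negative")
    mass = r_p + g_p
    if mass <= 0:
        raise ValueError("r_p + g_p must be positive")
    n_retrieval = round(total * r_p / mass)
    return n_retrieval, total - n_retrieval
```

For example, with `r_p = 70`, `g_p = 30`, and a budget of four sub-queries, three would be sent to the retriever and one answered from internal knowledge.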

Scorer (Bias-Guided Knowledge Evaluation)

Compute similarity scores between four knowledge sources (K_int, K_ext, K_gen, K_ret)

Model or implementation: BGE-m3 encoder
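As a rough illustration of the scoring step, the sketch below computes cosine similarity for every pair of the four knowledge sources. It assumes each source has already been embedded into a single vector by an encoder such as BGE-m3; the toy vectors and the function names are hypothetical, and the paper's multi-granularity scoring may combine more signals than one dense similarity.

```python
import math
from itertools import combinations
from typing import Dict, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def pairwise_scores(
    embeddings: Dict[str, Sequence[float]],
) -> Dict[Tuple[str, str], float]:
    """Score every unordered pair of sources (keys are sorted by name)."""
    return {
        (a, b): cosine(embeddings[a], embeddings[b])
        for a, b in combinations(sorted(embeddings), 2)
    }


# Toy 2-d embeddings standing in for real encoder outputs.
toy = {
    "K_int": [1.0, 0.0],
    "K_ext": [0.0, 1.0],
    "K_gen": [1.0, 0.0],
    "K_ret": [1.0, 1.0],
}
scores = pairwise_scores(toy)
```

Four sources give six pairwise scores; in the toy data, `K_int` and `K_gen` agree perfectly while `K_int` and `K_ext` are orthogonal.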

Maximum Soft-bias Decision Tree (Bias-Guided Knowledge Evaluation)

Select the optimal response strategy based on trust scores derived from soft bias and matching scores

Model or implementation: Rule-based decision tree with learned thresholds
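The summary does not spell out the tree's rules, so the following is a purely hypothetical sketch of how a threshold-based tree over trust scores could look. The trust formula (dependency times consistency), the branch order, and the default values of `alpha` and `beta` are all assumptions made for illustration.

```python
def choose_strategy(
    g_p: float,
    r_p: float,
    s_int: float,
    s_ext: float,
    s_agree: float,
    alpha: float = 0.5,
    beta: float = 0.8,
) -> str:
    """Pick a response strategy from soft bias and consistency scores.

    g_p, r_p : soft bias in [0, 1] (generation vs. retrieval dependency)
    s_int    : self-consistency of the internal knowledge, in [0, 1]
    s_ext    : self-consistency of the external knowledge, in [0, 1]
    s_agree  : agreement between internal and external knowledge, in [0, 1]
    alpha    : minimum trust needed to answer at all (assumed threshold)
    beta     : minimum agreement to treat both sources as valid (assumed)
    """
    # Assumed trust score: dependency weighted by consistency.
    trust_int = g_p * s_int
    trust_ext = r_p * s_ext

    if max(trust_int, trust_ext) < alpha:
        return "Refuse"
    if s_agree >= beta:
        return "Faithful to All"
    if trust_int >= trust_ext:
        return "Faithful to Internal"
    return "Faithful to External"
```

In the iPhone 16 example, a model with no internal knowledge and a poisoned passage would yield low trust on both sides, which this sketch maps to "Refuse".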

Novel Architectural Elements

Soft-bias allocator that dynamically distributes sub-query budget between generation (internal) and retrieval (external) streams
Maximum Soft-bias Decision Tree that integrates dependency probabilities with multi-granularity consistency scores to decide refusal or adherence

Modeling

Base Model: Llama-3-8B-Instruct (Allocator and Generator)

Training Method: Group Relative Policy Optimization (GRPO) for Allocator

Objective Functions:

Purpose: Encourage the model to predict bias directions consistent with ground-truth labels.
Formally: Direction Reward checks whether sign(g_p - r_p) matches the label.

Purpose: Ensure output follows the specified format.
Formally: Format Reward checks for correct XML tag structure.

Purpose: Ensure probabilities sum to 100%.
Formally: Sum Reward checks whether r_p + g_p = 100%.

Purpose: Assess the quality of the reasoning path.
Formally: Analysis Quality Reward uses a reward model to score the generated rationale.
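The three rule-based rewards above can be sketched as simple checks on the Allocator's text output. The tag names (`<analysis>`, `<retrieval>`, `<generation>`), the binary 0/1 reward values, and the tolerance are assumptions for illustration, since the summary does not give the exact format. The Analysis Quality Reward needs a learned reward model and is left out.

```python
import re
from typing import Optional, Tuple

# Hypothetical output format: rationale, then the two percentages.
_PATTERN = re.compile(
    r"^\s*<analysis>(.+?)</analysis>\s*"
    r"<retrieval>(\d+(?:\.\d+)?)</retrieval>\s*"
    r"<generation>(\d+(?:\.\d+)?)</generation>\s*$",
    re.DOTALL,
)


def parse(output: str) -> Optional[Tuple[float, float]]:
    """Return (r_p, g_p) in percent, or None if the format is wrong."""
    match = _PATTERN.match(output)
    if match is None:
        return None
    return float(match.group(2)), float(match.group(3))


def _sign(x: float) -> int:
    return (x > 0) - (x < 0)


def format_reward(output: str) -> float:
    return 1.0 if parse(output) is not None else 0.0


def sum_reward(output: str, tol: float = 1e-6) -> float:
    parsed = parse(output)
    if parsed is None:
        return 0.0
    r_p, g_p = parsed
    return 1.0 if abs(r_p + g_p - 100.0) <= tol else 0.0


def direction_reward(output: str, label_g_p: float, label_r_p: float) -> float:
    """1.0 if sign(g_p - r_p) matches the sign implied by the label."""
    parsed = parse(output)
    if parsed is None:
        return 0.0
    r_p, g_p = parsed
    return 1.0 if _sign(g_p - r_p) == _sign(label_g_p - label_r_p) else 0.0
```

A response that puts 80 on retrieval and 20 on generation would earn the direction reward against a hard label of (g_p = 0, r_p = 100), since both differences are negative.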

Adaptation: Full fine-tuning of Allocator (Llama-3-8B-Instruct)
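GRPO does not train a separate value network; it samples a group of outputs for the same prompt and scores each one relative to the group. The sketch below shows that normalization step only. It assumes the per-output reward is a single scalar (for instance, a sum of the individual rewards, which is an assumption here) and uses the population standard deviation with a small `eps` for stability.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(
    rewards: Sequence[float], eps: float = 1e-8
) -> List[float]:
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    if not rewards:
        raise ValueError("rewards must be non-empty")
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

When every output in a group earns the same reward, all advantages are zero and that group contributes no gradient signal.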

Training Data:

TRD dataset constructed from Natural Questions (NQ) and TAQA
Soft bias labels derived from hard labels (0/100) via reasoning models (DeepSeek-R1, o3-mini)

Key Hyperparameters:

learning_rate: Not reported in the paper
batch_size: Not reported in the paper
epochs: Not reported in the paper

Compute: Not reported in the paper

Comparison to Prior Work

vs. SelfRAG: BRIDGE explicitly models soft dependency on internal/external knowledge rather than just generating tokens; BRIDGE includes a refusal mechanism for low-confidence cases
vs. RobustRAG/USC/OPIN: BRIDGE offers a unified framework handling all four scenarios (conflicts, poisoning, outdated info, refusal) rather than optimizing for just one (e.g., robustness or consistency)
vs. CRAG [not cited in paper]: Corrective RAG focuses on refining retrieval quality; BRIDGE focuses on decision-making *after* retrieval but *before* generation

Limitations

Dependency on the quality of the BGE-m3 scorer for similarity metrics
Thresholds (alpha, beta) require tuning on a validation set (TRD)
Computational cost of generating multiple sub-queries and computing multiple similarity scores

Reproducibility

Code: https://github.com/Kangkang625/BRIDGE

The code and the TRD benchmark are publicly available in the repository above, and TRD dataset construction details are provided. Specific hyperparameters for GRPO training (learning rate, batch size) are not detailed in the main text.

📊 Experiments & Results

Evaluation Setup

Open-domain QA with scenarios involving correct/incorrect internal and external knowledge

Benchmarks:

TRD (Trustworthiness Response Dataset): multiple-choice QA with a refusal option [New]
RealtimeQA: dynamic news QA (Faithful to External scenario)
HotpotQA Poisoned: QA with corrupted retrieval (Faithful to Internal scenario)
ConflictBank: QA with temporal mismatch (Faithful to Internal scenario)

Metrics:

Accuracy (Acc)
Rejection Rate (RR)
Statistical methodology: Not explicitly reported in the paper
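Both metrics are simple proportions over the evaluation set. The sketch below assumes predictions and gold answers are option strings and that refusal is encoded with a hypothetical `REFUSE` label, so a refusal counts as correct only when the gold label is also a refusal. The paper's exact scoring convention may differ.

```python
from typing import Sequence

REFUSE = "REFUSE"  # hypothetical label for the refusal option


def accuracy(predictions: Sequence[str], golds: Sequence[str]) -> float:
    """Percentage of predictions that exactly match the gold answer."""
    if len(predictions) != len(golds) or not golds:
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(p == g for p, g in zip(predictions, golds))
    return 100.0 * correct / len(golds)


def rejection_rate(predictions: Sequence[str]) -> float:
    """Percentage of questions on which the system refused to answer."""
    if not predictions:
        raise ValueError("predictions must be non-empty")
    refused = sum(p == REFUSE for p in predictions)
    return 100.0 * refused / len(predictions)
```

Reporting the two together matters: a system can raise its rejection rate by refusing everything, which only helps accuracy on questions where refusal is the gold answer.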

Key Results

Main results on the TRD dataset, showing BRIDGE's superior accuracy across diverse scenarios compared to baselines:

Benchmark	Metric	Baseline	This Paper	Δ
TRD (Overall)	Accuracy	44.75	63.22	+18.47
TRD (Overall)	Accuracy	51.17	63.22	+12.05

Performance on scenario-specific datasets, demonstrating generalization and robustness:

Benchmark	Metric	Baseline	This Paper	Δ
RealtimeQA	Accuracy	66.52	68.23	+1.71
HotpotQA Poisoned	Accuracy	46.10	52.80	+6.70
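The Δ column reports absolute differences in accuracy points. The short check below recomputes those differences from the table values and also derives the relative improvement over each baseline, which is not a figure taken from the source. The helper name `deltas` is hypothetical.

```python
from typing import List, Tuple

# (benchmark, baseline accuracy, BRIDGE accuracy), copied from the table.
ROWS: List[Tuple[str, float, float]] = [
    ("TRD (Overall)", 44.75, 63.22),
    ("TRD (Overall)", 51.17, 63.22),
    ("RealtimeQA", 66.52, 68.23),
    ("HotpotQA Poisoned", 46.10, 52.80),
]


def deltas(baseline: float, ours: float) -> Tuple[float, float]:
    """Return (absolute gain in points, relative gain in percent)."""
    absolute = round(ours - baseline, 2)
    relative = round(100.0 * (ours - baseline) / baseline, 1)
    return absolute, relative


for name, baseline, ours in ROWS:
    absolute, relative = deltas(baseline, ours)
    print(f"{name}: +{absolute:.2f} points ({relative:.1f}% relative)")
```

Keeping the two views separate avoids confusing a gain in accuracy points with a percentage improvement over the baseline.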

Main Takeaways

BRIDGE achieves balanced performance across all four RAG scenarios (Faithful to All, Internal, External, and Refusal), whereas baselines typically excel in only one.
The adaptive 'soft bias' mechanism successfully guides the model to trust the correct source (internal vs external) dynamically.
The refusal mechanism is significantly more effective than baselines, which rarely refuse to answer even when necessary (Refuse-to-Answer scenarios).
Hyperparameters tuned on TRD generalize well to other datasets (RealtimeQA, HotpotQA), indicating robustness.

📚 Prerequisite Knowledge

Prerequisites

Understanding of RAG (Retrieval-Augmented Generation) pipelines
Familiarity with knowledge conflicts (parametric vs. non-parametric memory)
Basic concepts of In-Context Learning (ICL) and Reinforcement Learning (GRPO)

Key Terms

RAG: Retrieval-Augmented Generation, in which a system first retrieves relevant documents and then conditions its generated answer on them

Parametric Knowledge: Internal knowledge stored within the LLM's weights during training

External Knowledge: Information retrieved from outside sources (e.g., Wikipedia) during inference

Soft Bias: A predicted probability distribution representing how much the system should rely on retrieval vs. generation for a specific question

GRPO: Group Relative Policy Optimization, a reinforcement learning algorithm used here to tune the Allocator module

BGE-m3: A specific embedding model used to compute similarity scores between text segments

ColBERT: A retrieval model that uses late interaction of token embeddings for fine-grained similarity matching

ICL: In-Context Learning, i.e., providing examples in the prompt to guide model behavior without weight updates