OpenDecoder: Open Large Language Model Decoding to Incorporate Document Quality in RAG

📝 Paper Summary

Modularized RAG pipeline

OpenDecoder modifies the LLM's internal attention mechanism during decoding by explicitly injecting external relevance scores (retriever, ranker, QPP) to downweight noisy or irrelevant retrieved documents.

Core Problem

Standard RAG models assume retrieved documents are relevant and process them using standard self-attention, which fails to distinguish between useful evidence and noise when retrieval quality varies.

Why it matters:

Retrieval systems frequently return irrelevant or noisy documents, which degrades LLM generation quality and causes hallucinations.
Existing methods rely on prompting or black-box fine-tuning, but the internal attention mechanism still treats all input tokens as potentially relevant context without explicit quality guidance.
Prompt-based filtering strategies are sensitive to templates and increase latency, while standard fine-tuning doesn't structurally change how the model attends to noise.

Concrete Example: When an LLM is asked a question but the retriever returns completely irrelevant documents, a standard RAG model might hallucinate an answer based on the noise or its internal parametric knowledge without knowing which to trust. OpenDecoder uses explicit low relevance scores to force the attention mechanism to ignore the retrieved context and rely on internal knowledge.

Key Novelty

Explicit Indicator-Guided Decoding

Injects external quality signals (retriever scores, ranker scores, query performance prediction) directly into the attention mask during generation.
Modulates the attention scores so the model structurally attends less to tokens from documents marked as low-quality by external evaluators.
Trains the model to utilize these injected scores via a robustness training curriculum that mixes relevant, partially relevant, and irrelevant documents.

Architecture

Comparison of Vanilla RAG decoding vs. OpenDecoder. Shows how OpenDecoder takes external scores (Retriever, Ranker, QPP), normalizes them, and injects them into the Attention mechanism.

Evaluation Highlights

Outperforms vanilla RAG and robust baselines (like RobustRAG and RbFT) across 5 QA benchmarks in noisy settings.
Achieves higher F1 scores in 'Extreme Noisy' settings (100% irrelevant documents) by effectively ignoring noise.
Demonstrates that combining multiple indicators (Retriever + Ranker + QPP) yields better performance than single indicators.

Breakthrough Assessment

7/10

Novel architectural modification to the attention mechanism for RAG robustness. Moves beyond simple prompting or filtering to structural integration of relevance signals. Strong empirical results on standard benchmarks.

⚙️ Technical Details

Problem Definition

Setting: Retrieval-Augmented Generation where input documents have variable relevance (noise).

Inputs: User query q, set of retrieved documents {doc_i}, and external quality scores (retriever, ranker, QPP).

Outputs: Generated answer a.

Pipeline Flow

Indicator Construction: Extract features (Retriever score, Ranker score, QPP) for retrieved documents.
Score Normalization: Normalize document scores to [0,1]; assign 1.0 to query/instruction.
Modified Decoding: LLM generates answer using modified attention where attention scores are modulated by the normalized indicator matrix.
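The normalization and score-assignment steps above can be sketched as follows. This is a minimal illustration, not the paper's implementation: min-max scaling is an assumed normalization (the summary only says scores are mapped to [0,1]), and the prompt layout, the function names `min_max_normalize` and `token_level_scores`, and the toy lengths are all hypothetical.

```python
from typing import List


def min_max_normalize(scores: List[float]) -> List[float]:
    """Rescale raw document scores to [0, 1] (min-max is an assumption)."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # All documents scored the same: treat them as equally trusted.
        return [1.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def token_level_scores(
    instruction_len: int,
    doc_lens: List[int],
    doc_scores: List[float],
    query_len: int,
) -> List[float]:
    """Build one score per input token.

    Hypothetical prompt layout: [instruction][doc_1]...[doc_k][query].
    Instruction and query tokens get 1.0; each document token inherits
    the normalized score of the document it belongs to.
    """
    normalized = min_max_normalize(doc_scores)
    scores = [1.0] * instruction_len
    for length, score in zip(doc_lens, normalized):
        scores.extend([score] * length)
    scores.extend([1.0] * query_len)
    return scores


# Toy input: 2 instruction tokens, three documents, 2 query tokens.
token_scores = token_level_scores(
    instruction_len=2,
    doc_lens=[2, 1, 2],
    doc_scores=[4.0, 2.0, 0.0],
    query_len=2,
)
```

Note that min-max scaling gives the lowest-ranked document a score of exactly 0, so any downstream step that takes a logarithm of the score would need a small floor value.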

System Modules

Indicator Constructor

Calculates quality scores for each retrieved document.

Model or implementation: Various (Retriever: E5, Ranker: LLM-based, QPP model)

Modified Generator

Generates the answer using explicit indicators to bias attention.

Model or implementation: Qwen-2.5-3B-Instruct (modified attention)

Novel Architectural Elements

Integration of external scalar indicators directly into the attention computation graph as a bias term (log(S_norm)) added to attention logits.
Token-level score matrix assignment where document tokens inherit their document's relevance score, while query/instruction tokens get score 1.0.
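To illustrate the bias term, the sketch below adds log(score) to the attention logits of a single query position before the softmax. Because exp(z + log s) = s * exp(z), this is the same as multiplying each unnormalized attention weight by its token's score and renormalizing. The sketch is pure Python for readability and omits multi-head structure and causal masking; the `eps` floor, the function names, and the toy logits are assumptions rather than details from the paper.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def biased_attention_weights(
    logits: List[float], token_scores: List[float], eps: float = 1e-6
) -> List[float]:
    """Add log(score) to each attention logit before the softmax.

    eps is an assumed floor that keeps log() finite when a score is 0.
    """
    biased = [
        logit + math.log(max(score, eps))
        for logit, score in zip(logits, token_scores)
    ]
    return softmax(biased)


# One query position attending over four key positions (toy logits).
logits = [1.0, 1.0, 1.0, 1.0]
# The last two key tokens belong to a low-scored document.
token_scores = [1.0, 1.0, 0.1, 0.1]

plain = softmax(logits)
biased = biased_attention_weights(logits, token_scores)
```

With equal logits, the unbiased weights are uniform, while the biased weights shift attention mass toward the tokens with score 1.0.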

Modeling

Base Model: Qwen-2.5-3B-Instruct

Training Method: Supervised Fine-Tuning with modified attention mechanism

Objective Functions:

Purpose: Maximize probability of ground truth answer given query, documents, and indicators.

Formally: Standard Cross-Entropy Loss on tokens.
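Written out, a token-level cross-entropy objective of this kind would typically take the form below. The notation is an assumption made for illustration: T is the answer length, a_t the t-th answer token, S_norm the normalized indicator scores, and θ the trainable parameters.

```latex
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_{\theta}\left(a_t \mid a_{<t},\, q,\, \{doc_i\},\, S_{\text{norm}}\right)
```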

Adaptation: Full fine-tuning of attention parameters (theta_open^attn) to learn to utilize indicators

Trainable Parameters: Model parameters initialized from Qwen-2.5-3B-Instruct, fine-tuned for 1 epoch.

Training Data:

Merged training sets of NQ and HotpotQA.
Robustness training data created by replacing the second half of the top-10 documents with partially relevant or irrelevant documents.
Positions of noisy documents optionally shuffled.
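The construction of a noisy training context could look roughly like this. It is a sketch under assumptions: the helper `build_noisy_context`, the noisy-document pool, and the fixed seed are hypothetical, and shuffling the whole context is only one possible reading of "positions of noisy documents optionally shuffled".

```python
import random
from typing import List, Optional


def build_noisy_context(
    top_docs: List[str],
    noisy_pool: List[str],
    shuffle: bool = False,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Keep the first half of the ranked list and replace the second half
    with documents sampled from a pool of partially relevant or irrelevant ones.
    """
    rng = rng or random.Random(0)  # fixed seed so the sketch is reproducible
    half = len(top_docs) // 2
    n_replace = len(top_docs) - half
    if len(noisy_pool) < n_replace:
        raise ValueError("noisy_pool is too small")
    replacements = rng.sample(noisy_pool, n_replace)
    context = top_docs[:half] + replacements
    if shuffle:
        # Assumed interpretation: shuffle the full list so that noisy
        # documents no longer sit in fixed tail positions.
        rng.shuffle(context)
    return context


top10 = [f"relevant_{i}" for i in range(10)]
pool = [f"noise_{i}" for i in range(20)]
context = build_noisy_context(top10, pool, shuffle=False)
shuffled = build_noisy_context(top10, pool, shuffle=True)
```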

Key Hyperparameters:

retrieved_documents_k: 10
training_epochs: 1
indicator_weights: 1.0 for Retriever, 0.5 for Ranker and QPP (inferred from the paper's textual description of score aggregation, not listed as explicit hyperparameters)
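One way the indicator weights might be applied is a weighted average over already-normalized indicator scores, sketched below. The aggregation rule itself is an assumption (the summary only gives the weights), and the names `INDICATOR_WEIGHTS` and `aggregate_indicators` as well as the toy scores are hypothetical.

```python
from typing import Dict, List

# Weights as listed in the hyperparameters above.
INDICATOR_WEIGHTS: Dict[str, float] = {"retriever": 1.0, "ranker": 0.5, "qpp": 0.5}


def aggregate_indicators(per_doc: List[Dict[str, float]]) -> List[float]:
    """Combine normalized indicator scores into one score per document.

    Assumption: a weighted average, which keeps the result in [0, 1]
    whenever every input score is in [0, 1].
    """
    total_weight = sum(INDICATOR_WEIGHTS.values())
    return [
        sum(INDICATOR_WEIGHTS[name] * doc[name] for name in INDICATOR_WEIGHTS)
        / total_weight
        for doc in per_doc
    ]


# Toy scores for two documents, each indicator already scaled to [0, 1].
docs = [
    {"retriever": 1.0, "ranker": 1.0, "qpp": 0.5},
    {"retriever": 0.0, "ranker": 0.5, "qpp": 0.5},
]
combined = aggregate_indicators(docs)
```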

Compute: Not reported in the paper

Comparison to Prior Work

vs. RobustRAG/InstructRAG: OpenDecoder modifies the attention mechanism itself rather than relying on prompt-based workflows or multi-step filtering.
vs. RbFT: Uses external signals (scores) explicitly in decoding rather than just training on noisy data with instructions.
vs. REFRAG [cited in paper]: REFRAG compresses context; OpenDecoder keeps context but modulates attention weights.

Limitations

Relies on the quality of external indicators; if indicators are wrong, attention might be misguided.
Requires modification of the model architecture (attention block), which may complicate deployment compared to standard black-box LLM API usage.
Evaluated only on Qwen-2.5-3B-Instruct; scaling to larger models is not explicitly tested in this paper.

Reproducibility

Code: https://github.com/fengranMark/OpenDecoder

Uses the Qwen-2.5-3B-Instruct backbone, the Wikipedia 2018 dump, and the E5 retriever.

📊 Experiments & Results

Evaluation Setup

Open-domain QA and Multi-hop QA under Normal, Noisy, and Extreme Noisy retrieval conditions.

Benchmarks:

Natural Questions (NQ) (General QA)
TriviaQA (General QA)
PopQA (General QA, long-tail)
HotpotQA (Multi-hop QA)
2WikiMultiHopQA (Multi-hop QA)

Metrics:

Exact Match (EM)
F1 score
Statistical methodology: Not explicitly reported in the paper

Key Results

No per-benchmark numbers (Benchmark, Metric, Baseline, This Paper, Δ) can be listed here: the provided paper text describes the results as outperforming baselines but does not include the result tables with numeric values. The text states 'Our OpenDecoder outperforms vanilla RAG and other strong baselines... consistently'. Specific numbers are referenced as being in the experimental section, but the table content itself is not in the provided snippet.

Main Takeaways

OpenDecoder consistently improves robustness against noisy retrieval compared to vanilla RAG and robust baselines.
The method is effective in 'Extreme Noisy' settings where all retrieved documents are irrelevant, likely by allowing the model to revert to parametric knowledge.
Combining multiple indicators (Retriever + Ranker + QPP) provides better guidance than using single indicators alone.

📚 Prerequisite Knowledge

Prerequisites

Transformer attention mechanism (Self-Attention)
Retrieval-Augmented Generation (RAG)
Supervised Fine-Tuning (SFT)

Key Terms

QPP: Query Performance Prediction, which estimates the difficulty of a query or the quality of retrieval results before generation.

OpenDecoder: The proposed method that modifies the attention mechanism to incorporate explicit relevance indicators.

RbFT: Robustness Fine-Tuning, a baseline method that trains models to detect defects and extract utility via instructions.

Attention Mask: A matrix used in Transformers to prevent the model from attending to certain positions; here modified to weight attention by relevance.

Vanilla RAG: Standard RAG where retrieved documents are simply concatenated into the prompt without structural modification.

Exact Match (EM): Evaluation metric measuring the percentage of predictions that match the ground truth answer exactly.

F1 score: Evaluation metric measuring the overlap between the prediction and ground truth tokens.
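For reference, EM and token-level F1 are commonly computed along the following lines in open-domain QA. This is a sketch of the widely used convention, not the paper's evaluation script: the answer normalization (lowercasing, removing punctuation and English articles) is an assumption, and benchmarks with several gold answers usually take the maximum score over them, which is omitted here.

```python
import re
import string
from collections import Counter


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, ground_truth: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(ground_truth))


def f1_score(prediction: str, ground_truth: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(ground_truth).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

For example, the prediction "Paris, France" against the gold answer "Paris" is not an exact match, but it still earns partial F1 credit because one of its two tokens overlaps.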