CUB: Benchmarking Context Utilisation Techniques for Language Models

📝 Paper Summary

RAG robustness, Context utilisation

CUB is a comprehensive benchmark evaluating seven context manipulation techniques across nine language models to diagnose how well they handle relevant, conflicting, and irrelevant information in retrieval-augmented generation.

Core Problem

Language models in RAG systems often ignore relevant information that conflicts with their internal memory or get distracted by irrelevant contexts, and existing mitigation techniques are evaluated in isolation on narrow tasks.

Why it matters:

Real-world retrieval systems are imperfect and often return irrelevant data, distracting the model from the correct answer
Information changes over time (e.g., a new Prime Minister), requiring models to prioritize retrieved context over outdated internal memory
Current evaluations are fragmented, making it unclear which techniques work across diverse scenarios like gold, conflicting, and irrelevant contexts

Concrete Example: When asked a question whose answer has changed (e.g., 'Who is the CEO of X?'), a model might ignore the retrieved document naming the new CEO (conflicting context) and answer with the former CEO from its training data. Conversely, if the retrieval system returns a document about a different company (irrelevant context), the model might mistakenly use that information instead of its correct internal knowledge.

Key Novelty

Context Utilisation Benchmark (CUB)

First unified benchmark explicitly designed to diagnose Context Utilisation Manipulation Techniques (CMTs) across three distinct context types: gold (relevant), conflicting (contradicts memory), and irrelevant (noise)
Systematic evaluation of 7 diverse CMTs (prompting, fine-tuning, decoding, mechanistic) across 9 LMs to reveal trade-offs between robustness and faithfulness

Architecture

Overview of the CUB benchmark framework showing the input types, diverse LMs, and the 7 CMTs being evaluated.

Evaluation Highlights

No single technique excels everywhere: PH3+context improves faithfulness to conflicting contexts but degrades performance on irrelevant contexts compared to regular decoding
Regular model performance on the CounterFact dataset decreases as model size increases, the opposite of the scaling trend observed on NQ and DRUID
Prompting-based methods and Multi-agent approaches show the most stable performance across all context types, avoiding the dramatic failure modes of more complex interventions

Breakthrough Assessment

8/10

Establishes a much-needed benchmark for RAG robustness. The finding that no current method handles both conflicting and irrelevant contexts well exposes a major gap in the field.

⚙️ Technical Details

Problem Definition

Setting: Retrieval-Augmented Generation (RAG) where a generator receives a query and context, which may be relevant, conflicting, or irrelevant

Inputs: Query q and retrieved context C (which can be gold, conflicting, or irrelevant)

Outputs: Generated answer token or probability distribution over tokens

Pipeline Flow

Input Selection (Gold/Conflicting/Irrelevant)
CMT Application (Prompting/Decoding/Fine-tuning/Intervention)
Model Inference (9 different LMs)
Metric Calculation (BCU)
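
The four stages above can be sketched as a small loop. This is a minimal illustration under stated assumptions, not the benchmark's code: `Example`, `prompting_cmt`, `toy_model`, and `run_pipeline` are hypothetical names, the prompt wording is invented, and the toy model stands in for a real LM call.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Example:
    query: str
    context: str
    context_type: str    # "gold", "conflicting", or "irrelevant"
    context_answer: str  # answer supported by the context ("" if irrelevant)
    memory_answer: str   # answer the model gives without any context


def prompting_cmt(query: str, context: str) -> str:
    """Hypothetical prompting CMT: wrap the input in an instruction."""
    return (
        "Use the context only if it is relevant to the question.\n"
        f"Context: {context}\nQuestion: {query}\nAnswer:"
    )


def toy_model(prompt: str) -> str:
    """Stand-in for LM inference; it always answers from 'memory'."""
    return "Paris"


def run_pipeline(
    examples: List[Example],
    cmt: Callable[[str, str], str],
    model: Callable[[str], str],
) -> Dict[str, float]:
    """Return the mean binary score per context type."""
    per_type: Dict[str, List[int]] = defaultdict(list)
    for ex in examples:                          # 1. input selection
        prompt = cmt(ex.query, ex.context)       # 2. CMT application
        prediction = model(prompt).strip()       # 3. model inference
        # 4. metric: context answer is the target unless the context is irrelevant
        target = ex.memory_answer if ex.context_type == "irrelevant" else ex.context_answer
        per_type[ex.context_type].append(int(prediction == target))
    return {ctx: sum(scores) / len(scores) for ctx, scores in per_type.items()}


# Made-up examples for illustration only.
EXAMPLES = [
    Example("What is the capital of France?", "Paris is the capital of France.",
            "gold", "Paris", "Paris"),
    Example("What is the capital of France?", "Lyon is the capital of France.",
            "conflicting", "Lyon", "Paris"),
    Example("What is the capital of France?", "Bananas are rich in potassium.",
            "irrelevant", "", "Paris"),
]
RESULT = run_pipeline(EXAMPLES, prompting_cmt, toy_model)
```

Because the toy model always answers from memory, it scores well on gold and irrelevant contexts and fails on the conflicting one, which is the kind of per-context-type breakdown the benchmark is built to expose.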

System Modules

Input Curator

Selects query and context pair based on type: Gold (supports ground truth), Conflicting (supports counterfactual), or Irrelevant (random/reranked noise)

Model or implementation: Script-based selection from CounterFact, NQ, DRUID

CMT Applicator

Applies the specific manipulation technique (e.g., modifying logits, modifying prompt, suppressing attention heads)

Model or implementation: Various (Prompting, Fine-tuning adapter, PH3 intervention, ACD decoding)

Evaluator

Calculates Binary Context Utilisation (BCU) score

Model or implementation: Deterministic scoring

Novel Architectural Elements

Benchmarking framework integrating 3 distinct context types (Gold, Conflicting, Irrelevant) to diagnose specific failure modes of RAG systems

Modeling

Base Model: 9 LMs: GPT-2 XL, Pythia (6.9B), Qwen2.5 (1.5B, 7B, 32B - Base & Instruct), Cohere Command A (111B)

Training Method: Supervised Fine-Tuning (for the 'Fine-tuning' CMT only)

Objective Functions:

Purpose: Teach model to use relevant context and ignore irrelevant context.

Formally: Standard Cross-Entropy Loss on a dataset containing relevant, irrelevant, empty, and conflicting contexts.
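
Written out, a standard token-level cross-entropy objective takes the form below. The notation is assumed for illustration and does not come from the paper: D is the fine-tuning set, q the query, C the context (possibly empty), y the target answer tokens, and θ the model parameters.

```latex
\mathcal{L}(\theta) = -\frac{1}{|\mathcal{D}|} \sum_{(q, C, y) \in \mathcal{D}} \sum_{t=1}^{|y|} \log p_\theta\left(y_t \mid q, C, y_{<t}\right)
```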

Adaptation: Not confirmed. LoRA is assumed here based on common practice for models of this size; the paper specifies only the 'fine-tuning approach of Li et al.'

Training Data:

Validation sets of CounterFact, NQ, DRUID used for hyperparameter tuning
Fine-tuning data follows Li et al. (2023) approach with 4 context types

Key Hyperparameters:

inference_decoding: Greedy decoding (implied by deterministic BCU scoring)

Compute: Inference on models ranging from 1.5B to 111B parameters. The Fine-tuning CMT additionally requires training compute, which is costly relative to the inference-only CMTs.

Comparison to Prior Work

vs. RAG-Bench: CUB evaluates conflicting knowledge in addition to noise
vs. KILT: CUB isolates the generator's context usage from the retriever's performance
vs. AxBench: CUB specifically targets RAG scenarios (gold/conflict/irrelevant) rather than general steering
vs. Self-RAG [not cited in paper]: Self-RAG trains specific tokens for critique; CUB evaluates existing methods like Multi-agent and decoding interventions without architecture changes

Limitations

No single CMT solves all context types; trade-off between conflicting and irrelevant context handling
CounterFact dataset shows inverse scaling (larger models perform worse), suggesting that its synthetic construction distorts results
Main analysis is limited to the binary outcome metric (BCU), though the continuous metric (CCU) is provided in the appendix
Reliance on synthesized data for conflicting contexts in NQ (substitution approach)

Reproducibility

Code: https://github.com/copenlu/cub-counterfact

Code available (GitHub link in paper). Datasets (copenlu/cub-*) on Hugging Face. Pre-defined hyperparameter search method provided. Prompts (12 per dataset) generated by experts and LLMs. Cohere Command A is a closed API model.

📊 Experiments & Results

Evaluation Setup

Controlled RAG simulation with Gold, Conflicting, and Irrelevant contexts

Benchmarks:

CounterFact (Diagnostic, Knowledge Conflict)
Natural Questions (NQ) (Open-domain QA)
DRUID (Automated Fact-Checking)

Metrics:

Binary Context Utilisation (BCU)
Continuous Context Utilisation (CCU)
Delta BCU (Improvement over Regular)
Statistical methodology: Spearman's rho for feature correlation analysis
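
Delta BCU in the list above is the gain of a CMT over the Regular baseline. A minimal sketch, assuming both are scored on the same examples; the function name `delta_bcu` is hypothetical and the per-example scores are made up for illustration, not taken from the paper.

```python
from statistics import mean
from typing import Sequence


def delta_bcu(cmt_scores: Sequence[int], regular_scores: Sequence[int]) -> float:
    """Mean BCU under a CMT minus mean BCU under Regular decoding."""
    if len(cmt_scores) != len(regular_scores):
        raise ValueError("score lists must cover the same examples")
    return mean(cmt_scores) - mean(regular_scores)


# Illustrative per-example BCU scores (not results from the paper).
regular = [1, 0, 0, 1]
with_cmt = [1, 1, 0, 1]
gain = delta_bcu(with_cmt, regular)  # positive means the CMT helps
```

A negative value means the CMT does worse than Regular decoding on that slice.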

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Performance of models on conflicting contexts in the CounterFact dataset highlights how simple interventions can force perfect utilization in synthetic settings.
CounterFact	BCU (Conflicting)	0.38	1.00	+0.62
Analysis of trade-offs shows that methods improving conflicting context usage often hurt irrelevant context robustness.
Aggregated (All Datasets)	Delta BCU	0.00	0.15	+0.15
Aggregated (All Datasets)	Delta BCU	0.00	-0.05	-0.05
Correlation analysis reveals what factors influence context utilization.
CounterFact	Spearman's rho (Regular)	0.00	-0.44	-0.44

Experiment Figures

Radar charts or bar charts showing the BCU scores of different CMTs across the three context types for NQ and CounterFact.

Net gain (Delta) of each CMT compared to Regular decoding, aggregated across models.

Main Takeaways

Most CMTs show inflated performance on synthetic datasets (CounterFact) compared to realistic ones (NQ, DRUID), often reaching 100% on the former while struggling on the latter.
A fundamental trade-off exists: CMTs like PH3+context excel at conflicting contexts but fail at irrelevant ones, while ACD handles irrelevant ones well but fails at conflicts.
Prompting and Multi-agent approaches are the most robust, providing stable improvements across context types without the severe degradations seen in decoding or mechanistic methods.
Instruction tuning negatively correlates with performance on conflicting CounterFact contexts, likely because tuned models are more critical of the 'fake' synthetic facts.

📚 Prerequisite Knowledge

Prerequisites

Retrieval-Augmented Generation (RAG)
Language Model decoding strategies (logits, entropy)
Mechanistic interpretability (attention heads)

Key Terms

CMT: Context Utilisation Manipulation Technique—any method (prompting, fine-tuning, decoding, etc.) designed to control how an LM uses provided context

Context Utilisation: The extent to which an LM relies on external context versus its internal parametric memory to generate an answer

Gold context: Retrieved information that is relevant and factually correct

Conflicting context: Retrieved information that is relevant but contradicts the LM's pre-existing internal memory (e.g., updated facts)

Irrelevant context: Retrieved information that provides no help in answering the query and acts as a distractor

BCU: Binary Context Utilisation, a metric that scores 1 if the model outputs the context-supported answer (for gold and conflicting contexts) or the memory-supported answer (for irrelevant contexts), and 0 otherwise
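
The BCU definition can be sketched for a single example as follows. This is an illustration, not the benchmark's scorer: the function names are hypothetical, and the case and whitespace normalisation is an assumption added so that trivially different strings still match.

```python
def _normalise(text: str) -> str:
    """Assumed normalisation: lowercase and collapse whitespace."""
    return " ".join(text.lower().split())


def bcu(prediction: str, context_type: str, context_answer: str, memory_answer: str) -> int:
    """Binary Context Utilisation for a single example."""
    if context_type in ("gold", "conflicting"):
        target = context_answer   # the model should follow the context
    elif context_type == "irrelevant":
        target = memory_answer    # the model should fall back on memory
    else:
        raise ValueError(f"unknown context type: {context_type!r}")
    return int(_normalise(prediction) == _normalise(target))
```

For instance, `bcu("Lyon", "conflicting", "Lyon", "Paris")` scores 1 because the model followed the context, while `bcu("Paris", "conflicting", "Lyon", "Paris")` scores 0.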

Parametric memory: Knowledge stored within the model's pre-trained weights

CounterFact: A diagnostic dataset containing synthesized facts designed to conflict with an LM's internal knowledge

LAMA: Language Model Analysis—a dataset of facts used to probe what LMs know

Mechanistic intervention: Modifying internal model components (like attention heads) during inference to steer behavior
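
As a toy illustration of suppressing attention heads, the sketch below zeroes the outputs of selected heads before they are concatenated. Everything here is assumed for illustration: the function names are hypothetical, head outputs are plain lists rather than tensors, and the choice of which heads to suppress (the hard part of any such method) is taken as given.

```python
from typing import List, Set

Vector = List[float]


def suppress_heads(head_outputs: List[Vector], suppressed: Set[int]) -> List[Vector]:
    """Return per-head outputs with the selected heads zeroed out."""
    return [
        [0.0] * len(out) if idx in suppressed else list(out)
        for idx, out in enumerate(head_outputs)
    ]


def concat_heads(head_outputs: List[Vector]) -> Vector:
    """Concatenate per-head outputs, as done before the output projection."""
    return [value for out in head_outputs for value in out]


# Three toy heads with two dimensions each; head 1 is suppressed.
heads = [[0.2, -0.1], [0.5, 0.4], [0.0, 0.3]]
steered = concat_heads(suppress_heads(heads, {1}))
```

In a real model the same effect would be applied inside the forward pass, typically through a hook on the attention module.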

Contrastive decoding: Adjusting the probability of the next token by comparing the logits of the model with context vs. without context
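
One common way to realise this comparison is to amplify the with-context logits and subtract the without-context logits before the softmax. The sketch below uses that fixed-weight form as an assumption; `alpha` is a hypothetical hyperparameter, and adaptive variants that set the weight per step are not shown.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def contrastive_adjust(
    logits_with_ctx: List[float],
    logits_without_ctx: List[float],
    alpha: float = 0.5,
) -> List[float]:
    """Next-token probabilities from (1 + alpha) * with - alpha * without."""
    if len(logits_with_ctx) != len(logits_without_ctx):
        raise ValueError("both logit vectors must cover the same vocabulary")
    adjusted = [
        (1 + alpha) * w - alpha * wo
        for w, wo in zip(logits_with_ctx, logits_without_ctx)
    ]
    return softmax(adjusted)


# Toy two-token vocabulary: index 0 = memory answer, index 1 = context answer.
with_ctx = [1.0, 2.0]     # the context nudges the model toward token 1
without_ctx = [3.0, 0.0]  # memory alone strongly prefers token 0
plain = softmax(with_ctx)
contrasted = contrastive_adjust(with_ctx, without_ctx, alpha=0.5)
```

With `alpha=0` the function reduces to ordinary decoding on the with-context logits; a larger `alpha` shifts more probability toward tokens that the context, rather than memory, supports.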