Provence: efficient and robust context pruning for retrieval-augmented generation

📝 Paper Summary

Modularized RAG pipeline

Provence formulates context pruning as a sequence labeling task and unifies it with reranking in a single model, so that irrelevant context sentences are removed at no additional inference cost beyond the reranking pass.

Core Problem

RAG systems suffer from computational overhead due to long contexts and hallucinations caused by irrelevant retrieved information, yet existing pruners are often inefficient, lack robustness across domains, or require fixed compression ratios.

Why it matters:

Processing long contexts in LLMs increases latency and cost
Irrelevant information in retrieved contexts can propagate into generated answers (hallucinations)
Existing pruners often require heavy LLMs or separate inference steps, adding latency rather than reducing it

Concrete Example: A retriever fetches 5 documents, but only 2 sentences in the 3rd document are relevant. Standard RAG feeds all 5 documents to the generator. Extractive baselines like RECOMP might pick a fixed top-k sentences, missing relevant info if k is too low or including noise if too high. Provence dynamically selects only the relevant sentences via a learned mask.

Key Novelty

Unified Reranking and Context Pruning (Provence)

Formulates context pruning as a binary sequence labeling task (predicting a keep/discard mask per token) rather than autoregressive generation or ranking
Unifies pruning with reranking by adding a pruning head to a cross-encoder reranker, allowing both tasks to share the same forward pass (zero-cost pruning)
Trains on silver labels generated by a powerful LLM (Llama-3) that is instructed to answer questions while citing relevant sentences
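
The masking idea above can be sketched in a few lines of Python. This is a minimal illustration, not the released implementation: the function name prune_context, the toy probabilities, and in particular the majority-vote rule for turning token-level predictions into sentence-level decisions are assumptions made for the example.

```python
def prune_context(sentences, token_probs, threshold=0.5):
    """Keep sentences in which most tokens are predicted as relevant.

    sentences:   list of sentences, each a list of token strings
    token_probs: per-token keep probabilities, same nested shape
    threshold:   probability above which a token counts as "keep"
    """
    kept = []
    for tokens, probs in zip(sentences, token_probs):
        if len(tokens) != len(probs):
            raise ValueError("each token needs exactly one probability")
        mask = [p > threshold for p in probs]  # binary keep/discard mask
        # Assumed aggregation rule: keep the sentence on a strict majority vote.
        if sum(mask) * 2 > len(mask):
            kept.append(" ".join(tokens))
    return kept


# Toy inputs (hypothetical values, not model outputs).
sentences = [
    ["Paris", "is", "the", "capital", "of", "France", "."],
    ["The", "weather", "was", "mild", "."],
]
token_probs = [
    [0.9, 0.8, 0.7, 0.9, 0.6, 0.95, 0.4],
    [0.2, 0.3, 0.1, 0.05, 0.1],
]
pruned = prune_context(sentences, token_probs, threshold=0.5)
print(pruned)
```

Because the number of kept sentences follows from the mask rather than from a fixed k, the amount of retained context adapts to each query and passage.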

Architecture

Overview of the Provence training and inference pipeline, contrasting the standalone pruner with the unified reranker+pruner.

Evaluation Highlights

Achieves negligible performance drop (or improvement) at ~50-80% compression rates across diverse domains (BioASQ, SyllabusQA, etc.)
Outperforms LLMLingua2 and RECOMP baselines on the trade-off between compression and QA performance
Unified model matches the reranking performance of the baseline DeBERTa-v3 reranker while adding pruning capabilities

Breakthrough Assessment

8/10

Highly practical contribution. Unifying pruning and reranking removes the efficiency bottleneck of a separate pruning module. Strong empirical results across diverse domains make it immediately deployable.

⚙️ Technical Details

Problem Definition

Setting: Open-domain Question Answering with Retrieval

Inputs: A query q and a set of retrieved passages P

Outputs: A scalar relevance score for reranking and a binary mask selecting relevant tokens for pruning

Pipeline Flow

Retriever (SPLADE-v3)
Unified Reranker & Pruner (Provence)
Generator (LLaMA-2-7B-chat)
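
A rough sketch of how the middle stage could hand context to the generator, assuming the reranker/pruner has already produced one score per passage and one keep flag per sentence. The function rerank_and_prune, its top_k parameter, and the toy data are hypothetical and only illustrate the data flow.

```python
def rerank_and_prune(passages, scores, keep_masks, top_k):
    """Order passages by score, then drop the sentences flagged as irrelevant.

    passages:   list of passages, each a list of sentence strings
    scores:     one relevance score per passage (higher is better)
    keep_masks: one list of booleans per passage, one flag per sentence
    top_k:      number of top-scored passages to consider
    """
    order = sorted(range(len(passages)), key=lambda i: scores[i], reverse=True)
    context = []
    for i in order[:top_k]:
        kept = [s for s, keep in zip(passages[i], keep_masks[i]) if keep]
        if kept:  # a fully pruned passage contributes nothing to the prompt
            context.append(" ".join(kept))
    return context


# Toy inputs (hypothetical values).
passages = [["A1.", "A2."], ["B1."], ["C1.", "C2."]]
scores = [0.2, 0.9, 0.5]
keep_masks = [[True, False], [False], [True, True]]
context = rerank_and_prune(passages, scores, keep_masks, top_k=2)
print(context)
```

The returned list is what would be concatenated into the generator prompt in place of the full passages.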

System Modules

Retriever (Retrieval & Selection)

Fetch top-5 candidate passages from the datastore

Model or implementation: SPLADE-v3

Unified Reranker & Pruner (Provence) (Retrieval & Selection)

Simultaneously re-order passages and predict a binary mask to prune irrelevant tokens

Model or implementation: DeBERTa-v3 (fine-tuned)

Generator

Generate the final answer using the pruned context

Model or implementation: LLaMA-2-7B-chat

Novel Architectural Elements

Dual-head architecture on a cross-encoder: one head for ranking (scalar score) and one for pruning (sequence labeling) sharing the same encoder backbone
Integration of pruning directly into the reranking step, eliminating the need for a separate pruning inference pass
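
The dual-head idea can be illustrated with a toy stand-in that uses no deep learning library. The sketch below assumes the encoder has already produced one hidden vector per input token; the class DualHead, its random weights, and the choice to read the ranking score from the first position are illustrative assumptions, not the released model code.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class DualHead:
    """Two linear heads reading the same encoder output (toy stand-in)."""

    def __init__(self, hidden_size, seed=0):
        rng = random.Random(seed)
        self.w_rank = [rng.uniform(-1, 1) for _ in range(hidden_size)]
        self.w_prune = [rng.uniform(-1, 1) for _ in range(hidden_size)]

    def forward(self, hidden_states):
        # hidden_states: one vector per token from a single shared encoder pass.
        rank_score = dot(self.w_rank, hidden_states[0])  # scalar from position 0
        keep_probs = [sigmoid(dot(self.w_prune, h)) for h in hidden_states]
        return rank_score, keep_probs


# Random vectors stand in for real encoder outputs (6 tokens, hidden size 4).
rng = random.Random(1)
hidden_states = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(6)]
model = DualHead(hidden_size=4)
rank_score, keep_probs = model.forward(hidden_states)
print(rank_score, keep_probs)
```

Both outputs come from the same hidden states, which is why the pruning prediction adds only a small linear layer on top of the work the reranker already does.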

Modeling

Base Model: DeBERTa-v3 (base and large variants)

Training Method: Multi-task Fine-tuning (Distillation for ranking, Supervised Learning for pruning)

Objective Functions:

Purpose: Train the model to distinguish relevant tokens.

Formally: Binary Cross Entropy loss L_pruning over token labels y_n and predictions z_n

Purpose: Maintain reranking capability by distilling scores from the original reranker.

Formally: Mean Squared Error loss L_ranking between teacher score s_n and student score z_{n,0}

Purpose: Combine objectives.

Formally: L = L_pruning + lambda * L_ranking
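
A worked sketch of the combined objective in plain Python. It assumes the pruning predictions have already been passed through a sigmoid (so they are probabilities) and that both losses are averaged over their inputs; the function names and the toy numbers are illustrative, and the default lambda of 0.05 is taken from the hyperparameters listed in this summary.

```python
import math


def bce(labels, probs, eps=1e-12):
    """Mean binary cross-entropy over token labels and keep probabilities."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)


def mse(teacher_scores, student_scores):
    """Mean squared error between teacher and student ranking scores."""
    diffs = [(s - z) ** 2 for s, z in zip(teacher_scores, student_scores)]
    return sum(diffs) / len(diffs)


def combined_loss(labels, probs, teacher_scores, student_scores, lam=0.05):
    """L = L_pruning + lambda * L_ranking."""
    return bce(labels, probs) + lam * mse(teacher_scores, student_scores)


# Toy values: two tokens with uninformative predictions, one passage score.
loss = combined_loss(
    labels=[1, 0],
    probs=[0.5, 0.5],
    teacher_scores=[2.0],
    student_scores=[1.0],
)
print(loss)
```

With these toy values the pruning term equals ln 2 and the ranking term equals 1, so the small lambda keeps the ranking loss acting as a regularizer rather than the dominant signal.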

Adaptation: Fine-tuning of the full encoder and task heads

Trainable Parameters: All parameters of DeBERTa-v3

Training Data:

MS MARCO train set (370k queries)
Silver labels generated by prompting Llama-3-8B-Instruct to answer questions and cite relevant sentences
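
One way such cited answers could be turned into sentence-level labels is sketched below. The bracketed citation format ([1], [2], ...) and the 1-based sentence numbering are assumptions made for illustration; the actual prompt and output format are described in the paper's appendix and may differ.

```python
import re


def citations_to_labels(answer, num_sentences):
    """Turn bracketed citations such as [2] in an answer into 0/1 sentence labels."""
    cited = {int(m) for m in re.findall(r"\[(\d+)\]", answer)}
    # Sentences are assumed to be numbered from 1; out-of-range citations are ignored.
    return [1 if i in cited else 0 for i in range(1, num_sentences + 1)]


# Hypothetical LLM answer citing sentences 2 and 4 (and an invalid index 9).
answer = "The capital of France is Paris [2]. It lies on the Seine [4][9]."
labels = citations_to_labels(answer, num_sentences=5)
print(labels)
```

Sentence labels obtained this way would then be broadcast to every token of the sentence to supervise the token-level pruning head.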

Key Hyperparameters:

learning_rate: 3e-6
batch_size: 48
epochs: 1
reranking_regularization_lambda: 0.05
pruning_threshold_T: 0.1 or 0.5 (inference)

Comparison to Prior Work

vs. RECOMP (Extractive): Provence uses cross-encoding (richer interaction) and dynamic selection (mask) instead of fixed top-k
vs. LLMLingua: Provence is an encoder-only model (faster) and removes semantic units (sentences) rather than just tokens, trained specifically for relevance
vs. DSLR: Provence can select groups of sentences and is unified with reranking training, whereas DSLR processes sentences independently [cited as concurrent]

Limitations

Reliance on silver labels generated by Llama-3 means the pruner inherits the biases or errors of the teacher LLM
Requires a threshold hyperparameter T at inference (though shown to be robust)
Performance drops at the very beginning/end of contexts in needle-in-haystack tests (positional bias)

Reproducibility

Code: https://huggingface.co/naver/provence-reranker-debertav3-v1

Publicly available: Trained model released on HuggingFace. Paper describes silver label generation prompt (Appendix). Missing: Explicit training time/GPU hours.

📊 Experiments & Results

Evaluation Setup

RAG pipeline with SPLADE-v3 retriever and LLaMA-2-7B-chat generator

Benchmarks:

Natural Questions (NQ) (Open-domain QA)
HotpotQA (Multi-hop reasoning QA)
BioASQ (Biomedical QA)
PopQA (Long-tail QA)
SyllabusQA (Course logistics QA)

Metrics:

LLM-eval (correctness)
Compression rate (%)
Inference latency / MFLOPS
Statistical methodology: Not explicitly reported in the paper
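
For orientation, the compression rate can be read as the share of the retrieved context that pruning removes. The sketch below assumes length is measured in whitespace-separated words; the paper may count tokens or characters instead, so treat the unit as an assumption.

```python
def compression_rate(original_context, pruned_context):
    """Percentage of whitespace-separated words removed by pruning."""
    original_len = len(original_context.split())
    if original_len == 0:
        return 0.0
    pruned_len = len(pruned_context.split())
    return 100.0 * (1.0 - pruned_len / original_len)


# Toy example: 10 words in, 3 words out.
original = "one two three four five six seven eight nine ten"
pruned_text = "two five nine"
rate = compression_rate(original, pruned_text)
print(rate)
```

Under this reading, a higher rate means a shorter prompt for the generator, and 0% means nothing was pruned.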

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Main comparison showing Provence achieves high compression with minimal performance loss compared to baselines on Natural Questions.
Natural Questions	LLM-eval Score	0.48	0.48	0.00
Natural Questions	LLM-eval Score	0.43	0.48	+0.05
Efficiency analysis shows substantial speedups due to compression.
Inference Latency	Speed-up	1.0	2.13	+1.13
Computational Cost	MFLOPS	175000	900	-174100
Reranking performance is preserved despite adding the pruning objective.
MS MARCO Dev	MRR@10	44.1	44.0	-0.1
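
For readers unfamiliar with the reranking metric in the last row, MRR@10 averages, over queries, the reciprocal rank of the first relevant passage within the top 10 results (0 if none appears). The sketch below uses the standard definition; the function name and toy data are illustrative, and the table's values appear to be reported on a 0-100 scale.

```python
def mrr_at_k(ranked_relevance, k=10):
    """Mean reciprocal rank of the first relevant hit within the top k.

    ranked_relevance: one list per query of 0/1 relevance flags in ranked order.
    """
    total = 0.0
    for flags in ranked_relevance:
        for rank, rel in enumerate(flags[:k], start=1):
            if rel:
                total += 1.0 / rank
                break  # only the first relevant passage counts
    return total / len(ranked_relevance)


# Three toy queries: first hit at rank 2, at rank 1, and no hit at all.
runs = [[0, 1, 0], [1, 0, 0], [0, 0, 0]]
score = mrr_at_k(runs, k=10)
print(score)
```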

Experiment Figures

Pareto frontiers (Scatter plots) of LLM-eval performance vs. Compression Rate for multiple datasets (NQ, HotpotQA, BioASQ, etc.).

Analysis of Provence's robustness: (Left) Needle-in-haystack position accuracy, (Middle) Predicted sentence count vs. Oracle, (Right) Performance across context granularities.

Main Takeaways

Provence enables context pruning with negligible to no drop in performance across various domains (BioASQ, PopQA, HotpotQA) at ~50-80% compression rates.
The unified model effectively transfers knowledge between reranking and pruning, preserving reranking quality while adding pruning at zero additional inference cost.
Robust to context length and to the position of relevant information (demonstrated via needle-in-haystack analysis), though accuracy drops slightly when the relevant sentence sits at the very beginning or end of the context.
Outperforms specialized baselines (RECOMP, LLMLingua2) in the compression-performance trade-off.

📚 Prerequisite Knowledge

Prerequisites

Retrieval-Augmented Generation (RAG) pipeline components (Retriever, Reranker, Generator)
Transformer-based Cross-Encoders (like DeBERTa)
Sequence labeling vs. Autoregressive generation

Key Terms

context pruning: Removing irrelevant parts (sentences or tokens) from retrieved documents before feeding them to the LLM generator

cross-encoder: A model that processes the query and document together (concatenated) to capture deep interactions, usually for reranking

sequence labeling: A task where the model predicts a label (e.g., keep/discard) for every token in the input sequence

silver labels: Training labels generated automatically by another model (here, Llama-3) rather than human annotation

DeBERTa: Decoding-enhanced BERT with disentangled attention—a specific transformer architecture used here as the backbone encoder

SPLADE: Sparse Lexical and Expansion Model—a sparse retrieval method used to fetch initial candidates

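MFLOPS: Millions of floating-point operations, used here as a measure of computational cost. The acronym conventionally denotes a per-second throughput rate, but in this summary it is reported as an operation count.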