Single-Pass Document Scanning for Question Answering

📝 Paper Summary

Modularized RAG pipeline, Long-context retrieval

The Single-Pass Scanner uses a state-space model to process entire documents in linear time, identifying relevant sentences by conditioning on the full preceding context rather than isolated chunks.

Core Problem

Processing extremely large documents for QA is difficult: chunk-based embeddings lose global context, while full-context transformers suffer from prohibitive quadratic costs.

Why it matters:

Standard RAG splits documents into short chunks, losing connections between distant parts of the text necessary for answering complex questions
Full-context LLMs like GPT-4o are too expensive and slow to process hundreds of thousands of tokens for every query

Concrete Example: Chunk-based methods might retrieve a passage mentioning a character's action but miss the motivation explained 200 pages earlier. The Single-Pass Scanner reads the whole book at once to link these dependencies.

Key Novelty

Linear-Time Full-Context Scanning via State-Space Models

Adapts the Mamba-2 architecture to scan a concatenated query and document in a single pass, maintaining a running hidden state of the entire context
Replaces the language modeling head with a binary classification head that scores every sentence's relevance based on all tokens that came before it
Introduces a 'link-based' synthetic data generation method that creates training questions requiring information from two distant, thematically linked document chunks
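
The "running hidden state" idea behind linear-time scanning can be illustrated with a toy recurrence. This is a minimal sketch, not Mamba-2 itself: the scalar state, the fixed `decay` and `gain` values, and the function name `linear_scan` are all assumptions made for illustration. It only shows why one update per token gives cost that grows linearly with sequence length.

```python
from typing import List


def linear_scan(inputs: List[float], decay: float = 0.9, gain: float = 0.1) -> List[float]:
    """Toy scalar state-space recurrence: h_t = decay * h_{t-1} + gain * x_t.

    Hypothetical illustration only; Mamba-2 uses learned, input-dependent
    parameters and a vector-valued state.
    """
    state = 0.0
    states = []
    for x in inputs:
        # One constant-cost update per token, so total work is O(len(inputs)).
        state = decay * state + gain * x
        states.append(state)
    return states


# Each state summarizes every input seen so far, not just the current one.
states = linear_scan([1.0, 0.0, 0.0, 2.0])
```

The last state still carries a decayed trace of the first input, which is the property the scanner relies on when it scores a late sentence using everything that preceded it.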

Architecture

Illustration of the Single-Pass Scanner processing a long document. It scans the concatenated Query + Document in one pass and assigns a relevance score to each sentence.

Evaluation Highlights

Outperforms state-of-the-art embedding models (NV-Embed-v2-7B, Stella-1.5B) across 41 long-document QA benchmarks while using fewer FLOPs
Achieves performance comparable to GPT-4o on documents >256k tokens while retrieving only ~1,600 tokens (50 sentences)
Generalizes well beyond its 10k-token training length, remaining effective on documents of up to 256k tokens

Breakthrough Assessment

8/10

Strong empirical results: the scanner outperforms top models on the MTEB leaderboard while using a faster, linear-complexity architecture. Its ability to generalize from a 10k-token training context to 256k-token test documents is particularly impressive.

⚙️ Technical Details

Problem Definition

Setting: Long-document Question Answering via relevance classification

Inputs: Query Q and long document D

Outputs: Top-k relevant sentences from D to be fed into a generator

Pipeline Flow

Input Processing: Concatenate Query + Document
Scanning: Mamba-2 backbone processes sequence
Selection: Classification head scores each sentence
Generation: Top-k sentences passed to LLM
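
The four pipeline steps above can be sketched end to end as follows. This is a sketch under stated assumptions: `toy_scorer` is a hypothetical word-overlap placeholder standing in for the Mamba-2 scanner (it does not condition on preceding context), and `select_top_k` and `answer_context` are illustrative names, not functions from the released code. Returning the selected sentences in document order is also an assumption.

```python
from typing import Callable, List


def toy_scorer(query: str, sentences: List[str]) -> List[float]:
    """Hypothetical stand-in for the scanner: fraction of query words in each sentence."""
    query_words = set(query.lower().split())
    return [
        len(query_words & set(s.lower().split())) / max(len(query_words), 1)
        for s in sentences
    ]


def select_top_k(sentences: List[str], scores: List[float], k: int) -> List[str]:
    """Keep the k highest-scoring sentences, returned in original document order."""
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(ranked)]


def answer_context(
    query: str,
    sentences: List[str],
    k: int,
    scorer: Callable[[str, List[str]], List[float]] = toy_scorer,
) -> List[str]:
    """Score every sentence against the query, then select the context for the generator."""
    scores = scorer(query, sentences)
    return select_top_k(sentences, scores, k)


document = [
    "Anna received a letter from her brother.",
    "The weather was mild that spring.",
    "Anna decided to leave the village.",
]
context = answer_context("why did anna leave", document, k=2)
```

In a real system the selected sentences would then be placed in the generator's prompt together with the query.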

System Modules

Single-Pass Scanner

Process the entire document conditioned on the query to identify relevant sentences

Model or implementation: Mamba-2 (130M or 1.3B parameters) with binary classification head

Generator

Synthesize the final answer from the selected sentences

Model or implementation: GPT-4o or Llama-3.1

Novel Architectural Elements

Replacement of Mamba-2's language modeling head with a sentence-level binary classification head
Single-pass scoring where sentence relevance is conditioned on the entire preceding document stream rather than isolated chunks
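
A sentence-level classification head of this kind can be sketched in a few lines. The sketch assumes that each sentence is scored from the hidden state at its final token and that the head is a single linear layer followed by a sigmoid; both are assumptions for illustration, and `score_sentences` is a hypothetical name rather than the paper's implementation.

```python
import math
from typing import List, Sequence


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def score_sentences(
    hidden_states: Sequence[Sequence[float]],
    sentence_end_positions: Sequence[int],
    weight: Sequence[float],
    bias: float = 0.0,
) -> List[float]:
    """Map the hidden state at each sentence-final token to a relevance score in (0, 1)."""
    scores = []
    for pos in sentence_end_positions:
        h = hidden_states[pos]  # state after reading everything up to this token
        logit = sum(w * v for w, v in zip(weight, h)) + bias
        scores.append(sigmoid(logit))
    return scores


# Four tokens with 2-dimensional hidden states; sentences end at tokens 1 and 3.
hidden = [[0.0, 0.0], [1.0, -1.0], [0.5, 0.5], [2.0, 0.0]]
scores = score_sentences(hidden, sentence_end_positions=[1, 3], weight=[1.0, 1.0])
```

Because the hidden state at a sentence's last token already summarizes the query and all earlier text, a head this small is enough to produce context-aware scores.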

Modeling

Base Model: Mamba-2 (130M and 1.3B variants)

Training Method: Supervised Fine-Tuning (SFT) for binary classification

Objective Functions:

Purpose: Minimize classification error for relevant vs irrelevant sentences.

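Formally: weighted binary cross-entropy loss $\mathcal{L} = -\sum_i w_i \left[ r_i \log z_{s_i} + (1 - r_i) \log (1 - z_{s_i}) \right]$, where $r_i$ is the binary label for sentence $i$, $z_{s_i}$ is the model's predicted relevance score for that sentence, and $w_i$ is a weight used to upsample positive labels.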
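
The objective can be written out directly in code. This is a minimal sketch assuming that the weight equals a single `pos_weight` value for positive sentences and 1 for negative ones; the summary only says the weight upsamples positives, so that scheme, the clipping constant `eps`, and the function name are assumptions.

```python
import math
from typing import Sequence


def weighted_bce(
    probs: Sequence[float],
    labels: Sequence[int],
    pos_weight: float = 1.0,
    eps: float = 1e-12,
) -> float:
    """Sum of -w_i * [r_i * log(z_i) + (1 - r_i) * log(1 - z_i)] over sentences."""
    total = 0.0
    for z, r in zip(probs, labels):
        z = min(max(z, eps), 1.0 - eps)  # clip to avoid log(0)
        w = pos_weight if r == 1 else 1.0  # assumed weighting: upweight positives only
        total += -w * (r * math.log(z) + (1 - r) * math.log(1.0 - z))
    return total


# One relevant and one irrelevant sentence, both scored 0.5; positives count double.
loss = weighted_bce([0.5, 0.5], [1, 0], pos_weight=2.0)
```

Upweighting positives matters here because relevant sentences are a small minority in a long document, so an unweighted loss would be dominated by the negatives.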

Adaptation: Fine-tuning of full model weights and new classification head

Training Data:

1 million link-based synthetic data points for 130M model
400k link-based synthetic data points for 1.3B model
Documents sourced from Project Gutenberg, Gov reports, SEC filings, legal contracts

Key Hyperparameters:

learning_rate: tuned on validation sets (specific value not stated in the text)
epochs: 1
training_context_length: 10k tokens

Compute: 1.3B model trained in 5 hours on 8x H100s; 130M model trained in 3 hours on 8x H100s

Comparison to Prior Work

vs. Embedding Models (NV-Embed, Stella): Single-Pass Scanner processes full document context in one go (linear time) rather than chunking and embedding locally (losing global context)
vs. Full-Context LLMs (GPT-4o): Single-Pass Scanner acts as a filter to select only relevant sentences, reducing cost while retaining accuracy
vs. Context-Aware Embeddings (Morris & Rush): Single-Pass Scanner conditions on *entire* preceding context, whereas context-aware embeddings typically see only limited neighborhood context
vs. M2-BERT [not cited in paper]: M2-BERT uses Monarch Mixer to produce embeddings; this paper uses Mamba-2 to classify sentences directly during the scan

Limitations

Relies on a generator LLM for the final answer; the scanner only selects sentences
Training limited to 10k token length (though generalizes to 256k)
Link-based data generation incurs API costs, since it uses GPT-4o-mini
The document must be processed linearly; a sentence cannot be scored by random access without first scanning the preceding context

Reproducibility

Code: https://github.com/MambaRetriever/MambaRetriever

Code, datasets, and checkpoints are publicly released at the repository above. Synthetic data generation prompts are given in the appendix. Training data sizes and sources are specified.

📊 Experiments & Results

Evaluation Setup

Long-document Question Answering

Benchmarks:

41 long-document QA benchmark test sets spanning Educational, Creative, Official, and Conversational documents

Metrics:

Accuracy (judged by GPT-4o)
Inference Speed (tokens/sec or doc/sec)
FLOPs
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Single-Pass Scanners outperform all embedding baselines on average accuracy across 41 datasets.
Average of 41 QA datasets	Accuracy	46.2	49.6	+3.4
Average of 41 QA datasets	Accuracy	44.6	49.6	+5.0
Average of 41 QA datasets	Accuracy	39.4	49.6	+10.2
Computational efficiency comparisons showing Single-Pass Scanners are competitive or better in speed and FLOPs.
Inference Efficiency	TFLOPs (with padding)	119	26	-93
Inference Efficiency	Time (s)	0.38	0.33	-0.05
Ablation on synthetic data strategy shows link-based generation is superior.
Average Accuracy	Accuracy	36.5	45.7	+9.2
Average Accuracy	Accuracy	40.9	45.7	+4.8

Experiment Figures

Performance of Single-Pass Scanner vs Baselines vs GPT-4o across increasing document lengths (up to >256k tokens).

Ablation study on context size, comparing full single-pass context vs small (sentence) and medium (1024 token) chunks.

Main Takeaways

Single-Pass Scanner generalizes to documents up to 256k tokens despite being trained on only 10k tokens, converging to GPT-4o full-context performance at extreme lengths.
Link-based synthetic data is crucial for training the scanner to recognize long-range dependencies; standard chunk-based or pair-based data leads to significantly worse performance.
The method is robust to the choice of generator, outperforming embedding baselines whether using GPT-4o or Llama-3.1 (8B/70B) for the final answer.
SPScanner 130M is far cheaper to run (1.8 TFLOPs vs 119 TFLOPs for NV-Embed, roughly 66x fewer) while still outperforming the larger embedding model in accuracy (47.3 vs 46.2).

📚 Prerequisite Knowledge

Prerequisites

State Space Models (SSMs) / Mamba architecture
Retrieval-Augmented Generation (RAG)
Contrastive learning vs. Binary classification for retrieval

Key Terms

SSM: State Space Model—a sequence model architecture that scales linearly with sequence length, unlike Transformers' quadratic scaling

Mamba-2: A specific efficient State Space Model architecture used as the backbone for the Single-Pass Scanner

FLOPs: Floating Point Operations—a measure of computational cost

link-based generation: A synthetic data creation method that finds thematically linked but distant text chunks and generates questions requiring both to answer

sliding window: A technique to process sequences longer than a model's maximum context by moving a fixed-size window over the text with some overlap

RAG: Retrieval-Augmented Generation—systems that retrieve relevant information to help an LLM answer questions

MTEB: Massive Text Embedding Benchmark—a standard leaderboard for evaluating embedding models

Contriever: A dense retrieval model trained using contrastive learning
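
The sliding-window entry above can be made concrete with a short sketch. The function name `sliding_windows` and its parameters are hypothetical; this illustrates the general technique as defined in the glossary, not a specific baseline configuration from the paper.

```python
from typing import List, Sequence


def sliding_windows(tokens: Sequence[int], window_size: int, overlap: int) -> List[List[int]]:
    """Split a token sequence into fixed-size windows that share `overlap` tokens."""
    if window_size <= 0 or not 0 <= overlap < window_size:
        raise ValueError("require window_size > 0 and 0 <= overlap < window_size")
    if not tokens:
        return []
    step = window_size - overlap
    windows = []
    start = 0
    while True:
        windows.append(list(tokens[start:start + window_size]))
        if start + window_size >= len(tokens):
            break  # this window already reaches the end of the sequence
        start += step
    return windows


# Ten tokens, windows of four, each sharing two tokens with its neighbor.
windows = sliding_windows(list(range(10)), window_size=4, overlap=2)
```

Each window is processed independently, so information outside a window is invisible to the model. That is the limitation single-pass scanning is designed to avoid.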