OSCAR: Online Soft Compression And Reranking

📝 Paper Summary

Modularized RAG pipeline Retrieval

OSCAR uses a small, fast LLM to compress retrieved documents into query-dependent embedding vectors on the fly, simultaneously reranking them to reduce computational cost in RAG.

Core Problem

Scaling RAG pipelines is computationally expensive because self-attention cost grows quadratically with the number of tokens in the retrieved documents.

Why it matters:

Long contexts slow down generation significantly, making real-time RAG applications sluggish
Existing 'hard' compression (summarization/pruning) is versatile but has low compression rates
Existing 'soft' compression (embedding mapping) usually requires offline pre-computation, limiting flexibility for new queries or dynamic corpora

Concrete Example: In a standard RAG setup with 10 retrieved documents of 128 tokens each, the generator must process 1,280 tokens of context. OSCAR compresses each document into just 8 vectors, so the generator receives 80 embedding vectors instead (a 16× reduction) while maintaining accuracy.
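The arithmetic behind this example can be checked in a few lines. This is a minimal sketch using only the numbers quoted above. It ignores the query tokens, and the quadratic attention estimate is a rough rule of thumb, not a figure measured in the paper.

```python
# Context-size arithmetic for the example above (numbers taken from the text).
num_docs = 10
tokens_per_doc = 128
vectors_per_doc = 8

uncompressed_len = num_docs * tokens_per_doc           # context tokens without compression
compressed_len = num_docs * vectors_per_doc            # embedding vectors after compression
compression_rate = tokens_per_doc // vectors_per_doc   # per-document compression factor

# Assumption: self-attention cost scales with the square of the sequence length,
# so the attention term alone shrinks by about compression_rate ** 2.
attention_ratio = (uncompressed_len / compressed_len) ** 2

print(uncompressed_len, compressed_len, compression_rate, attention_ratio)
```

Only the attention term shrinks quadratically. The feed-forward layers scale linearly with sequence length, so end-to-end gains are smaller than this ratio suggests.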

Key Novelty

Query-Dependent Online Soft Compression And Reranking (OSCAR)

Compresses documents into continuous embeddings dynamically at inference time using the current query, allowing higher compression rates than static offline methods
Integrates reranking into the compression step: the compressor model outputs both compressed vectors for the generator and a relevance score for the document in a single forward pass

Architecture

Overview of OSCAR's inference pipeline. Left: Standard RAG. Right: OSCAR pipeline showing Compressor LLM taking (Query, Document) and outputting compressed embeddings + reranking score, which are then fed to the Generator LLM.

Evaluation Highlights

2.2–3.3× inference speed-up compared to uncompressed Mistral-7B baseline while achieving higher evaluation scores across 6 QA datasets
Matches performance of 'hard' compression baselines (Provence, RECOMP) while being significantly faster
Mistral-24B backbone with OSCAR-llama enables 5× decrease in computational complexity with improved results

Breakthrough Assessment

8/10

Effective unification of soft compression and reranking. The ability to perform query-dependent compression online with such high speed-ups (up to 5x) without accuracy loss is a significant practical advance for RAG.

⚙️ Technical Details

Problem Definition

Setting: Retrieval-Augmented Generation where retrieved documents are compressed into vector representations before generation

Inputs: Query q and a set of retrieved documents {d_1, ..., d_k}

Outputs: Generated answer a and (optionally) reranking scores r_i

Pipeline Flow

Retriever (fetches top-k documents)
Compressor/Reranker (simultaneously compresses docs to embeddings and scores them)
Generator (produces answer from compressed embeddings)
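The three stages above can be sketched as plain function composition. This is an illustrative outline, not the authors' code: `run_pipeline`, the `keep_top` cutoff, and the toy stand-in functions are hypothetical, and the way scores are used (keeping the highest-scoring documents) is an assumption.

```python
from typing import Callable, List, Tuple

Vector = List[float]
Compressed = Tuple[List[Vector], float]  # (soft embeddings, relevance score)


def run_pipeline(
    query: str,
    retrieve: Callable[[str, int], List[str]],
    compress_and_score: Callable[[str, str], Compressed],
    generate: Callable[[str, List[Vector]], str],
    k: int = 10,
    keep_top: int = 5,  # hypothetical cutoff, not a value from the paper
) -> str:
    docs = retrieve(query, k)
    # One compressor call per (query, document) pair yields embeddings and a score.
    scored = [compress_and_score(query, doc) for doc in docs]
    scored.sort(key=lambda item: item[1], reverse=True)
    # Concatenate the embeddings of the best-scoring documents for the generator.
    context = [vec for vectors, _ in scored[:keep_top] for vec in vectors]
    return generate(query, context)


# Toy stand-ins so the sketch runs end to end; none of these are real models.
CORPUS = [
    "paris is the capital of france",
    "berlin is in germany",
    "the seine flows through paris",
]


def toy_retrieve(query: str, k: int) -> List[str]:
    return CORPUS[:k]


def toy_compress_and_score(query: str, doc: str) -> Compressed:
    overlap = len(set(query.split()) & set(doc.split()))
    vectors = [[float(overlap)] * 4 for _ in range(8)]  # 8 vectors per document
    return vectors, float(overlap)


def toy_generate(query: str, context: List[Vector]) -> str:
    return f"answer built from {len(context)} vectors"


answer = run_pipeline(
    "capital of france", toy_retrieve, toy_compress_and_score, toy_generate,
    k=3, keep_top=2,
)
print(answer)
```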

System Modules

Retriever

Fetch initial candidate documents

Model or implementation: SPLADE-v3 (for main experiments)

Compressor

Map (Query, Document) pairs to compact soft embeddings and optional relevance score

Model or implementation: OSCAR-llama (Llama-3.2-1B-Instruct) or OSCAR-N-Layers (Headless Transformer)

Generator

Generate answer using compressed context

Model or implementation: Mistral-7B, Qwen2-7B, or Mistral-24B (with LoRA)

Novel Architectural Elements

Dual-purpose Compressor: A single forward pass of the compressor LLM generates both the compressed soft prompts (via memory tokens) and the reranking score (via a specialized reranking token).
Online Query-Dependent Soft Compression: Unlike prior soft compression that pre-computes static embeddings, OSCAR computes embeddings conditioned on the query at inference time.
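To make the dual-purpose idea concrete, the sketch below shows one way a single forward pass could yield both outputs. The token names (`<MEM0>`, `<RR>`), the ordering of the input sequence, and the linear scoring head are assumptions for illustration. The hidden states are fabricated rather than produced by a real model.

```python
from typing import List, Tuple

NUM_MEMORY_TOKENS = 8  # matches the 128 tokens -> 8 vectors setting in the text


def build_compressor_input(query_tokens: List[str], doc_tokens: List[str]) -> List[str]:
    """Append memory tokens and one reranking token after the query and document."""
    memory = [f"<MEM{i}>" for i in range(NUM_MEMORY_TOKENS)]
    return query_tokens + doc_tokens + memory + ["<RR>"]


def split_compressor_output(
    hidden_states: List[List[float]], score_weights: List[float]
) -> Tuple[List[List[float]], float]:
    """Read the soft embeddings and the relevance score from one forward pass."""
    rr_state = hidden_states[-1]                                 # position of <RR>
    memory_states = hidden_states[-(NUM_MEMORY_TOKENS + 1):-1]   # positions of <MEM*>
    # Hypothetical scoring head: a dot product with a learned weight vector.
    score = sum(w * h for w, h in zip(score_weights, rr_state))
    return memory_states, score


sequence = build_compressor_input(
    ["what", "is", "rag"],
    ["rag", "combines", "retrieval", "and", "generation"],
)
# Fabricated hidden states, one 2-d vector per position, standing in for a real LLM.
fake_hidden = [[float(i), 1.0] for i in range(len(sequence))]
embeddings, relevance = split_compressor_output(fake_hidden, [0.5, -1.0])
print(len(sequence), len(embeddings), relevance)
```

Because both outputs come from the same pass over the (query, document) pair, reranking adds essentially one extra token position rather than a second model call.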

Modeling

Base Model: Compressor: Llama-3.2-1B-Instruct or first N layers of backbone. Generator: Mistral-7B-Instruct-v0.2, Qwen2-7B-Instruct, or Mistral-Small-24B-Instruct.

Training Method: Supervised fine-tuning via distillation

Objective Functions:

Purpose: Train generator to match teacher LLM output.

Formally: Cross-entropy loss on teacher-generated labels (answer a_i) conditioned on compressed inputs.

Purpose: Train compressor to rank documents (optional).

Formally: L2 loss between predicted score r_i and teacher cross-encoder score r'_i.
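The two objectives can be combined as in the sketch below. It assumes the losses are summed, with the reranking term weighted by a coefficient (the `reranking_loss_lambda` hyperparameter, 0.05). Averaging over tokens and documents is also an assumption, as are the function names.

```python
import math
from typing import List


def cross_entropy(teacher_token_probs: List[float]) -> float:
    """Mean negative log-probability the generator assigns to the teacher's tokens."""
    return -sum(math.log(p) for p in teacher_token_probs) / len(teacher_token_probs)


def rerank_l2(predicted: List[float], teacher: List[float]) -> float:
    """Mean squared error between predicted scores r_i and teacher scores r'_i."""
    return sum((p - t) ** 2 for p, t in zip(predicted, teacher)) / len(predicted)


def total_loss(
    teacher_token_probs: List[float],
    predicted_scores: List[float],
    teacher_scores: List[float],
    lam: float = 0.05,
) -> float:
    # Assumed combination: generation loss plus weighted reranking loss.
    return cross_entropy(teacher_token_probs) + lam * rerank_l2(predicted_scores, teacher_scores)


# Illustrative values only; they do not come from the paper.
print(total_loss([0.9, 0.8, 0.95], [2.1, -0.3], [2.5, -1.0]))
```

With a small weight such as 0.05, the reranking term acts as an auxiliary signal and the generation loss dominates the gradient.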

Adaptation: Generator uses LoRA; Compressor uses full fine-tuning.

Training Data:

893k queries (from MS MARCO and internal datasets)
Wikipedia-KILT document collection
Teacher labels from Mistral-7B seeing full uncompressed context

Key Hyperparameters:

retrieved_docs_k_train: 5
retrieved_docs_k_inference: 10
compression_rate: 16x (128 tokens -> 8 vectors)
reranking_loss_lambda: 0.05
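For reference, the reported settings can be collected in a small configuration object. The class and field names below are hypothetical. Only the values come from the list above, and the compression rate is derived rather than stored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OscarHyperparams:
    """Hypothetical container for the hyperparameters reported in the summary."""

    retrieved_docs_k_train: int = 5
    retrieved_docs_k_inference: int = 10
    doc_tokens: int = 128
    memory_vectors: int = 8
    reranking_loss_lambda: float = 0.05

    @property
    def compression_rate(self) -> int:
        return self.doc_tokens // self.memory_vectors

    def generator_context_vectors(self, training: bool) -> int:
        """Number of embedding vectors the generator sees for one query."""
        k = self.retrieved_docs_k_train if training else self.retrieved_docs_k_inference
        return k * self.memory_vectors


config = OscarHyperparams()
print(config.compression_rate, config.generator_context_vectors(training=False))
```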

Compute: Training/Inference on GPUs (specific hardware not explicitly reported in text, profiler used for FLOPs).

Comparison to Prior Work

vs. Provence: OSCAR uses soft embeddings instead of text pruning, allowing higher information density and compression rates.
vs. PISCO: OSCAR is online and query-dependent, whereas PISCO compresses documents offline independently of the query.
vs. FiD-light: OSCAR achieves higher compression and uses a decoder-only architecture suitable for modern LLMs.
vs. COCOM [not cited in paper]: COCOM also does soft compression but focuses on offline/independent compression; OSCAR adds the query-dependent online aspect and joint reranking.

Limitations

Compressor is backbone-specific: must be retrained for every different generation LLM (unlike text-based hard compression).
Requires fine-tuning the generator LLM (or adapters), unlike plug-and-play lexical methods.
Efficiency gains diminish for very small backbones where the compressor overhead is relatively larger.
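The last limitation follows from a simple cost model, sketched below. It uses the common rule of thumb of about 2 FLOPs per parameter per processed token, a 1B-parameter compressor (the size of Llama-3.2-1B), and an assumed 16-token query. It ignores decoding cost and the quadratic attention term. The resulting numbers are illustrative and are not the speed-ups measured in the paper.

```python
def forward_flops(params: float, tokens: int) -> float:
    # Rule of thumb (assumption): about 2 FLOPs per parameter per processed token.
    return 2.0 * params * tokens


def estimated_speedup(
    generator_params: float,
    compressor_params: float = 1e9,
    num_docs: int = 10,
    doc_tokens: int = 128,
    vectors_per_doc: int = 8,
    query_tokens: int = 16,  # hypothetical query length
) -> float:
    # Baseline: the generator reads every document token plus the query.
    baseline = forward_flops(generator_params, num_docs * doc_tokens + query_tokens)
    # Compressed: one compressor pass per document, then a short generator input.
    compressor = num_docs * forward_flops(
        compressor_params, query_tokens + doc_tokens + vectors_per_doc
    )
    generator = forward_flops(generator_params, num_docs * vectors_per_doc + query_tokens)
    return baseline / (compressor + generator)


for name, size in [("1B", 1e9), ("7B", 7e9), ("24B", 24e9)]:
    print(name, round(estimated_speedup(size), 2))
```

Under these assumptions the compressor cost is fixed, so its share of the total shrinks as the generator grows. When the generator is as small as the compressor, the estimate drops below 1, meaning no net gain.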

Reproducibility

Code: https://huggingface.co/collections/naver/oscar

Models available at huggingface.co/collections/naver/oscar. Training queries and distillation labels to be released. Code for training pipeline not explicitly linked but methodology is detailed.

📊 Experiments & Results

Evaluation Setup

Open-domain QA with retrieval from KILT or PUBMED

Benchmarks:

Natural Questions (Open-domain QA)
TriviaQA (Open-domain QA)
HotpotQA (Multi-hop QA)
ASQA (Ambiguous QA)
PopQA (Entity-centric QA)
BIOASQ-12B (Biomedical QA)

Metrics:

LLM-based accuracy (GPT-4 judge)
Inference Speed-up (FLOPs reduction)
Statistical methodology: Not explicitly reported in the paper

Key Results

Main comparison showing OSCAR's speed-up and performance retention against uncompressed baselines and hard compression methods.

Benchmark	Metric	Baseline	This Paper	Δ
Average across 6 datasets	LLM Score	43.3	44.6	+1.3
Average across 6 datasets	Speed-up	1.0×	3.3×	+2.3×
Average across 6 datasets	LLM Score	42.7	44.6	+1.9

Scaling results demonstrating OSCAR's effectiveness on larger backbones (Mistral-24B).

Benchmark	Metric	Baseline	This Paper	Δ
Average across 6 datasets	Speed-up	1.0×	5.0×	+4.0×
Average across 6 datasets	LLM Score	50.1	51.1	+1.0

Experiment Figures

Scatter plot of Accuracy vs. Speed-up (FLOPs) for various methods (OSCAR variants, Provence, RECOMP, No Compression).

GPT-4 Pairwise comparison win-rates against Mistral-7B baseline.

Main Takeaways

OSCAR consistently achieves 2-5x inference speed-ups with no loss (and often slight gains) in accuracy compared to uncompressed baselines.
Outperforms state-of-the-art hard compression methods (Provence, RECOMP) in both accuracy and speed.
Robust to backbone size: efficiency gains increase with larger models (e.g., 5x speedup for 24B model vs 3.3x for 7B model).
Successfully learns to rerank documents: joint training yields a model that can compress and rerank simultaneously without performance degradation.

📚 Prerequisite Knowledge

Prerequisites

Retrieval-Augmented Generation (RAG) architecture
Transformer architecture (specifically attention mechanisms)
Knowledge Distillation

Key Terms

RAG: Retrieval-Augmented Generation—AI systems that answer questions by first searching for relevant documents

Soft Compression: Mapping text not to shorter text, but to a sequence of continuous vector embeddings (soft tokens) that an LLM can process

Hard Compression: Shortening text by selecting sentences (pruning) or rewriting it (summarization) into fewer discrete tokens

LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning technique that freezes pre-trained weights and injects trainable rank decomposition matrices

Distillation: Training a smaller student model to mimic the outputs or internal states of a larger, more capable teacher model

Cross-Encoder: A re-ranking model that processes the query and document together to output a relevance score, typically more accurate but slower than bi-encoders

SPLADE: Sparse Lexical and Expansion Model—a sparse retrieval method that learns term importance and expansion for effective keyword-based search