MURE: Hierarchical Multi-Resolution Encoding via Vision-Language Models for Visual Document Retrieval

📝 Paper Summary

Visual Document Retrieval (VDR) · Vision-Language Models (VLMs)

MURE improves document retrieval by encoding images at multiple resolutions using Matryoshka learning for feature fusion and hierarchical clustering for efficient token compression.

Core Problem

Existing VDR models struggle to balance detail and efficiency: fixed-resolution models lose fine-grained visual cues, while native-resolution models generate excessive tokens, causing high storage and latency costs.

Why it matters:

High-resolution documents (e.g., maps, charts) require fine-grained perception that standard 336x336 encoding misses.
Excessive visual tokens in native-resolution approaches (like ColPali) explode index sizes, making large-scale retrieval computationally prohibitive.
Current methods fail to simultaneously capture high-level layout structure and low-level text details within a unified efficient representation.

Concrete Example: When answering 'What is the background color of the US on the map?', a model needs a coarse global view. For 'Which color represents countries in the legend?', it needs a fine-grained view. Single-resolution models typically fail at one of these, either missing the legend details or losing the global map context.

Key Novelty

X-VisEmb Paradigm (Multi-Resolution Encoding with Adaptive Distillation)

Applies an 'optical zoom' strategy by resizing document images into a hierarchy of grids (1x1 to 2x3) to capture both global structure and local details.
Uses Resolution-level Matryoshka Representation Learning (RMRL) to nest features, allowing the model to prioritize coarse features while progressively adding fine details.
Employs semantic-aware hierarchical clustering during indexing to compress redundant visual tokens into a fixed budget without retraining.

Architecture

The inference pipeline of MURE, illustrating how a document is processed from image to compressed embedding.

Evaluation Highlights

Outperforms ColMate by +1.9% (ViDoRe V1) and +2.3% (ViDoRe V2) in NDCG@5, setting a new SOTA for PaliGemma-based retrievers.
Surpasses the full-resource ColPali model using only 512 visual tokens (50% of ColPali's budget) on both benchmarks.
Maintains 95.2% of full performance on ViDoRe V1 even when compressed to just 128 tokens, demonstrating extreme storage efficiency.

Breakthrough Assessment

8/10

Successfully addresses the critical efficiency-granularity trade-off in VDR. The ability to beat SOTA with 50% fewer tokens is a significant practical advantage for deployment.

⚙️ Technical Details

Problem Definition

Setting: Visual Document Retrieval: Given a textual query q and a collection of page images P, retrieve the top-K most relevant pages.

Inputs: Query text q and a page image p from the collection P

Outputs: Relevance score s(q, p)

Pipeline Flow

Input Processing: Multi-Resolution Sampling -> Visual Encoder
Feature Fusion: LLM Backbone -> RMRL Projection
Compression (Offline): Hierarchical Clustering -> Compressed Index
Retrieval (Online): Query Encoding -> MaxSim Scoring
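
The online retrieval step can be illustrated with a minimal pure-Python sketch of MaxSim scoring over a multi-vector index. The function names, the dictionary-based index, and the toy 2-dimensional embeddings are assumptions made for illustration; they are not taken from the paper's implementation.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def maxsim_score(query_tokens, doc_tokens):
    """For each query token take its best match among document tokens, then sum."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)

def retrieve_top_k(query_tokens, index, k=5):
    """Score every page in the index and return the k highest-scoring (page_id, score) pairs."""
    scored = [(page_id, maxsim_score(query_tokens, toks)) for page_id, toks in index.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]

# Toy data: two query token embeddings and two pages with two visual tokens each.
query = [[1.0, 0.0], [0.0, 1.0]]
index = {
    "page_a": [[1.0, 0.0], [0.5, 0.5]],
    "page_b": [[0.0, 1.0], [1.0, 0.0]],
}
ranking = retrieve_top_k(query, index, k=2)
print(ranking)
```

With these toy vectors, page_b scores 2.0 (each query token finds an exact match) and page_a scores 1.5, so page_b ranks first.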

System Modules

Multi-Resolution Sampler (Input Processing)

Resizes input image into hierarchical grids (1x1, 1x2, 2x2, 2x3) to create multi-scale views

Model or implementation: Image Preprocessing
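
As a rough sketch of the sampler, the snippet below computes the crop boxes that each grid would produce for a page image. The (rows, cols) reading of "1x2" and "2x3", the integer partitioning, and the helper names are assumptions; in a real pipeline each box would additionally be cropped and resized to the visual encoder's input size, a step not shown here.

```python
# Grid list from the summary; interpreting each entry as (rows, cols) is an assumption.
GRIDS = [(1, 1), (1, 2), (2, 2), (2, 3)]

def tile_boxes(width, height, rows, cols):
    """Return (left, top, right, bottom) boxes that partition the image into rows x cols tiles."""
    boxes = []
    for r in range(rows):
        for c in range(cols):
            left = c * width // cols
            right = (c + 1) * width // cols
            top = r * height // rows
            bottom = (r + 1) * height // rows
            boxes.append((left, top, right, bottom))
    return boxes

def multi_resolution_views(width, height, grids=GRIDS):
    """Map each grid label (e.g. '2x3') to its list of tile boxes."""
    return {f"{rows}x{cols}": tile_boxes(width, height, rows, cols) for rows, cols in grids}

views = multi_resolution_views(1200, 800)
total_tiles = sum(len(boxes) for boxes in views.values())
print(total_tiles)  # 1 + 2 + 4 + 6 tiles across the four levels
```

Under this reading, one page yields 13 views in total: a single global view plus 12 progressively finer tiles.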

Visual Encoder (Input Processing)

Encodes image patches into latent features

Model or implementation: SigLIP-So400m (part of PaliGemma)

LLM Backbone (Feature Fusion)

Fuses multi-scale features via self-attention

Model or implementation: PaliGemma-3B (Transformer)

RMRL Projector (Feature Fusion)

Projects features into a nested embedding space where coarse levels are subsets of fine levels

Model or implementation: Linear Projection

Token Compressor

Compresses visual tokens to a fixed budget during indexing

Model or implementation: Hierarchical Agglomerative Clustering (HAC)
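
The compression step can be sketched as a greedy agglomerative merge that stops once the token budget is met. The cosine similarity measure, the size-weighted centroid used to represent a merged cluster, and the function names are all assumptions; the summary does not specify the linkage criterion behind the "semantic-aware" clustering.

```python
import math

def cosine(u, v):
    """Cosine similarity; assumes neither vector is all zeros."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)

def compress_tokens(tokens, budget):
    """Repeatedly merge the two most similar clusters until at most `budget` remain."""
    # Each cluster stores (centroid, number of original tokens it represents).
    clusters = [(list(t), 1) for t in tokens]
    while len(clusters) > budget:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sim = cosine(clusters[i][0], clusters[j][0])
                if best is None or sim > best[0]:
                    best = (sim, i, j)
        _, i, j = best
        (ci, ni), (cj, nj) = clusters[i], clusters[j]
        # Size-weighted mean keeps the centroid equal to the mean of all member tokens.
        merged = [(a * ni + b * nj) / (ni + nj) for a, b in zip(ci, cj)]
        clusters.pop(j)  # j > i, so index i is still valid after the pop
        clusters[i] = (merged, ni + nj)
    return [centroid for centroid, _ in clusters]

tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
compressed = compress_tokens(tokens, budget=2)
print(compressed)
```

This naive loop is cubic in the number of tokens and is meant only to show the idea. Because merging happens at indexing time on already-computed embeddings, the budget can be changed without retraining the encoder, which matches the summary's description.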

Novel Architectural Elements

Hierarchical multi-resolution sampling input stream (feeding 1x1, 1x2, 2x2, 2x3 grids simultaneously)
Resolution-level Matryoshka embedding structure where embedding subsets correspond to specific grid resolutions

Modeling

Base Model: PaliGemma-3B (SigLIP-So400m visual encoder + Gemma 2B LLM)

Training Method: Supervised Fine-Tuning with Contrastive Loss

Objective Functions:

Purpose: Optimize retrieval at each granularity level simultaneously.

Formally: $\mathcal{L}_{\text{total}} = \sum_{k} w_k \, \mathcal{L}_{\text{contrastive}}(D^{(k)})$, summed over all granularity levels $k$, where $\mathcal{L}_{\text{contrastive}}$ is InfoNCE.
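
A minimal sketch of this objective follows, assuming in-batch negatives with the positive page on the diagonal of each query-by-page score matrix. The weights and temperature are the values listed under Key Hyperparameters; pairing the weights with levels in coarse-to-fine order, and the function names, are assumptions.

```python
import math

LOSS_WEIGHTS = [1.0, 1.5, 2.0, 2.5]  # w_k from the summary; coarse-to-fine order is assumed
TEMPERATURE = 0.02                    # contrastive temperature from the summary

def info_nce(score_matrix, temperature=TEMPERATURE):
    """In-batch InfoNCE: row i holds query i's scores, with its positive page at column i."""
    losses = []
    for i, row in enumerate(score_matrix):
        logits = [s / temperature for s in row]
        m = max(logits)  # subtract the max before exponentiating for numerical stability
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)

def total_loss(per_level_scores, weights=LOSS_WEIGHTS):
    """Weighted sum of the contrastive loss computed at each granularity level."""
    return sum(w * info_nce(scores) for w, scores in zip(weights, per_level_scores))

# Sanity check: if every page scores the same, each level's loss equals log(batch size).
uninformative = [[0.5, 0.5], [0.5, 0.5]]
loss = total_loss([uninformative] * 4)
print(loss)
```

With uninformative scores the result is the sum of the weights times log 2, which is a convenient baseline to compare against while training.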

Adaptation: LoRA (rank=32, alpha=32) on LLM transformer layers

Trainable Parameters: LoRA parameters and projection layer

Training Data:

ViDoRe V1 training split (118k query-page pairs)

Key Hyperparameters:

learning_rate: 5e-5
batch_size: 64
epochs: 2
contrastive_temperature: 0.02
loss_weights: {1.0, 1.5, 2.0, 2.5}
grid_configurations: {1x1, 1x2, 2x2, 2x3}
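
For reference, the reported settings can be gathered into a single configuration object. The class name and field layout are hypothetical, and the one-to-one pairing of loss weights with grids in listed order is an assumption; the values themselves are the ones reported in this summary.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MureTrainingConfig:  # hypothetical container, not from the paper's codebase
    learning_rate: float = 5e-5
    batch_size: int = 64
    epochs: int = 2
    contrastive_temperature: float = 0.02
    loss_weights: tuple = (1.0, 1.5, 2.0, 2.5)
    grid_configurations: tuple = ((1, 1), (1, 2), (2, 2), (2, 3))
    lora_rank: int = 32
    lora_alpha: int = 32

config = MureTrainingConfig()
# Assumed pairing: the i-th loss weight belongs to the i-th grid.
weight_by_grid = dict(zip(config.grid_configurations, config.loss_weights))
print(weight_by_grid)
```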

Compute: 8 NVIDIA A100 GPUs

Comparison to Prior Work

vs. ColPali: MURE adds multi-resolution sampling and token compression, achieving similar or better performance with 50% of the tokens.
vs. ColMate: MURE explicitly models hierarchical granularities via RMRL.
vs. Bi-Encoders: MURE retains fine-grained details via multi-vector late interaction.

Limitations

Still requires re-indexing if the granularity configuration changes fundamentally (though HAC helps adaptability).
Performance gap (3.3-11.6%) remains between MURE and the idealized 'oracle' selector from preliminary studies.
Reliance on visual tokens still incurs higher storage than single-vector dense retrievers.
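
The storage trade-off can be made concrete with a back-of-the-envelope calculation. The 1024-token figure follows from the statement that 512 tokens is 50% of ColPali's budget; the 128-dimensional embeddings and 2-byte (float16) values are assumptions, since the summary does not state the embedding size or precision.

```python
def index_bytes(num_pages, tokens_per_page, dim=128, bytes_per_value=2):
    """Raw embedding storage, ignoring any index overhead. dim and precision are assumed."""
    return num_pages * tokens_per_page * dim * bytes_per_value

for tokens in (1024, 512, 128, 1):
    kib = index_bytes(1, tokens) / 1024
    print(f"{tokens:>4} tokens/page -> {kib:g} KiB per page")
```

Under these assumptions a page costs 256 KiB at 1024 tokens, 128 KiB at 512, and 32 KiB at 128, compared with 0.25 KiB for a single-vector retriever of the same dimension.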

Reproducibility

Implemented on top of the ColPali codebase. No code URL is provided. Evaluation uses the standard ViDoRe benchmarks.

📊 Experiments & Results

Evaluation Setup

Retrieval on visually rich documents (PDFs, slides, figures).

Benchmarks:

ViDoRe V1: in-domain visual document retrieval (10 datasets)
ViDoRe V2: out-of-domain and multilingual visual document retrieval (7 datasets)

Metrics:

NDCG@5
Statistical methodology: Not explicitly reported in the paper

Key Results

Comparison of MURE-Full against strong PaliGemma-based baselines shows SOTA performance.
Benchmark	Metric	Baseline	This Paper	Δ
ViDoRe V1	NDCG@5	Not reported in the paper	Not reported in the paper	+1.9%
ViDoRe V2	NDCG@5	Not reported in the paper	Not reported in the paper	+2.3%

Efficiency analysis showing robustness under extreme token compression.
Benchmark	Metric	Baseline	This Paper	Δ
ViDoRe V1	NDCG@5	87.0	82.8	-4.2
ViDoRe V2	NDCG@5	59.5	50.5	-9.0

Comparison against ColPali with controlled token budgets.
Benchmark	Metric	Baseline	This Paper	Δ
ViDoRe V1	NDCG@5	Not reported in the paper	Not reported in the paper	+1.5%
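
The efficiency rows above can be turned into retention percentages with simple arithmetic. This sketch assumes the Baseline column is the full-budget score and the This Paper column is the compressed score; for ViDoRe V1 this reproduces the 95.2% figure quoted for 128 tokens in the Evaluation Highlights, and the ViDoRe V2 value is derived the same way rather than reported directly.

```python
def retention(full_score, compressed_score):
    """Percentage of the full-budget score kept after compression."""
    return 100.0 * compressed_score / full_score

v1 = retention(87.0, 82.8)
v2 = retention(59.5, 50.5)
print(f"ViDoRe V1: {v1:.1f}%  ViDoRe V2: {v2:.1f}%")
```

The larger drop on ViDoRe V2 (about 84.9% retained) suggests that aggressive compression costs more on the out-of-domain benchmark than on the in-domain one.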

Experiment Figures

Performance (NDCG@5) vs. Number of Visual Tokens (Storage Cost) on ViDoRe V1 and V2.

Contribution analysis of different granularity levels across datasets.

Main Takeaways

Multi-resolution sampling works like an 'optical zoom': the medium granularities (1x2, 2x2) contribute the most (78.5%) to the retrieval score, balancing detail and context.
Granularity importance is task-dependent: complex charts (InfoQ) rely heavily on 2x2 grids, while academic papers (ArxivQA) benefit disproportionately from fine 2x3 grids.
The semantic-aware clustering allows MURE to beat full-scale baselines with 50% fewer tokens, identifying 512 tokens as the efficiency 'sweet spot'.

📚 Prerequisite Knowledge

Prerequisites

Vision-Language Models (VLMs)
Visual Document Retrieval (VDR)
Contrastive Learning (InfoNCE)
Matryoshka Representation Learning

Key Terms

VDR: Visual Document Retrieval—retrieving document images based on visual and textual content without OCR

VLM: Vision-Language Model—a model that processes both images and text, used here as the document encoder

RMRL: Resolution-level Matryoshka Representation Learning—a technique to structure embeddings so that coarse-to-fine resolutions are nested, allowing flexible usage

MaxSim: A late-interaction scoring mechanism that, for each query token, takes the maximum similarity over all document visual tokens and then sums these maxima across query tokens

ColPali: A baseline VDR model based on PaliGemma that uses late interaction on native-resolution image patches

HAC: Hierarchical Agglomerative Clustering—an algorithm used here to merge similar visual tokens to reduce storage size

NDCG@5: Normalized Discounted Cumulative Gain at 5—a measure of ranking quality that accounts for the position of relevant items
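
For readers unfamiliar with the metric, here is a minimal sketch of NDCG@5 using the linear-gain form of DCG (for binary relevance it coincides with the exponential-gain form). It assumes the input list holds the relevance labels of all judged pages in ranked order, so the ideal ranking can be obtained by sorting that same list; the function names are illustrative.

```python
import math

def dcg_at_k(relevances, k=5):
    """Discounted cumulative gain over the top-k positions (rank 1 has discount log2(2) = 1)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, k=5):
    """DCG of the given ranking divided by the DCG of the ideal (sorted) ranking."""
    ideal = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0

perfect = ndcg_at_k([1, 0, 0, 0, 0])    # relevant page ranked first
second = ndcg_at_k([0, 1, 0, 0, 0])     # relevant page ranked second
print(perfect, second)
```

Moving the only relevant page from rank 1 to rank 2 lowers the score from 1.0 to 1/log2(3), roughly 0.63, which shows how strongly the metric rewards placing relevant pages at the top.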