Evaluation Setup
Retrieval on visually rich documents (PDFs, slides, figures).
Benchmarks:
- ViDoRe V1: in-domain visual document retrieval (10 datasets)
- ViDoRe V2: out-of-domain and multilingual visual document retrieval (7 datasets)
Metrics:
- NDCG@5
- Statistical methodology: Not explicitly reported in the paper
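For readers unfamiliar with the metric, NDCG@5 can be sketched as follows. This is the standard textbook definition with linear gain, not code from the paper. The function names are illustrative, and the input is assumed to be the relevance labels of the full ranked list for a single query.

```python
import math


def dcg_at_k(relevances, k=5):
    """Discounted cumulative gain over the top-k relevance labels."""
    # Rank 1 is discounted by log2(2) = 1, rank 2 by log2(3), and so on.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, k=5):
    """NDCG@k for one query: DCG of the system ranking over DCG of the ideal ranking.

    Assumption: ranked_relevances covers every labelled document for the query,
    so sorting it in descending order yields the ideal ordering.
    """
    ideal_dcg = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0  # no relevant document exists for this query
    return dcg_at_k(ranked_relevances, k) / ideal_dcg


# One relevant page retrieved at rank 2 instead of rank 1.
score = ndcg_at_k([0, 1, 0, 0, 0])
```

A benchmark score is then the mean of this per-query value over all queries in a dataset. With binary labels, linear and exponential gain give the same result.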
Key Results
Comparison of MURE-Full against strong PaliGemma-based baselines shows SOTA performance.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| ViDoRe V1 | NDCG@5 | Not reported in the paper | Not reported in the paper | +1.9% |
| ViDoRe V2 | NDCG@5 | Not reported in the paper | Not reported in the paper | +2.3% |

Efficiency analysis showing robustness under extreme token compression.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| ViDoRe V1 | NDCG@5 | 87.0 | 82.8 | -4.2 |
| ViDoRe V2 | NDCG@5 | 59.5 | 50.5 | -9.0 |

Comparison against ColPali with controlled token budgets.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| ViDoRe V1 | NDCG@5 | Not reported in the paper | Not reported in the paper | +1.5% |
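In the token-compression rows, Δ matches the absolute difference in NDCG@5 points between the two reported scores. The rows marked with a percent sign do not say whether the gain is absolute or relative. The sketch below uses a hypothetical helper, `deltas`, to compute both readings from the two rows that report raw scores.

```python
def deltas(baseline, ours):
    """Return (absolute difference in points, relative difference in percent)."""
    absolute = round(ours - baseline, 1)
    relative = round(100 * (ours - baseline) / baseline, 1)
    return absolute, relative


# Scores taken from the token-compression rows above.
v1 = deltas(87.0, 82.8)  # (-4.2, -4.8)
v2 = deltas(59.5, 50.5)  # (-9.0, -15.1)
```

The relative drop on ViDoRe V2 (about 15%) is roughly three times the drop on ViDoRe V1 (about 5%), which the absolute figures alone understate.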
Main Takeaways
- Multi-resolution sampling works like an 'optical zoom': medium granularities (1x2, 2x2) contribute the most to the retrieval score (78.5%), balancing detail and context.
- Granularity importance is task-dependent: complex charts (InfoQ) rely heavily on 2x2 grids, while academic papers (ArxivQA) benefit disproportionately from fine 2x3 grids.
- Semantic-aware clustering lets MURE beat full-scale baselines with 50% fewer tokens; 512 tokens is identified as the efficiency 'sweet spot'.
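The grid granularities named above (1x2, 2x2, 2x3) can be illustrated with a minimal sketch that computes crop boxes for each grid on a page image. This is not the paper's implementation. The function names are hypothetical, the reading of "AxB" as rows x columns is an assumption, and the summary does not say whether other views (for example a full-page view) are also used.

```python
def grid_boxes(width, height, rows, cols):
    """Return (left, top, right, bottom) crop boxes for a rows x cols grid."""
    boxes = []
    for r in range(rows):
        for c in range(cols):
            # Integer division keeps boxes contiguous and covers the full page.
            left = c * width // cols
            right = (c + 1) * width // cols
            top = r * height // rows
            bottom = (r + 1) * height // rows
            boxes.append((left, top, right, bottom))
    return boxes


def multi_resolution_boxes(width, height, grids=((1, 2), (2, 2), (2, 3))):
    """Map each grid name (assumed rows x cols) to its list of crop boxes."""
    return {
        f"{rows}x{cols}": grid_boxes(width, height, rows, cols)
        for rows, cols in grids
    }


# Example: a 1200 x 800 pixel page yields 2, 4, and 6 crops respectively.
views = multi_resolution_boxes(1200, 800)
```

Each box could then be passed to an image library's crop routine and encoded separately. Finer grids produce more crops and therefore more tokens, which is where a token budget such as the 512 mentioned above becomes relevant.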