
Lost in Translation and Noise: A Deep Dive into the Failure Modes of VLMs on Real-World Tables

Anshul Singh, Rohan Chaudhary, Gagneet Singh, Abhay Kumary
Indian Institute of Science, Bangalore, Panjab University, Chandigarh
arXiv (2025)
MM Benchmark QA Reasoning

📝 Paper Summary

Vision-Language Models (VLMs), Table Question Answering, Robustness Evaluation
MirageTVQA is a multilingual, visually noisy benchmark that reveals that state-of-the-art VLMs suffer a relative performance drop of over 35% on realistic document scans and fail to transfer reasoning skills outside English.
Core Problem
Current VLM benchmarks for tables use digitally perfect, monolingual images, so they fail to capture the visual noise (scans, blur) and linguistic diversity of real-world documents.
Why it matters:
  • Real-world workflows involve scanned, imperfect documents where OCR (Optical Character Recognition) errors propagate, requiring robust end-to-end visual reasoning
  • Existing benchmarks like FinQA are text-based, ignoring visual layout, while multimodal benchmarks like MMTab are clean and English-centric
  • There is a gap in understanding how well reasoning capabilities transfer to low-resource languages in visual contexts
Concrete Example: A VLM might correctly answer a question from a clean digital PNG of a financial table. However, applying Gaussian blur or 'scan lines' (simulating a real scanned receipt) can cause the same model to fail, even though the text remains legible to humans. In aggregate, the best model's Exact Match (EM) score falls from 25.52% on clean images to 16.50% on noisy ones.
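The kind of degradation described above can be illustrated with a minimal pure-Python sketch. It is not the benchmark's rendering pipeline: the function names, the parameters, and the use of a simple mean (box) filter in place of a true Gaussian blur are all assumptions made for illustration. The image is modeled as a list of rows of 0-255 grayscale values.

```python
def box_blur(img, radius=1):
    """Blur a grayscale image with a mean filter (a stand-in for Gaussian blur)."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total = count = 0
            # Average over the (2*radius+1)^2 window, clipped at the borders.
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < h and 0 <= xx < w:
                        total += img[yy][xx]
                        count += 1
            out[y][x] = total // count
    return out


def add_scan_lines(img, period=4, darken_pct=60):
    """Darken every `period`-th row to mimic scanner banding (hypothetical settings)."""
    return [
        [p * darken_pct // 100 for p in row] if y % period == 0 else list(row)
        for y, row in enumerate(img)
    ]


def degrade(img, radius=1, period=4, darken_pct=60):
    """Apply blur first, then scan lines, returning a new image."""
    return add_scan_lines(box_blur(img, radius), period, darken_pct)
```

Calling `degrade(img)` on a clean rendering yields a softened image with periodic dark bands; a real pipeline would also add skew and compression artifacts, which this sketch omits.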
Key Novelty
MirageTVQA Benchmark
  • Constructs a large-scale dataset (60,000 QA pairs) across 24 languages using a translate-refine-filter pipeline with back-translation verification
  • Introduces a visually-rich rendering pipeline that applies realistic noise (blur, skew, compression, scan artifacts) to simulate real-world document degradation
  • Evaluates VLMs on two specific axes simultaneously: visual robustness against noise and multilingual reasoning transfer
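The back-translation verification step in the first bullet can be sketched as a round-trip filter. This is a hedged illustration only: the summary does not specify the similarity measure or threshold, so the character-level `difflib` ratio, the `threshold` value, and the `translate` / `back_translate` callables (placeholders for whatever translation system is used) are all assumptions.

```python
from difflib import SequenceMatcher


def round_trip_similarity(source, back_translated):
    """Similarity in [0, 1] between a source text and its back-translation."""
    # Normalize case and whitespace so trivial differences are not penalized.
    a = " ".join(source.lower().split())
    b = " ".join(back_translated.lower().split())
    return SequenceMatcher(None, a, b).ratio()


def filter_by_back_translation(items, translate, back_translate, threshold=0.8):
    """Keep translations whose round trip stays close to the source.

    `translate` and `back_translate` are hypothetical callables (str -> str);
    `threshold` is an assumed cutoff, not a value from the paper.
    """
    kept = []
    for text in items:
        translated = translate(text)
        score = round_trip_similarity(text, back_translate(translated))
        if score >= threshold:
            kept.append((text, translated, score))
    return kept
```

Items whose back-translation drifts far from the original are dropped, which is the general idea behind using a round trip as a cheap quality check when no human reference translation is available.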
Evaluation Highlights
  • Relative performance drop of over 35% for the best model (Qwen2.5-VL-72B) on noisy images compared to clean ones (EM falls from 25.52% to 16.50%)
  • Severe English-first bias: Performance peaks on English and degrades sharply even for high-resource languages, becoming negligible for low-resource ones
  • Strong correlation between model scale and reasoning capability, with Qwen2.5-VL-72B achieving the highest average EM of 13.57% across all languages
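The "over 35%" figure is a relative drop, (25.52 - 16.50) / 25.52 ≈ 35.3%, not a difference in percentage points (which would be 9.02). The sketch below shows that arithmetic alongside a simple Exact Match computation. The normalization used here (lowercasing and whitespace collapsing) is an assumption; the benchmark's actual EM normalization is not described in this summary.

```python
def normalize(answer):
    """Lowercase and collapse whitespace (an assumed normalization)."""
    return " ".join(answer.strip().lower().split())


def exact_match(predictions, references):
    """Percentage of predictions that equal their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return 100.0 * hits / len(predictions)


def relative_drop(clean_score, noisy_score):
    """Relative drop, in percent, from the clean score to the noisy score."""
    return 100.0 * (clean_score - noisy_score) / clean_score


# Reported EM for the best model: 25.52% on clean images, 16.50% on noisy ones.
drop = relative_drop(25.52, 16.50)
```

Here `drop` evaluates to roughly 35.3, matching the reported figure.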
Breakthrough Assessment
9/10
Identifies a critical blind spot in VLM evaluation (visual noise + multilingualism) and provides a rigorous, large-scale benchmark to measure it. The reported failure modes are significant for deployment.