Lost in Translation and Noise: A Deep Dive into the Failure Modes of VLMs on Real-World Tables

📝 Paper Summary

Vision-Language Models (VLMs) · Table Question Answering · Robustness Evaluation

MirageTVQA is a multilingual, visually noisy benchmark showing that state-of-the-art VLMs lose over 35% of their Exact Match score (in relative terms) on realistic document scans and fail to transfer reasoning skills beyond English.

Core Problem

Current VLM benchmarks for tables utilize digitally perfect, monolingual images, failing to capture the visual noise (scans, blur) and linguistic diversity of real-world documents.

Why it matters:

Real-world workflows involve scanned, imperfect documents where OCR (Optical Character Recognition) errors propagate, requiring robust end-to-end visual reasoning
Existing benchmarks like FinQA are text-based, ignoring visual layout, while multimodal benchmarks like MMTab are clean and English-centric
There is a gap in understanding how well reasoning capabilities transfer to low-resource languages in visual contexts

Concrete Example: A VLM might correctly answer a question from a clean digital PNG of a financial table. Applying Gaussian blur or 'scan lines' (simulating a real scanned receipt) to the same image can make it fail, even though the text remains legible to humans. In aggregate, the best model's EM falls from 25.52% on clean images to 16.50% on noisy ones.

Key Novelty

MirageTVQA Benchmark

Constructs a large-scale dataset (60,000 QA pairs) across 24 languages using a translate-refine-filter pipeline with back-translation verification
Introduces a visually-rich rendering pipeline that applies realistic noise (blur, skew, compression, scan artifacts) to simulate real-world document degradation
Evaluates VLMs on two specific axes simultaneously: visual robustness against noise and multilingual reasoning transfer
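The noise-injection idea can be illustrated with a small, dependency-free sketch. It is not the benchmark's rendering pipeline: the function names (`box_blur`, `scan_lines`, `salt_and_pepper`, `degrade`), the parameter values, and the list-of-lists image format are all assumptions chosen for illustration, and a 3x3 mean filter stands in for Gaussian blur.

```python
import random
from typing import List

Image = List[List[int]]  # grayscale pixels in 0..255, row-major (assumed toy format)

def box_blur(img: Image) -> Image:
    """3x3 mean filter, a crude stand-in for Gaussian blur."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            vals = [img[j][i]
                    for j in range(max(0, y - 1), min(h, y + 2))
                    for i in range(max(0, x - 1), min(w, x + 2))]
            out[y][x] = sum(vals) // len(vals)
    return out

def scan_lines(img: Image, period: int = 4, darken: int = 40) -> Image:
    """Darken every `period`-th row to mimic scanner banding."""
    return [[max(0, p - darken) for p in row] if y % period == 0 else list(row)
            for y, row in enumerate(img)]

def salt_and_pepper(img: Image, prob: float, rng: random.Random) -> Image:
    """Replace each pixel with pure black or white with probability `prob`."""
    return [[rng.choice((0, 255)) if rng.random() < prob else p for p in row]
            for row in img]

def degrade(img: Image, seed: int = 0) -> Image:
    """Chain the three degradations; a fixed seed keeps the output reproducible."""
    rng = random.Random(seed)
    return salt_and_pepper(scan_lines(box_blur(img)), prob=0.02, rng=rng)

# Toy 8x8 "table": a white page with one black vertical rule at column 3.
clean = [[0 if x == 3 else 255 for x in range(8)] for _ in range(8)]
noisy = degrade(clean)
```

Seeding the random generator means the same clean image always yields the same noisy counterpart, which is what allows a like-for-like clean vs. noisy comparison.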

Evaluation Highlights

Over 35% relative drop in EM for the best model (Qwen2.5-VL-72B) when processing noisy images compared to clean ones (25.52% vs 16.50% EM, an absolute drop of 9.02 points)
Severe English-first bias: Performance peaks on English and degrades sharply even for high-resource languages, becoming negligible for low-resource ones
Strong correlation between model scale and reasoning capability, with Qwen2.5-VL-72B achieving the highest average EM of 13.57% across all languages
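The "over 35%" figure is a relative drop, not an absolute one. A short check using only the two EM values quoted above makes the distinction explicit; the helper name `relative_drop` is introduced here for illustration.

```python
def relative_drop(clean: float, noisy: float) -> float:
    """Return the drop from clean to noisy as a fraction of the clean score."""
    if clean <= 0:
        raise ValueError("clean score must be positive")
    return (clean - noisy) / clean

clean_em = 25.52  # EM (%) on clean images, as quoted in this summary
noisy_em = 16.50  # EM (%) on noisy images, as quoted in this summary

absolute = clean_em - noisy_em
relative = relative_drop(clean_em, noisy_em)

print(f"absolute drop: {absolute:.2f} points")  # absolute drop: 9.02 points
print(f"relative drop: {relative:.1%}")         # relative drop: 35.3%
```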

Breakthrough Assessment

9/10

Identifies a critical blind spot in VLM evaluation (visual noise + multilingualism) and provides a rigorous, large-scale benchmark to measure it. The reported failure modes are significant for deployment.

⚙️ Technical Details

Problem Definition

Setting: Visual Question Answering (VQA) over tabular data in multiple languages and visual conditions

Inputs: Table image I (clean or noisy) and natural language question q

Outputs: Exact answer a (text or value) derived from the table

Pipeline Flow

Input Image (Clean or Noisy Table)
Vision-Language Model (End-to-End Processing)
Answer Generation
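The three-step flow above amounts to a simple evaluation loop. The sketch below assumes a hypothetical `Example` record and treats the model as any callable mapping an image path and a question to an answer string; neither is taken from the benchmark's code, and the string comparison is a bare-bones placeholder for the paper's scoring.

```python
from dataclasses import dataclass
from typing import Callable, Iterable

@dataclass(frozen=True)
class Example:
    image_path: str  # rendered table image I (clean or noisy)
    question: str    # natural-language question q
    answer: str      # gold answer a
    language: str
    condition: str   # "clean" or "noisy"

# Hypothetical model interface: (image_path, question) -> predicted answer.
VLM = Callable[[str, str], str]

def run_benchmark(model: VLM, examples: Iterable[Example]) -> float:
    """Feed each image/question pair to the model and return EM in percent."""
    total = correct = 0
    for ex in examples:
        prediction = model(ex.image_path, ex.question)
        correct += int(prediction.strip() == ex.answer.strip())
        total += 1
    return 100.0 * correct / total if total else 0.0

def dummy_model(image_path: str, question: str) -> str:
    """Placeholder that ignores its inputs; a real VLM call would go here."""
    return "42"

examples = [
    Example("t1_clean.png", "What is the total?", "42", "en", "clean"),
    Example("t1_noisy.png", "How many rows are there?", "7", "en", "noisy"),
]
score = run_benchmark(dummy_model, examples)
```

Keeping the model behind a plain callable means the same loop can score any VLM without an intermediate OCR step, matching the end-to-end setting described here.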

System Modules

Vision-Language Model

Processes the table image and the textual question to generate an answer directly

Model or implementation: Evaluated models include Qwen2.5-VL-72B, among others

Modeling

Base Model: Qwen2.5-VL-72B (primary evaluated model)

Training Method: Not applicable — this is a benchmark paper evaluating pre-trained models

Training Data (benchmark construction; no models were trained):

Source: 3000 English tables from WikiSQL, FinQA, arXiv, GitHub
Selection: Filtered to 250 seed tables based on cell word count
Translation: Qwen3-32B translation + Gemini 2.5 Pro refinement + Back-translation filtering
Visuals: HTML rendering + 40 CSS themes + imgaug noise injection
QA: Human seed -> Gemini expansion -> Human validation
Total: roughly 60,000 QA pairs (244 tables × 24 languages × about 10 QA pairs per table and language)
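The back-translation filtering step can be sketched as follows. This is a minimal illustration and not the paper's implementation: the similarity measure (token-level `difflib.SequenceMatcher` ratio), the 0.8 threshold, and the `back_translate` callable are all assumptions, and the paper's actual metric and cut-off are not given in this summary.

```python
from difflib import SequenceMatcher
from typing import Callable

def similarity(a: str, b: str) -> float:
    """Token-level similarity in [0, 1]; an assumed stand-in for the real metric."""
    return SequenceMatcher(None, a.lower().split(), b.lower().split()).ratio()

def keep_translation(source: str,
                     translated: str,
                     back_translate: Callable[[str], str],
                     threshold: float = 0.8) -> bool:
    """Keep a translation only if its back-translation stays close to the source."""
    return similarity(source, back_translate(translated)) >= threshold

# Toy back-translators standing in for a real translation model.
def faithful(text: str) -> str:
    return "Total revenue in 2020"

def drifting(text: str) -> str:
    return "Revenue of the company"

source = "Total revenue in 2020"
translated = "Ingresos totales en 2020"

kept = keep_translation(source, translated, faithful)
dropped = not keep_translation(source, translated, drifting)
```

A translation whose round trip drifts away from the source cell text is discarded, so errors introduced during translation are less likely to end up in the benchmark.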

Compute: Not reported in the paper

Comparison to Prior Work

vs. WikiTableQuestions/FinQA: MirageTVQA tests visual reasoning on images, not linearized text/HTML
vs. MMTab/MTabVQA: MirageTVQA introduces realistic visual noise (blur, scan artifacts) and covers 24 languages, whereas others are clean and English-only
vs. M3TQA: MirageTVQA uses image inputs, whereas M3TQA focuses on text-based table representations for multilingual tasks

Limitations

Lack of interpretability methods to explain specifically why noise causes degradation
Evaluation scope limited to 24 languages and open-source models; proprietary models not tested
Analysis does not suggest specific methods to reduce the observed degradation

Reproducibility

Code: https://github.com/anshulsc/MirageTVQA

Dataset and code are publicly available at https://github.com/anshulsc/MirageTVQA. Generation prompts for translation and QA creation are provided in the paper's Appendix. Evaluation scripts appear to be included in the repository, though this is not stated explicitly.

📊 Experiments & Results

Evaluation Setup

Visual Question Answering on tables with varying visual quality (clean vs. noisy) and languages

Benchmarks:

MirageTVQA (Multilingual Visual Table QA) [New]

Metrics:

Exact Match (EM)
Statistical methodology: Not explicitly reported in the paper
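Exact Match can be sketched as below. The normalization applied before comparison (Unicode NFKC, case folding, whitespace collapsing) is an assumption made for this example; the summary does not say which normalization, if any, the paper uses.

```python
import unicodedata
from typing import Sequence

def normalize(text: str) -> str:
    """Assumed normalization: NFKC, case-fold, and collapse whitespace."""
    text = unicodedata.normalize("NFKC", text)
    return " ".join(text.casefold().split())

def exact_match(prediction: str, gold: str) -> bool:
    """True only if the normalized prediction equals the normalized gold answer."""
    return normalize(prediction) == normalize(gold)

def em_score(predictions: Sequence[str], golds: Sequence[str]) -> float:
    """Percentage of predictions that exactly match their gold answers."""
    if len(predictions) != len(golds):
        raise ValueError("predictions and golds must have the same length")
    if not golds:
        return 0.0
    hits = sum(exact_match(p, g) for p, g in zip(predictions, golds))
    return 100.0 * hits / len(golds)

print(exact_match("  Paris ", "paris"))  # True
print(exact_match("1,200", "1200"))      # False: no partial credit
print(em_score(["a", "b", "c", "d"], ["a", "b", "x", "y"]))  # 50.0
```

Because EM gives no partial credit, a single misread digit or stray separator counts as a full miss, which is one reason the metric is sensitive to visual noise.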

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Baseline performance on clean images establishes the capabilities of state-of-the-art models before noise is introduced.
MirageTVQA (Clean, English)	Exact Match (EM)	Not reported in the paper	25.52	Not reported in the paper
Impact of visual noise: Comparing performance on clean vs. noisy versions of the same tables reveals severe brittleness.
MirageTVQA (English, clean vs. noisy)	Exact Match (EM)	25.52 (clean)	16.50 (noisy)	-9.02 points (about -35% relative)
Multilingual performance aggregation shows the overall difficulty of the benchmark across 24 languages.
MirageTVQA (All Languages)	Average Exact Match (EM)	Not reported in the paper	13.57	Not reported in the paper
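The per-language and per-condition numbers above can be derived from per-example correctness records roughly as follows. The records here are made-up toy data and not results from the paper, and treating the "average across languages" as an unweighted mean over languages is an assumption.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List

def em_by_group(records: List[dict], key: str) -> Dict[str, float]:
    """Group records by `key` and return EM (percent) for each group."""
    hits = defaultdict(list)
    for record in records:
        hits[record[key]].append(1.0 if record["correct"] else 0.0)
    return {group: 100.0 * mean(values) for group, values in hits.items()}

# Toy records for illustration only; these are not results from the paper.
records = [
    {"language": "en", "condition": "clean", "correct": True},
    {"language": "en", "condition": "noisy", "correct": False},
    {"language": "hi", "condition": "clean", "correct": False},
    {"language": "hi", "condition": "noisy", "correct": False},
]

per_language = em_by_group(records, "language")
per_condition = em_by_group(records, "condition")
# Unweighted mean over languages, so each language counts equally.
macro_average = mean(list(per_language.values()))
```

Reporting both groupings separates the two axes the benchmark targets: the clean vs. noisy gap measures visual robustness, while the spread across languages measures multilingual transfer.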

Experiment Figures

Radar charts (polygons) showing model performance across different languages

Main Takeaways

Visual noise causes a severe relative performance drop (>35%) even in state-of-the-art models, indicating that performance on clean synthetic data does not transfer to realistic document scenarios.
Models exhibit a strong English-first bias; reasoning capabilities do not transfer effectively to other languages, even high-resource ones.
Model scale correlates with reasoning capability on clean data, but even the largest models (72B) are brittle when faced with visual degradation.
Current VLMs treat visual and linguistic challenges separately; there is a failure to generalize complex, visually grounded reasoning to non-English contexts.

📚 Prerequisite Knowledge

Prerequisites

Vision-Language Models (VLMs)
Optical Character Recognition (OCR)
Evaluation metrics for QA (Exact Match)

Key Terms

VLMs: Vision-Language Models—AI models capable of processing both images and text to reason about visual content

OCR: Optical Character Recognition—technology that converts different types of documents, such as scanned paper documents, PDF files or images captured by a digital camera, into editable and searchable data

EM: Exact Match—an evaluation metric that counts a prediction as correct only if it is identical to the ground truth answer

BLEU: Bilingual Evaluation Understudy—a metric for evaluating the quality of text which has been machine-translated from one natural language to another

Visual Noise: Distortions applied to images to mimic real-world imperfections, such as Gaussian blur, salt-and-pepper noise, skew, and JPEG compression

Qwen2.5-VL: A specific family of large Vision-Language Models developed by Alibaba Cloud

Back-translation: Translating a translated text back to the original language to verify accuracy by comparing it with the original source