
MUNIChus: Multilingual News Image Captioning Benchmark

Yuji Chen, Alistair Plum, Hansi Hettiarachchi, Diptesh Kanojia, Saroj Basnet, Marcos Zampieri, Tharindu Ranasinghe
arXiv (2026)
MM Benchmark Factuality

📝 Paper Summary

News Image Captioning · Multilingual Vision-Language Benchmarks
MUNIChus introduces the first large-scale multilingual news image captioning benchmark across 9 languages, revealing that instruction fine-tuning significantly outperforms few-shot prompting, though low-resource languages like Sinhala remain challenging.
Core Problem
Existing news image captioning datasets are exclusively English, while generic captioning models fail to identify specific entities (people, events) crucial for news context.
Why it matters:
  • Current models trained on generic data describe visual objects (e.g., 'a crowd') but miss the journalistic essence (e.g., 'Protest against policy X'), limiting utility for visually impaired users.
  • The lack of multilingual datasets hinders the development of news captioning systems for non-English speakers, particularly in low-resource languages like Sinhala and Urdu.
Concrete Example: For an image of a politician at a ceremony, a generic captioning model generates 'A crowd of people standing around each other,' whereas the correct news caption is 'Michelle O’Neill attended the Belfast ceremony alongside Deputy First Minister Emma Little-Pengelly.'
Key Novelty
MUNIChus Benchmark
  • Creation of the largest news image captioning dataset, covering 9 languages and over 700,000 images sourced from the BBC, together with the accompanying headlines and articles.
  • Comprehensive benchmarking of state-of-the-art MLLMs using both prompting (zero-shot, few-shot) and parameter-efficient fine-tuning (QLoRA) strategies.
Architecture
Figure 2 (Conceptual): The prompting setup for the zero-shot evaluation setting.
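The difference between the zero-shot and few-shot settings can be illustrated with a minimal prompt-assembly sketch. Everything here is an assumption made for illustration: the `NewsExample` schema, the `build_prompt` helper, and the instruction wording are hypothetical and are not taken from the paper. The image itself would be passed to the model separately through whatever API the model exposes.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class NewsExample:
    """One item: article context plus an optional caption (hypothetical schema)."""
    headline: str
    article: str
    language: str
    caption: Optional[str] = None  # None for the item that is being captioned


def build_prompt(target: NewsExample, shots: Optional[List[NewsExample]] = None) -> str:
    """Assemble the text part of the prompt; with no shots this is zero-shot."""
    # Instruction wording is an assumption, not the paper's actual prompt.
    parts = [
        f"Write a news caption in {target.language} for the attached image. "
        "Name the people, places, and events that the article identifies."
    ]
    # Few-shot: prepend solved examples before the target item.
    for i, shot in enumerate(shots or [], start=1):
        parts.append(f"Example {i}\nHeadline: {shot.headline}\nCaption: {shot.caption}")
    # The target ends with an open "Caption:" slot for the model to complete.
    parts.append(f"Headline: {target.headline}\nArticle: {target.article}\nCaption:")
    return "\n\n".join(parts)
```

Calling `build_prompt(target)` gives the zero-shot form, while `build_prompt(target, shots)` adds in-context examples; how those examples are selected (for instance at random) is left outside this sketch.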
Evaluation Highlights
  • Fine-tuned Aya-vision-8b achieves a CIDEr score of 56.34, more than doubling the best prompting-based performance (GPT-4o random few-shot).
  • In high-resource settings like Hindi, fine-tuning Aya-vision-8b reaches 100.12 CIDEr, compared to 91.74 for the best prompting approach.
  • Traditional captioning pipelines (BLIP + translation) fail completely, achieving an average BLEU-4 of only 0.20 across all languages.
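To see why a generic description scores near zero on n-gram metrics such as BLEU-4, the sketch below computes a simplified, unsmoothed sentence-level BLEU-4 on whitespace tokens. It is an illustration only: the summary does not state which BLEU implementation, tokenization, or score scale the authors used, and the `bleu4` helper is a hypothetical name. This version returns values in [0, 1].

```python
import math
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu4(candidate: str, reference: str) -> float:
    """Unsmoothed sentence-level BLEU-4 against a single reference."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, 5):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        # Clipped matches: a candidate n-gram counts at most as often as it
        # appears in the reference.
        overlap = sum((cand_counts & ref_counts).values())
        total = sum(cand_counts.values())
        if overlap == 0 or total == 0:
            return 0.0  # one zero precision zeroes the geometric mean
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty: candidates shorter than the reference are penalized.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(log_precisions) / 4)


generic = "A crowd of people standing around each other"
news = ("Michelle O'Neill attended the Belfast ceremony alongside "
        "Deputy First Minister Emma Little-Pengelly")
print(bleu4(generic, news))  # 0.0: the two captions share no words at all
print(bleu4(news, news))     # 1.0: identical caption
```

Because the generic caption names none of the entities in the reference, it shares no n-grams with it, which is consistent with the very low BLEU-4 reported for the BLIP + translation pipeline.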
Breakthrough Assessment
8/10
Significant contribution to multilingual multimodal resources. The dataset fills a major gap (non-English news captioning) and the evaluation rigorously demonstrates the necessity of fine-tuning over prompting for this domain.