MUNIChus: Multilingual News Image Captioning Benchmark

📝 Paper Summary

News Image Captioning Multilingual Vision-Language Benchmarks

MUNIChus introduces the first large-scale multilingual news image captioning benchmark across 9 languages, revealing that instruction fine-tuning significantly outperforms few-shot prompting, though low-resource languages like Sinhala remain challenging.

Core Problem

Existing news image captioning datasets are exclusively English, while generic captioning models fail to identify specific entities (people, events) crucial for news context.

Why it matters:

Current models trained on generic data describe visual objects (e.g., 'a crowd') but miss the journalistic essence (e.g., 'Protest against policy X'), limiting utility for visually impaired users.
The lack of multilingual datasets hinders the development of news captioning systems for non-English speakers, particularly in low-resource languages like Sinhala and Urdu.

Concrete Example: For an image of a politician at a ceremony, a generic caption generates 'A crowd of people standing around each other,' whereas the correct news caption is 'Michelle O’Neill attended the Belfast ceremony alongside Deputy First Minister Emma Little-Pengelly.'

Key Novelty

MUNIChus Benchmark

Creation of the largest news image captioning dataset covering 9 languages and over 700,000 images sourced from BBC, including headlines and articles.
Comprehensive benchmarking of state-of-the-art MLLMs using both prompting (zero-shot, few-shot) and parameter-efficient fine-tuning (QLoRA) strategies.

Architecture

The prompting setup for the Zero-shot evaluation setting.

Evaluation Highlights

Fine-tuned Aya-vision-8b achieves a CIDEr score of 56.34, more than doubling the best prompting-based performance (GPT-4o random few-shot).
In high-resource settings like Hindi, fine-tuning Aya-vision-8b reaches 100.12 CIDEr, compared to 91.74 for the best prompting approach.
Traditional captioning pipelines (BLIP + translation) fail completely, achieving an average BLEU-4 of only 0.20 across all languages.

Breakthrough Assessment

8/10

Significant contribution to multilingual multimodal resources. The dataset fills a major gap (non-English news captioning) and the evaluation rigorously demonstrates the necessity of fine-tuning over prompting for this domain.

⚙️ Technical Details

Problem Definition

Setting: Generate a journalistic caption y given an image I and a news article context C in a specific target language L.

Inputs: News image, associated news article text, target language instruction

Outputs: A text caption containing specific entities and context linking the image to the article

Pipeline Flow

Input Processing: Image + Article + Instruction
Model Processing: MLLM (Prompted or Fine-tuned)
Output Generation: Generated Caption

System Modules

Vision Encoder

Encodes the input news image into visual embeddings

Model or implementation: Model-specific (e.g., SigLIP for Aya-vision, internal for GPT-4o)

Language Model

Generates the caption based on visual embeddings and article text

Model or implementation: Llama-3.2-11B-Vision-Instruct / Aya-vision-8b / GPT-4o

Modeling

Base Model: Llama-3.2-11B-Vision-Instruct and Aya-vision-8b

Training Method: Supervised Fine-Tuning (SFT) with QLoRA

Objective Functions:

Purpose: Minimize the difference between generated tokens and reference caption tokens.

Formally: Standard language modeling loss (cross-entropy) over assistant-side target tokens, with user prompt tokens masked.

Adaptation: QLoRA (rank r=64, alpha=32, dropout=0.1)

Trainable Parameters: LoRA adapters on attention and MLP projection modules (q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj)

Training Data:

145,314 training images across 9 languages
Derived from 58,663 unique BBC news articles
Filtered for images >180px and captions >3 words

Key Hyperparameters:

learning_rate: 1.5e-4
batch_size: 11 (per device)
gradient_accumulation: 16
+ 4 more
epochs: 1
weight_decay: 1e-6
max_sequence_length: 4096 tokens
precision: bf16

Compute: Not reported in the paper

Comparison to Prior Work

vs. Visual News/GoodNews: MUNIChus covers 9 languages (including Sinhala, Urdu) vs. English only.
vs. Generic Captioning (BLIP): MUNIChus requires integrating article context for named entities, whereas BLIP produces generic descriptions.

Limitations

Low-resource languages (Sinhala, Urdu) still show poor performance even with fine-tuning, suggesting data underrepresentation in pre-training.
Evaluation relies on n-gram metrics (BLEU, CIDEr) because advanced metrics (BERTScore) lack support for some target languages.
Entity retrieval metrics were not used due to poor NER performance on low-resource languages.
Few-shot prompting with visual similarity retrieval proved ineffective as visual similarity does not correlate with news context.

Reproducibility

Code: https://huggingface.co/datasets/tharindu/MUNIChus

The dataset is publicly available on HuggingFace. Code for the scraper is mentioned but not explicitly linked (likely in the repo). Hyperparameters for fine-tuning are detailed. The exact test splits (8,993 images) are defined in the dataset.

📊 Experiments & Results

Evaluation Setup

Generation of captions for held-out test set images given the image and associated news article.

Benchmarks:

MUNIChus (Multilingual News Image Captioning) [New]

Metrics:

BLEU-4
CIDEr
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Comparison of fine-tuning strategies against prompting strategies shows that fine-tuning is essential for this task.
MUNIChus (All languages)	BLEU-4	3.57	8.40	+4.83
MUNIChus (All languages)	CIDEr	36.17	56.34	+20.17
MUNIChus (All languages)	BLEU-4	0.03	8.40	+8.37
MUNIChus (Hindi)	CIDEr	91.74	100.12	+8.38
MUNIChus (Japanese)	CIDEr	21.83	92.56	+70.73

Experiment Figures

Comparison between generic image captions and news image captions.

Main Takeaways

Instruction fine-tuning provides substantial gains (2x improvement) over few-shot prompting for news image captioning, demonstrating the task requires specific domain adaptation.
Traditional image captioning models (BLIP, PaliGemma) fail completely (BLEU-4 < 0.7), proving they cannot handle the contextual requirements of news (names, events) vs generic descriptions.
Performance on low-resource languages like Sinhala remains extremely low (CIDEr ~11 vs ~100 for Hindi) across all models, indicating severe pre-training data scarcity that fine-tuning alone cannot fix.
Visual similarity retrieval for few-shot prompting does not improve performance because visually similar images often lack semantically relevant news context.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Multimodal Large Language Models (MLLMs)
Familiarity with image captioning metrics (BLEU, CIDEr)
Knowledge of Parameter-Efficient Fine-Tuning (PEFT/LoRA)

Key Terms

MLLM: Multimodal Large Language Model—AI models capable of processing and generating both text and image data.

CIDEr: Consensus-based Image Description Evaluation—a metric for image captioning that measures similarity to human consensus, weighing n-grams by TF-IDF.

BLEU-4: Bilingual Evaluation Understudy—a metric measuring the overlap of 4-word sequences (n-grams) between generated text and reference text.

QLoRA: Quantized Low-Rank Adaptation—a memory-efficient fine-tuning technique that backpropagates gradients through a frozen, quantized 4-bit pre-trained model into small low-rank adapters.

Zero-shot: Asking the model to perform a task without providing any examples in the prompt.

Few-shot: Providing a small number of examples (e.g., 3 image-caption pairs) in the prompt to guide the model.

LoRA: Low-Rank Adaptation—a technique to fine-tune large models by injecting trainable low-rank matrices into layers while freezing the main weights.

NLLB: No Language Left Behind—a state-of-the-art multilingual machine translation model.