Benchmarking Large Vision-Language Models via Directed Scene Graph for Comprehensive Image Captioning

📝 Paper Summary

Vision-Language Benchmarks, Image Captioning Evaluation, Hallucination Detection

CompreCap is a benchmark for evaluating detailed image captions using manually annotated directed scene graphs that bind attributes to objects and define directional relationships, improving upon isolated word matching.

Core Problem

Existing captioning benchmarks (like MSCOCO) use brief captions that fail to evaluate the comprehensive details generated by modern Large Vision-Language Models (LVLMs), while current detailed benchmarks treat objects, attributes, and relations as isolated words.

Why it matters:

Brief captions (avg ~10 words) cannot assess the rich visual information LVLMs are capable of generating
Current evaluation methods like DetailCaps parse isolated words, meaning a model can assign the wrong color to an object or reverse a relationship (e.g., 'A left of B' vs 'B left of A') and still get a high score
Hallucination benchmarks (POPE, FGHE) focus only on object existence, ignoring attribute descriptions and relationships

Concrete Example: If an image shows a 'red car' and a 'blue truck', a caption saying 'blue car and red truck' would score highly on bag-of-words metrics because all the correct words exist, despite the attributes being mismatched. CompreCap fixes this by evaluating the structural binding of attributes to specific objects.
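A minimal sketch of the failure mode described above. Both scoring functions are hypothetical simplifications written for illustration; neither is CompreCap's actual metric nor any existing benchmark's implementation.

```python
# Toy illustration: bag-of-words overlap vs. attribute-object binding.
# Both functions are hypothetical simplifications, not CompreCap's metric.

def bag_of_words_score(reference: str, candidate: str) -> float:
    """Fraction of reference words that appear anywhere in the candidate."""
    ref_words = set(reference.lower().split())
    cand_words = set(candidate.lower().split())
    return len(ref_words & cand_words) / len(ref_words)


def bound_pair_score(reference_pairs: set, candidate_pairs: set) -> float:
    """Fraction of (attribute, object) pairs reproduced with the same binding."""
    return len(reference_pairs & candidate_pairs) / len(reference_pairs)


reference = "red car and blue truck"
swapped = "blue car and red truck"

# The same content expressed as explicit attribute-object bindings.
reference_pairs = {("red", "car"), ("blue", "truck")}
swapped_pairs = {("blue", "car"), ("red", "truck")}

bow = bag_of_words_score(reference, swapped)
bound = bound_pair_score(reference_pairs, swapped_pairs)
print(f"bag-of-words: {bow:.2f}, bound pairs: {bound:.2f}")
```

The swapped caption keeps every reference word, so the word-level score is perfect, while no attribute-object pair survives the swap.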

Key Novelty

Directed Scene Graph Evaluation for Detailed Captions

Constructs a directed scene graph where attributes are explicitly bound to objects and relationships are directional (subject-verb-object)
Evaluates generated captions by decomposing them into sub-captions and matching them hierarchically against the ground-truth scene graph using an LLM evaluator
Includes a specialized Visual Question Answering (VQA) task for 'tiny objects' (<5% image area) to test fine-grained perception
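One way to picture the directed scene graph is as a small data structure in which attributes live on object nodes and relations are ordered triples. The class and method names below (ObjectNode, DirectedSceneGraph, has_relation, and so on) are assumptions made for this sketch and do not come from the CompreCap annotation format.

```python
from dataclasses import dataclass, field
from typing import Dict, Set, Tuple


@dataclass
class ObjectNode:
    """An annotated object with the attributes bound to it (hypothetical schema)."""
    name: str
    attributes: Set[str] = field(default_factory=set)


@dataclass
class DirectedSceneGraph:
    objects: Dict[str, ObjectNode] = field(default_factory=dict)
    # Each relation is an ordered (subject, predicate, object) triple.
    relations: Set[Tuple[str, str, str]] = field(default_factory=set)

    def add_object(self, name: str, *attributes: str) -> None:
        self.objects[name] = ObjectNode(name, set(attributes))

    def add_relation(self, subject: str, predicate: str, obj: str) -> None:
        if subject not in self.objects or obj not in self.objects:
            raise KeyError("both endpoints must be annotated objects")
        self.relations.add((subject, predicate, obj))

    def has_attribute(self, name: str, attribute: str) -> bool:
        node = self.objects.get(name)
        return node is not None and attribute in node.attributes

    def has_relation(self, subject: str, predicate: str, obj: str) -> bool:
        # Order matters: (A, "left of", B) does not imply (B, "left of", A).
        return (subject, predicate, obj) in self.relations


graph = DirectedSceneGraph()
graph.add_object("car", "red")
graph.add_object("truck", "blue")
graph.add_relation("car", "left of", "truck")
```

Because relations are stored as ordered triples, reversing subject and object produces a lookup miss rather than a match.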

Architecture

The evaluation pipeline showing how a generated caption is parsed and matched against the ground-truth scene graph.

Evaluation Highlights

Human annotators outperform all 10 evaluated LVLMs (62.99 unified score vs 60.05 for GPT-4o), indicating that the benchmark remains challenging
GPT-4o achieves the highest unified score among models (60.05), with LLaVA-Next-34B close behind (58.85); LLaVA-Next-34B slightly outperforms GPT-4o on object/attribute metrics but lags on relationships
Proposed metric achieves stronger consistency with human judgment than traditional metrics like SPICE or CLIPScore

Breakthrough Assessment

8/10

Addresses a critical gap in evaluating modern LVLMs (detailed captioning) with a rigorous, structure-aware methodology. The use of directed scene graphs to prevent attribute-swapping errors is a significant methodological improvement.

⚙️ Technical Details

Problem Definition

Setting: Evaluation of detailed image captions generated by LVLMs against ground-truth visual content

Inputs: Image I and a generated detailed caption C

Outputs: A unified quality score S_unified derived from object coverage, attribute accuracy, and relationship accuracy

Pipeline Flow

Caption Decomposition (Split caption into sub-sentences)
Noun Extraction (Identify candidate objects)
Object Matching (Match candidates to GT objects via embedding similarity)
Attribute & Relation Scoring (LLM-based evaluation of specific sub-captions against GT annotations)

System Modules

Caption Decomposer

Split long captions into sub-captions based on sentence separators and extract nouns

Model or implementation: spaCy
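The decomposition step can be sketched with the standard library alone. This is a stand-in, not the spaCy pipeline the benchmark uses: the regex sentence splitter and the vocabulary lookup (with its example vocabulary) are assumptions chosen so the sketch runs without external models.

```python
import re
from typing import List, Set


def split_into_sub_captions(caption: str) -> List[str]:
    """Split a caption after sentence-ending punctuation (., !, ?, ;)."""
    parts = re.split(r"(?<=[.!?;])\s+", caption.strip())
    return [p.strip() for p in parts if p.strip()]


def extract_candidate_nouns(sub_caption: str, vocabulary: Set[str]) -> List[str]:
    """Stand-in for POS-based noun extraction: keep tokens found in a vocabulary."""
    tokens = re.findall(r"[a-z]+", sub_caption.lower())
    return [t for t in tokens if t in vocabulary]


# Hypothetical caption and vocabulary, for illustration only.
caption = "A red car is parked near a tree. A dog sits on the grass!"
vocabulary = {"car", "tree", "dog", "grass"}

sub_captions = split_into_sub_captions(caption)
nouns = [extract_candidate_nouns(s, vocabulary) for s in sub_captions]
print(sub_captions)
print(nouns)
```

A real part-of-speech tagger would also find nouns outside any fixed vocabulary, which this lookup cannot do.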

Object Matcher (Evaluation)

Calculate semantic similarity between extracted nouns and ground-truth object categories

Model or implementation: Sentence BERT
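The matching step reduces to a nearest-neighbour search under cosine similarity. In the sketch below the three-dimensional vectors are made up and the 0.8 threshold is an assumption; in the benchmark the embeddings would come from Sentence BERT, and the source does not state the threshold it uses.

```python
import math
from typing import Dict, List


def cosine_similarity(u: List[float], v: List[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def match_nouns_to_objects(
    noun_vectors: Dict[str, List[float]],
    object_vectors: Dict[str, List[float]],
    threshold: float = 0.8,  # hypothetical cut-off, not taken from the paper
) -> Dict[str, str]:
    """Map each noun to its most similar ground-truth object, if similar enough."""
    matches = {}
    for noun, noun_vec in noun_vectors.items():
        best_label, best_sim = None, threshold
        for label, obj_vec in object_vectors.items():
            sim = cosine_similarity(noun_vec, obj_vec)
            if sim >= best_sim:
                best_label, best_sim = label, sim
        if best_label is not None:
            matches[noun] = best_label
    return matches


# Toy embeddings standing in for sentence-embedding vectors.
noun_vectors = {"automobile": [0.9, 0.1, 0.0], "sky": [0.0, 0.0, 1.0]}
object_vectors = {"car": [1.0, 0.0, 0.0], "dog": [0.0, 1.0, 0.0]}

matches = match_nouns_to_objects(noun_vectors, object_vectors)
print(matches)
```

Nouns that fall below the threshold for every annotated category stay unmatched, which is how a caption noun with no ground-truth counterpart would be set aside.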

Attribute/Relation Evaluator (Evaluation)

Score the accuracy of attributes and relationships for matched objects

Model or implementation: Llama-3-8B-Instruct (as Judge)
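An LLM-as-judge step typically needs two pieces of glue: a prompt that pairs a sub-caption with the ground-truth annotation, and a parser that turns the reply into a bounded score. The prompt wording and both function names below are assumptions for illustration; the source does not give the actual prompt. The call to the judge model itself is omitted.

```python
import re
from typing import List, Optional


def build_attribute_prompt(
    object_name: str, gt_attributes: List[str], sub_caption: str
) -> str:
    """Assemble a grading prompt (hypothetical wording)."""
    attributes = ", ".join(gt_attributes)
    return (
        "You are grading an image caption.\n"
        f"Object: {object_name}\n"
        f"Annotated attributes: {attributes}\n"
        f"Caption sentence: {sub_caption}\n"
        "Rate from 0 to 5 how accurately the sentence describes the object's "
        "attributes. Reply with the number only."
    )


def parse_judge_score(
    reply: str, low: float = 0.0, high: float = 5.0
) -> Optional[float]:
    """Take the first non-negative number in the reply and clamp it to [low, high]."""
    match = re.search(r"\d+(?:\.\d+)?", reply)
    if match is None:
        return None
    return min(max(float(match.group()), low), high)


prompt = build_attribute_prompt("car", ["red", "parked"], "A red car is parked.")
print(prompt)
print(parse_judge_score("Score: 4"))
```

Clamping and the None return keep a malformed judge reply from silently skewing the 0-5 attribute and relation scores.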

Novel Architectural Elements

Hierarchical decomposition of captions mapped to a directed scene graph (Objects -> Attributes/Relations) rather than bag-of-words matching

Modeling

Base Model: Llama-3-8B-Instruct (used as the evaluator)

Comparison to Prior Work

vs. MSCOCO: CompreCap provides dense annotations for detailed captions (~172 words vs 10 words)
vs. DetailCaps: CompreCap binds attributes to specific objects and enforces directional relationships, preventing 'bag-of-words' scoring errors where attributes are swapped
vs. POPE: CompreCap evaluates attributes and relations, not just object presence

Limitations

Dependency on Llama-3 as a judge; biases in the evaluator model could affect scores
Manual annotation process is expensive, limiting dataset size (560 images) compared to automated benchmarks
Focuses on 'common objects' defined in a specific vocabulary, potentially missing open-vocabulary concepts

Reproducibility

Code: https://github.com/wangxiao5791509/CompreCap

The code is publicly available at the repository above. Dataset annotations (CompreCap) and evaluation scripts are released. Evaluation uses the open-source Llama-3-8B-Instruct.

📊 Experiments & Results

Evaluation Setup

Evaluated 10 LVLMs on detailed caption generation and fine-grained VQA using the CompreCap dataset (560 images derived from COCO panoptic segmentation).

Benchmarks:

CompreCap Captioning (Detailed Image Captioning) [New]
CompreQA-P / CompreQA-Cap (Fine-grained Visual Question Answering (Tiny Objects)) [New]

Metrics:

S_object (Object Coverage %)
S_attribute (Attribute Score 0-5)
S_relation (Relation Score 0-5)
S_unified (Weighted Average 0-100)
S-Cov (Pixel Coverage %)
Statistical methodology: Reported mean and standard deviation across 3 evaluation runs.
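To make the scales above concrete, the sketch below rescales the 0-5 scores to 0-100 and combines them with the object-coverage percentage. The equal weights, the restriction to three components, and all input numbers are assumptions for illustration; the source does not state the actual weighting. The last lines show how a mean and standard deviation over three runs can be computed.

```python
from statistics import mean, stdev
from typing import Sequence


def unified_score(
    s_object_pct: float,
    s_attribute: float,
    s_relation: float,
    weights: Sequence[float] = (1.0, 1.0, 1.0),  # hypothetical equal weights
) -> float:
    """Weighted average on a 0-100 scale; 0-5 scores are rescaled by 20."""
    components = (s_object_pct, s_attribute * 20.0, s_relation * 20.0)
    return sum(w * c for w, c in zip(weights, components)) / sum(weights)


# Made-up inputs: 70% object coverage, attribute 2.5/5, relation 3.0/5.
example = unified_score(70.0, 2.5, 3.0)
print(example)

# Made-up unified scores from three evaluation runs.
runs = [59.8, 60.0, 60.2]
print(f"{mean(runs):.2f} +/- {stdev(runs):.2f}")
```

Note that statistics.stdev is the sample standard deviation (n - 1 in the denominator); the source does not say which variant the authors report.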

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Performance of LVLMs on detailed captioning (Unified Score). LLaVA-Next-34B and GPT-4o lead, but humans still outperform all models.
CompreCap Captioning	S_unified	62.99 (Human)	60.05 (GPT-4o)	-2.94
CompreCap Captioning	S_unified	60.05 (GPT-4o)	58.85 (LLaVA-Next-34B)	-1.20
CompreCap Captioning	S_unified	50.32	58.48	+8.16
Evaluation of fine-grained perception (tiny objects < 5% pixels). InternVL excels here.
CompreQA-P (Presence)	Accuracy (%)	35.28	91.67	+56.39
CompreQA-Cap (Caption Selection)	Accuracy (%)	96.83	94.33	-2.50
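The Δ column is the second value minus the first. A short check, using only the numbers in the rows above, confirms that each reported difference matches this arithmetic.

```python
# (benchmark, first value, second value, reported delta), copied from the table.
rows = [
    ("CompreCap Captioning", 62.99, 60.05, -2.94),
    ("CompreCap Captioning", 60.05, 58.85, -1.20),
    ("CompreCap Captioning", 50.32, 58.48, 8.16),
    ("CompreQA-P (Presence)", 35.28, 91.67, 56.39),
    ("CompreQA-Cap (Caption Selection)", 96.83, 94.33, -2.50),
]

mismatches = []
for name, first, second, reported in rows:
    computed = second - first
    # Compare with a tolerance to avoid floating-point noise.
    if abs(computed - reported) >= 0.005:
        mismatches.append((name, computed, reported))

print("all deltas consistent" if not mismatches else mismatches)
```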

Experiment Figures

Comparison of different evaluation methods: Direct scoring by Llama-3 vs. CompreCap's structured evaluation vs. Human evaluation.

Distribution of missed objects by size (pixel coverage percentage).

Main Takeaways

Caption length does not equal quality; MiniGPT4-v2 generates very long captions (350 words) but scores poorly (42.28) due to hallucinations and inaccuracy.
Most LVLMs struggle with tiny objects (<5% of image), often ignoring them completely in captions or failing presence tests.
The unified metric (S_unified) aligns better with human judgment than traditional n-gram metrics or CLIPScore, which fail to capture structural details in long texts.
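The "tiny object" criterion from the takeaways above (under 5% of the image) can be checked directly from a binary segmentation mask. The nested-list mask representation and the function names are assumptions for this sketch; only the 5% threshold comes from the text.

```python
from typing import List


def area_fraction(mask: List[List[int]]) -> float:
    """Share of pixels in the image that belong to the object (mask value != 0)."""
    total = sum(len(row) for row in mask)
    covered = sum(1 for row in mask for value in row if value)
    return covered / total


def is_tiny(mask: List[List[int]], threshold: float = 0.05) -> bool:
    """True when the object covers less than `threshold` of the image area."""
    return area_fraction(mask) < threshold


# A 10x10 image where the object covers a 2x2 patch (4% of the area).
small_mask = [[0] * 10 for _ in range(10)]
for r in range(2):
    for c in range(2):
        small_mask[r][c] = 1

# A 10x10 image where the object covers a 5x5 patch (25% of the area).
large_mask = [[1 if r < 5 and c < 5 else 0 for c in range(10)] for r in range(10)]

print(area_fraction(small_mask), is_tiny(small_mask))
print(area_fraction(large_mask), is_tiny(large_mask))
```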

📚 Prerequisite Knowledge

Prerequisites

Image Captioning
Scene Graphs
Large Vision-Language Models (LVLMs)
Visual Question Answering (VQA)

Key Terms

Scene Graph: A structured representation of an image where nodes are objects/attributes and edges are relationships

LVLM: Large Vision-Language Model—a model capable of processing both images and text to generate textual outputs

VQA: Visual Question Answering—a task where a model answers natural language questions about an image

Panoptic Segmentation: A computer vision task that unifies semantic segmentation (classifying pixels) and instance segmentation (detecting distinct objects)

IoU: Intersection over Union—a metric to measure the overlap between a predicted segmentation mask and a ground truth mask

Hallucination: When a model generates text descriptions of objects, attributes, or relationships that do not exist in the source image

Sentence BERT: A modification of the BERT network that uses siamese networks to derive semantically meaningful sentence embeddings

spaCy: An open-source software library for advanced Natural Language Processing, used here for noun extraction
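As a worked example of the IoU term defined above, the sketch below computes intersection over union for two masks represented as sets of (row, column) pixel coordinates. The set representation and the function name are assumptions for illustration.

```python
from typing import Set, Tuple

Pixel = Tuple[int, int]


def mask_iou(predicted: Set[Pixel], ground_truth: Set[Pixel]) -> float:
    """Intersection over Union of two pixel sets; 0.0 when both are empty."""
    union = predicted | ground_truth
    if not union:
        return 0.0
    return len(predicted & ground_truth) / len(union)


# Two 2x2 masks that overlap in one column of two pixels.
predicted = {(0, 0), (0, 1), (1, 0), (1, 1)}
ground_truth = {(0, 1), (1, 1), (0, 2), (1, 2)}

iou = mask_iou(predicted, ground_truth)
print(f"{iou:.3f}")
```

Here the intersection has 2 pixels and the union has 6, so the IoU is one third.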