Pelican: Correcting Hallucination in Vision-LLMs via Claim Decomposition and Program of Thought Verification

📝 Paper Summary

Visual Hallucination Detection
Multi-Modal Agentic Verification

Pelican detects and corrects visual hallucinations by decomposing claims into sub-questions, generating Python code to answer them via external tools, and sharing computation context between steps.

Core Problem

Large Vision Language Models (LVLMs) suffer from hallucinations due to limited training data, lack of precise grounding, and over-reliance on language priors.

Why it matters:

Hallucinations limit the trustworthiness and real-world applicability of LVLMs in visual instruction following tasks
Prior verification methods like Woodpecker lack precise grounding for specific object instances and struggle with contextual reasoning around multiple objects
Existing detectors often fail to identify inconsistencies in reasoning or to make adaptive corrections during the verification process

Concrete Example: Given the claim 'The disposable coffee cups are upside down on the nightstand', the LVLM that produced it may have hallucinated the cups' orientation or location. Pelican parses the claim into the entities {cups, nightstand}, verifies that they exist via detection, and generates code to check the specific relation 'upside down' rather than guessing.

Key Novelty

Pelican (Program-of-Thought for Claim Verification)

Decomposes visual claims into a chain of (predicate, question) pairs that form a computational graph
Uses Program-of-Thought prompting to generate Python code that answers sub-questions by composing external tools (VQA, detectors) with native Python operators
Introduces intermediate variables to precisely reference specific object instances and shares computation results between steps to enable adaptive corrections

Architecture

The Pelican pipeline: Claim Decomposition -> Program of Thought Verification -> Reasoning & Correction.

Evaluation Highlights

Reduces hallucination rate by ~8%-32% across various baseline LVLMs on MMHal-Bench
Achieves a 27% drop in hallucinations compared to the best previous mitigation approach (Woodpecker) on MMHal-Bench
Demonstrates consistent improvements on GAVIE and MME benchmarks, improving visual understanding accuracy

Breakthrough Assessment

7/10

Strong methodological contribution by integrating Program-of-Thought with claim verification. Significant empirical gains over previous SOTA (Woodpecker). However, because Pelican relies on off-the-shelf tools, its accuracy is bounded by the performance of the underlying detectors.

⚙️ Technical Details

Problem Definition

Setting: Visual claim verification and correction

Inputs: Image I and a text claim C (derived from initial Question q and Answer a)

Outputs: Decision d (correct/incorrect) and a rewrite r (if incorrect)

Pipeline Flow

Visual Table Construction: Detect objects -> Table T
Claim Decomposition: Parse Claim C -> Chain of (predicate, question)
PoT Verification: Generate Python code -> Execute tools -> Answer sub-questions
Reasoning & Correction: Aggregate answers -> Verify Claim -> Rewrite if needed
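The four stages above can be sketched as a chain of function calls. This is a minimal illustration under stated assumptions, not the authors' implementation: every function name, the Verdict type, and the stubbed return values are hypothetical stand-ins for the detectors and LLM calls.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Verdict:
    correct: bool             # decision d
    rewrite: Optional[str]    # rewrite r, only set when the claim is incorrect


def build_visual_table(image):
    # Stage 1 (stub): a real system would run object detectors here.
    return [{"id": 0, "label": "cup"}, {"id": 1, "label": "nightstand"}]


def decompose_claim(claim):
    # Stage 2 (stub): a real system would prompt an LLM here.
    return [("Exists(cup)", "Is there a cup in the image?"),
            ("Exists(nightstand)", "Is there a nightstand in the image?")]


def answer_sub_question(question, table, context):
    # Stage 3 (stub): a real system would generate and execute Python code.
    # `context` holds earlier results; this stub does not need it.
    labels = {row["label"] for row in table}
    return any(label in question for label in labels)


def reason_and_correct(claim, answers):
    # Stage 4 (stub): a real system would ask an LLM to reason over the answers.
    if all(answers.values()):
        return Verdict(correct=True, rewrite=None)
    return Verdict(correct=False, rewrite="<rewritten claim>")


def verify(image, claim):
    table = build_visual_table(image)
    chain = decompose_claim(claim)
    context, answers = {}, {}
    for predicate, question in chain:
        answers[predicate] = answer_sub_question(question, table, context)
        context[predicate] = answers[predicate]  # later steps can read earlier results
    return reason_and_correct(claim, answers)


result = verify(image=None, claim="The cups are on the nightstand.")
```

The point of the sketch is the data flow: the table and the running context are threaded through every sub-question before a single final decision is made.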

System Modules

Visual Table Constructor

Create a structured table of visual entities to reduce false positives

Model or implementation: YOLO + Grounding-DINO + VQA verification
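One way to picture this module is as a merge of closed-vocabulary and open-vocabulary detections, followed by a VQA check that drops false positives. The sketch below is an assumption-laden illustration: detector outputs are hard-coded, vqa_confirms is a hypothetical stub, and the 0.5 IoU threshold is an arbitrary choice. The Key Terms describe the table as a Pandas dataframe; a plain list of dicts is used here only to keep the example dependency-free.

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def vqa_confirms(label, box):
    # Hypothetical stub: pretend the VQA model rejects the "laptop" detection.
    return label != "laptop"


def build_visual_table(closed_vocab, open_vocab, iou_threshold=0.5):
    table = []
    for det in closed_vocab + open_vocab:
        # Skip a detection that overlaps an existing row with the same label.
        duplicate = any(
            row["label"] == det["label"]
            and iou(row["box"], det["box"]) >= iou_threshold
            for row in table
        )
        if duplicate or not vqa_confirms(det["label"], det["box"]):
            continue
        table.append({"id": len(table), **det})
    return table


closed_vocab = [
    {"label": "cup", "box": (10, 10, 50, 60)},
    {"label": "laptop", "box": (200, 40, 320, 120)},
]
open_vocab = [
    {"label": "cup", "box": (12, 11, 51, 62)},         # same cup, seen again
    {"label": "nightstand", "box": (0, 50, 150, 200)},
]

visual_table = build_visual_table(closed_vocab, open_vocab)
```

Each surviving row gets a stable id, which is what lets later steps refer to one specific object instance rather than to a class name.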

Claim Decomposer

Break complex claims into atomic sub-claims based on first-order predicates

Model or implementation: LLM (Prompted)
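The decomposer's output can be pictured as a small dependency graph over (predicate, question) pairs. The sketch below uses the coffee-cup claim from the Concrete Example. In Pelican an LLM produces the chain; here the node ids, question wording, dependency edges, and the "Attribute" predicate name are hand-written assumptions for illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each node maps to a (predicate, question) pair. Wording is illustrative only.
chain = {
    "q1": ("Exists(cups)", "Are there disposable coffee cups in the image?"),
    "q2": ("Exists(nightstand)", "Is there a nightstand in the image?"),
    "q3": ("Position(cups, nightstand)", "Are the cups on the nightstand?"),
    "q4": ("Attribute(cups, upside down)", "Are the cups upside down?"),
}

# Edges: which earlier answers each sub-question depends on.
depends_on = {
    "q1": set(),
    "q2": set(),
    "q3": {"q1", "q2"},
    "q4": {"q1"},
}

# A valid execution order answers every dependency before its dependents.
order = list(TopologicalSorter(depends_on).static_order())
```

Ordering the sub-questions this way means existence checks run first, so relational questions are only asked about objects that were actually found.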

PoT Verifier

Answer sub-questions via executable Python code

Model or implementation: LLM (Program-of-Thought Prompting)
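To make the idea concrete, the sketch below executes a hand-written stand-in for an LLM-generated program. The tool names detect and vqa, their signatures, and their canned outputs are assumptions, not Pelican's actual API. What it shows is the pattern: external tools composed with ordinary Python operators, with intermediate variables bound to specific instances.

```python
# Hypothetical tool stubs; a real system would call a detector and a VQA model.
def detect(label):
    boxes = {
        "cup": [(10, 10, 50, 60), (70, 12, 110, 58)],
        "nightstand": [(0, 50, 150, 200)],
    }
    return boxes.get(label, [])


def vqa(question, box):
    # Pretend only the first cup is upside down.
    return "yes" if box == (10, 10, 50, 60) else "no"


# The kind of program an LLM might generate for the sub-question
# "Are all the cups upside down?". Written by hand here for illustration.
generated_program = """
cups = detect("cup")
flipped = [c for c in cups if vqa("Is this cup upside down?", c) == "yes"]
answer = len(cups) > 0 and len(flipped) == len(cups)
"""

namespace = {"detect": detect, "vqa": vqa}
exec(generated_program, namespace)  # never exec untrusted code outside a sandbox

answer = namespace["answer"]
num_flipped = len(namespace["flipped"])
```

Because the check is a per-instance loop rather than one question about the whole image, the program can tell "some cups are upside down" apart from "all cups are upside down".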

Reasoning & Rewrite

Final decision on claim correctness and rewriting if necessary

Model or implementation: LLM (CoT Prompting)
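A sketch of how verified sub-answers might be assembled into a chain-of-thought prompt for this final step. The prompt wording and the build_reasoning_prompt helper are invented for illustration, and the LLM call itself is left out.

```python
def build_reasoning_prompt(claim, sub_answers):
    """Format the claim and verified sub-answers into a single prompt string."""
    lines = [f"Claim: {claim}", "Evidence from verification steps:"]
    for i, (predicate, question, answer) in enumerate(sub_answers, start=1):
        lines.append(f"{i}. [{predicate}] {question} -> {answer}")
    lines.append(
        "Think step by step. State whether the claim is correct. "
        "If it is incorrect, rewrite it so that it agrees with the evidence."
    )
    return "\n".join(lines)


sub_answers = [
    ("Exists(cups)", "Are there cups in the image?", "yes"),
    ("Position(cups, nightstand)", "Are the cups on the nightstand?",
     "no, they are on the desk"),
]
prompt = build_reasoning_prompt("The cups are on the nightstand.", sub_answers)
```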

Novel Architectural Elements

Computational graph decomposition of visual claims into predicate-based sub-questions
Shared computation context passing between sequential verification steps to enable adaptive correction
Intermediate variable binding for specific object instances within the generated Python code
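The shared-context idea can be illustrated with one namespace reused across sequential step programs, so a later step can read, and if needed revise, a variable bound by an earlier one. The step programs, the detect stub, and the fallback from "disposable cup" to "mug" are hypothetical. They are only one possible reading of "adaptive correction".

```python
def detect(label):
    # Hypothetical detector stub: the image contains a mug but no disposable cup.
    found = {"mug": [(10, 10, 50, 60)], "nightstand": [(0, 50, 150, 200)]}
    return found.get(label, [])


# Hand-written stand-ins for LLM-generated step programs.
steps = [
    # Step 1 binds the variable `cups` to concrete detections.
    'cups = detect("disposable cup")\nanswer = len(cups) > 0',
    # Step 2 reads `cups` from step 1 and falls back to a related label if empty.
    'if not cups:\n    cups = detect("mug")\nanswer = len(cups)',
]

context = {"detect": detect}   # one namespace shared by every step
answers = []
for program in steps:
    exec(program, context)     # never exec untrusted code outside a sandbox
    answers.append(context["answer"])
```

Step 1 reports that no disposable cup exists, and step 2 uses that result to rebind cups to the object that is actually present instead of failing on an empty variable.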

Modeling

Base Model: An LLM is used for decomposition, PoT, and reasoning. The specific model is not named in the text; the capabilities described suggest GPT-4 or a model of a similar class.

Comparison to Prior Work

vs. Woodpecker: Pelican uses Program-of-Thought to generate code rather than just VQA queries, allowing flexible Python logic (loops, conditionals) and better object grounding via variables.
vs. Self-Reflection methods: Pelican uses external tools (detectors) for grounding rather than relying solely on the model's internal knowledge.
vs. ViperGPT [not cited in paper]: ViperGPT generates code for VQA directly; Pelican applies code generation specifically to a decomposed verification chain for post-hoc correction.

Limitations

Performance depends heavily on the accuracy of underlying visual tools (YOLO, Grounding-DINO)
Latency may be higher due to sequential execution of the verification chain
Requires careful prompt engineering for the Code Generation step to handle API syntax correctly

Reproducibility

The paper text does not state whether code is available. The methodology relies on standard tools (YOLO, Grounding-DINO) and prompting strategies, both of which are described.

📊 Experiments & Results

Evaluation Setup

Post-hoc hallucination detection and correction on LVLM outputs

Benchmarks:

MMHal-Bench (Hallucination Evaluation Benchmark)
GAVIE (Hallucination Evaluation)
MME (Multimodal Evaluation Benchmark)

Metrics:

Hallucination Rate (lower is better)
Visual Understanding Accuracy
Statistical methodology: Not explicitly reported in the paper
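For reference, a hallucination rate and its reduction can be computed as below. This is a generic definition, not the benchmark's official scoring script. The summary does not say whether the reported reductions are absolute or relative, so both are shown. All judgments in the example are made up.

```python
def hallucination_rate(labels):
    """Fraction of responses judged hallucinated (labels are booleans)."""
    return sum(labels) / len(labels)


def reductions(before, after):
    """Return (absolute, relative) reduction between two rates."""
    absolute = before - after
    relative = absolute / before if before > 0 else 0.0
    return absolute, relative


# Made-up judgments for ten responses: True means "hallucinated".
baseline = [True, True, True, True, True, False, False, False, False, False]
corrected = [True, True, True, True, False, False, False, False, False, False]

rate_before = hallucination_rate(baseline)     # 5 of 10
rate_after = hallucination_rate(corrected)     # 4 of 10
absolute, relative = reductions(rate_before, rate_after)
```

With these toy numbers the absolute reduction is 10 percentage points and the relative reduction is 20%, which shows why it matters which of the two a reported figure refers to.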

Key Results

Pelican significantly reduces hallucination rates compared to baselines on standard benchmarks.
Benchmark	Metric	Baseline	This Paper	Δ
MMHal-Bench	Hallucination rate reduction (across baseline LVLMs)	Not reported in the paper	Not reported in the paper	~8%-32%
MMHal-Bench	Hallucination rate reduction (vs. Woodpecker)	Not reported in the paper	Not reported in the paper	27%

Main Takeaways

Consistent performance improvement across different baseline LVLMs on MMHal-Bench, GAVIE, and MME.
The combination of claim decomposition and Program-of-Thought verification provides more robust grounding than VQA-only approaches.
Intermediate variables for object instances allow for handling complex claims involving multiple objects.
Qualitative examples show the model successfully identifies and corrects hallucinated locations.

📚 Prerequisite Knowledge

Prerequisites

Large Vision Language Models (LVLMs)
Visual Question Answering (VQA)
Program-of-Thought (PoT) prompting
Object Detection (open/closed vocabulary)

Key Terms

Pelican: The proposed framework, which corrects hallucinations via claim decomposition and Program-of-Thought verification

LVLM: Large Vision Language Model, a model that processes both images and text to generate text outputs

Hallucination: When a model describes visual details that are incorrect or not present in the image

Program-of-Thought (PoT): A prompting strategy where the LLM generates executable code (like Python) to solve reasoning steps instead of just text

Grounding-DINO: An open-set object detector used to find objects specified in text prompts

YOLO: You Only Look Once—a fast, real-time object detection system used here for closed-vocabulary detection

First-order predicates: Logical structures used to decompose complex claims into atomic parts (e.g., Exists, Position, Count)

Visual Table: A structured representation (Pandas dataframe) of detected objects and their attributes used to ground the verification process

Woodpecker: A prior baseline method for visual claim verification that Pelican compares against