GHOST: Hallucination-Inducing Image Generation for Multimodal LLMs

📝 Paper Summary

Multimodal Large Language Models (MLLMs)
Object Hallucination

GHOST automatically generates high-quality images that induce object hallucinations in Multimodal LLMs by optimizing CLIP embeddings to mislead the target model while visually preserving object absence.

Core Problem

Existing evaluations of object hallucination in MLLMs rely on static benchmarks and fixed scenarios, failing to uncover model-specific blind spots or unanticipated vulnerabilities.

Why it matters:

Hallucinations in safety-sensitive applications (e.g., autonomous agents) pose significant reliability risks.
Static benchmarks constrain analysis to known scenarios, missing deeper structural failure modes.
Prior generative methods are either too slow/resource-intensive or lack direct feedback from the target model to find specific weaknesses.

Concrete Example: In an image of a banana on a plate, MLLMs correctly state no knife is present. GHOST modifies the banana's stem to subtly resemble a knife edge; the MLLM then hallucinates a knife, even though humans confirm no knife exists.

Key Novelty

Generating Hallucinations via Optimizing Stealth Tokens (GHOST)

Decouples optimization from generation by training a mapper between CLIP embeddings and the MLLM's vision encoder, allowing efficient feedback without full backpropagation through the diffusion model.
Optimizes a CLIP embedding to maximize the MLLM's probability of answering 'Yes' to 'Do you see [object]?' while simultaneously penalizing semantic similarity to the object to prevent actual insertion.
Uses the optimized embedding to guide a diffusion model (starting from a noisy version of the original image) to generate natural-looking adversarial examples.
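
The optimization objective in the second point can be written compactly. This is a sketch under assumed notation, not the paper's exact formulation: z is the CLIP image embedding being optimized, Π is the mapper, q_t is the question "Do you see [t]?", e_t is a CLIP text embedding of the target object t (an assumption about how "semantic similarity to the object" is measured), and λ > 0 is a hypothetical weight on the penalty.

```latex
\min_{z}\; -\log p_{\mathrm{MLLM}}\big(\text{``Yes''} \mid \Pi(z),\, q_t\big) \;+\; \lambda \,\cos\big(z,\, e_t\big)
```

The first term rewards a "Yes" answer from the target model; the second discourages the embedding from drifting toward the object itself. Any additional regularizers the authors use are not captured here.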

Architecture

The GHOST pipeline showing the three main stages: Optimization, Mapper Training, and Guided Diffusion.

Evaluation Highlights

Achieves hallucination success rates of 29.8% on Qwen2.5-VL and 28.5% on LLaVA-v1.6, discovering thousands of failure cases, compared to <1% for prior data-driven methods.
Demonstrates high transferability: images optimized for Qwen2.5-VL induce hallucinations in GPT-4o at a 66.5% rate.
Maintains high image quality and semantic fidelity, achieving a lower (better) FID relative to the original images than a standard diffusion baseline (e.g., 29.58 vs. 36.63 for Qwen2.5-VL).

Breakthrough Assessment

8/10

Significantly improves the efficiency and success rate of automated red-teaming for MLLMs. Decoupling optimization from generation avoids backpropagating through the diffusion model, which is what makes the approach scale.

⚙️ Technical Details

Problem Definition

Setting: Adversarial generation of visual inputs to induce specific textual hallucinations in MLLMs.

Inputs: Original image X_v (without target object t), target object t.

Outputs: Modified image X~_v that causes MLLM to predict presence of t, while t remains visually absent.

Pipeline Flow

Mapper Training (Offline)
Embedding Optimization (Attack Phase)
Guided Diffusion Generation
Verification
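
The four stages above can be strung together as in the following minimal sketch. Every function name and signature here is hypothetical: the stages are passed in as callables, mapper training is assumed to have happened offline (its result is used inside `optimize_embedding`), and the final re-query of the MLLM is an assumption about how a success is confirmed.

```python
from typing import Callable, Optional, Sequence

Embedding = Sequence[float]


def run_ghost_style_pipeline(
    image: object,
    target: str,
    encode_clip: Callable[[object], Embedding],
    optimize_embedding: Callable[[Embedding, str], Embedding],
    generate_image: Callable[[object, Embedding], object],
    object_present: Callable[[object, str], bool],
    mllm_says_yes: Callable[[object, str], bool],
) -> Optional[object]:
    """Return a hallucination-inducing image, or None if the attempt is rejected."""
    z0 = encode_clip(image)                    # CLIP embedding of the original image
    z_adv = optimize_embedding(z0, target)     # attack phase (relies on the offline-trained mapper)
    candidate = generate_image(image, z_adv)   # guided diffusion from a noised original
    if object_present(candidate, target):      # verification: the object must stay absent
        return None
    if not mllm_says_yes(candidate, target):   # assumed check: the MLLM must still hallucinate
        return None
    return candidate
```

Passing the stages as arguments keeps the sketch independent of any particular CLIP, diffusion, detector, or MLLM implementation.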

System Modules

Mapper (Π)

Bridge the embedding spaces of the diffusion model (CLIP) and the MLLM to enable efficient gradient-based optimization.

Model or implementation: Multi-Layer Perceptron (MLP)
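
To make the mapper's role concrete, here is a minimal forward pass of a two-layer MLP in plain Python. The layer count, ReLU activation, toy dimensions, and random initialization are all assumptions for illustration; the real mapper's architecture, output shape (for example, a sequence of vision tokens rather than one vector), and training loss are not specified in this summary.

```python
import random


def linear(x, weights, bias):
    """Dense layer: one output value per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def init_layer(n_in, n_out, rng):
    """Random weights (He-style scale) and zero biases for a dense layer."""
    scale = (2.0 / n_in) ** 0.5
    weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def mapper_forward(clip_embedding, layers):
    """Map a CLIP embedding into the (toy) MLLM vision-feature space."""
    h = list(clip_embedding)
    for i, (weights, bias) in enumerate(layers):
        h = linear(h, weights, bias)
        if i < len(layers) - 1:  # no activation after the last layer
            h = relu(h)
    return h


rng = random.Random(0)
CLIP_DIM, HIDDEN_DIM, MLLM_DIM = 8, 16, 12  # toy sizes, not the real dimensions
layers = [init_layer(CLIP_DIM, HIDDEN_DIM, rng), init_layer(HIDDEN_DIM, MLLM_DIM, rng)]
mapped = mapper_forward([0.1] * CLIP_DIM, layers)
```

Because the mapper is small and differentiable, gradients of the MLLM's output with respect to the CLIP embedding can flow through it cheaply.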

Optimizer

Find a perturbation in CLIP space that triggers the target MLLM response.

Model or implementation: AdamW Optimizer
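
For readers unfamiliar with AdamW, the sketch below implements its update rule in plain Python and applies it to a toy quadratic loss. The loss is a stand-in only: in GHOST the gradient would come from the MLLM's response through the mapper. The hyperparameters and the two-dimensional "embedding" are illustrative assumptions.

```python
import math


def adamw_step(params, grads, state, lr=0.05, betas=(0.9, 0.999), eps=1e-8, weight_decay=0.01):
    """One AdamW update: Adam moments plus decoupled weight decay."""
    state["t"] += 1
    t = state["t"]
    b1, b2 = betas
    new_params = []
    for i, (p, g) in enumerate(zip(params, grads)):
        state["m"][i] = b1 * state["m"][i] + (1 - b1) * g
        state["v"][i] = b2 * state["v"][i] + (1 - b2) * g * g
        m_hat = state["m"][i] / (1 - b1 ** t)  # bias-corrected first moment
        v_hat = state["v"][i] / (1 - b2 ** t)  # bias-corrected second moment
        p = p - lr * weight_decay * p          # decay applied to the weight, not the gradient
        p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
        new_params.append(p)
    return new_params


def toy_loss_and_grad(z, target):
    """Stand-in objective: squared distance to a fixed target vector."""
    loss = sum((zi - ti) ** 2 for zi, ti in zip(z, target))
    grad = [2 * (zi - ti) for zi, ti in zip(z, target)]
    return loss, grad


target = [1.0, -0.5]
z = [0.0, 0.0]
state = {"t": 0, "m": [0.0] * len(z), "v": [0.0] * len(z)}
initial_loss, _ = toy_loss_and_grad(z, target)
for _ in range(500):
    _, grad = toy_loss_and_grad(z, target)
    z = adamw_step(z, grad, state)
final_loss, _ = toy_loss_and_grad(z, target)
```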

Generator

Synthesize the adversarial image using the optimized embedding.

Model or implementation: Stable Diffusion unCLIP

Verifier

Ensure the target object was not accidentally generated.

Model or implementation: OWLv2 (Object Detector)
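
The verification step amounts to a filter over detector output. The sketch below assumes the detector has already been run and returned `(label, score)` pairs; the pair format, the 0.5 threshold, and the function name are hypothetical and do not reflect OWLv2's actual API or the paper's settings.

```python
from typing import Iterable, Tuple


def target_absent(
    detections: Iterable[Tuple[str, float]],
    target: str,
    threshold: float = 0.5,  # hypothetical confidence cutoff
) -> bool:
    """True if no detection of `target` reaches the confidence threshold."""
    return not any(label == target and score >= threshold for label, score in detections)


# Toy detector output for a generated banana image (illustrative values only).
detections = [("banana", 0.93), ("plate", 0.88), ("knife", 0.07)]
keep_image = target_absent(detections, "knife")
```

A lower threshold makes the filter stricter, discarding more candidates but reducing the chance that an image containing the object slips through.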

Novel Architectural Elements

Decoupled Mapper: Uses a lightweight MLP to approximate MLLM gradients w.r.t. CLIP embeddings, avoiding backpropagation through the heavy diffusion model.
Diffusion with Partial Noise: Initiates reverse diffusion from a noisy version of the original image (not pure noise) conditioned on the adversarial embedding to maintain high structural fidelity.
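
Starting from a noisy version of the original image can be expressed with the standard diffusion forward process. This is a sketch under assumptions: x_0 denotes the original image (or its latent representation), \bar{\alpha}_{\tau} is the cumulative noise schedule, and \tau < T is an intermediate timestep whose value is not given in this summary.

```latex
x_{\tau} = \sqrt{\bar{\alpha}_{\tau}}\; x_0 + \sqrt{1 - \bar{\alpha}_{\tau}}\; \epsilon,
\qquad \epsilon \sim \mathcal{N}(0, I), \quad 0 < \tau < T
```

Reverse diffusion then runs from x_{\tau} down to step 0, conditioned on the adversarial embedding. A smaller \tau keeps more of the original structure; a larger \tau gives the embedding more room to change the image.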

Modeling

Base Model: Evaluated on Qwen2.5-VL-7B-Instruct, LLaVA-v1.6-Mistral-7B, GLM-4.1V-Thinking

Comparison to Prior Work

vs. DASH: GHOST decouples optimization from generation via a mapper, which makes it faster (~10s, versus the slower DASH pipeline), and it optimizes embeddings in CLIP space rather than in latent space.
vs. AnyAttack/AttackVLM: GHOST introduces semantic-level misleading cues (e.g., modifying a banana stem) rather than imperceptible pixel noise.

Limitations

Relies on a proxy model (OWLv2) to verify object absence; if the detector misses an object that was actually generated, the image is wrongly counted as a hallucination (a false positive).
Success rate is lower on reasoning models (GLM-4.1V) compared to standard MLLMs.
Optimization requires white-box access to the target model (or a surrogate for transfer attacks).

Reproducibility

Code: https://github.com/sudoparsa/GHOST

Code and models are available at the repository above. Generation uses Stable Diffusion unCLIP and verification uses OWLv2; experiments use subsets of the COCO dataset.

📊 Experiments & Results

Evaluation Setup

Targeted object hallucination on COCO images not containing specific objects.

Benchmarks:

COCO (Object Hallucination Induction)
ObjectNet (Object Hallucination Induction)

Metrics:

Success Rate (Hallucinations / Total Attempts)
FID (Fréchet Inception Distance)
CLIP Score (Semantic Preservation)
Statistical methodology: Not explicitly reported in the paper
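
The success rate metric is a simple ratio, sketched below. The function name and the sample outcomes are illustrative only and are not results from the paper.

```python
def success_rate(outcomes):
    """Percentage of attempts that produced a verified hallucination.

    `outcomes` holds one boolean per attempt.
    """
    outcomes = list(outcomes)
    if not outcomes:
        raise ValueError("success rate is undefined for zero attempts")
    return 100.0 * sum(outcomes) / len(outcomes)


# Toy data: 2 successes out of 4 attempts.
rate = success_rate([True, False, False, True])
```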

Key Results

GHOST significantly outperforms the DASH baseline in generating valid hallucination-inducing images on the COCO dataset.
Benchmark	Metric	Baseline	This Paper	Δ
COCO (Qwen2.5-VL)	Success Rate (%)	0.1	29.8	+29.7
COCO (LLaVA-v1.6)	Success Rate (%)	0.1	28.5	+28.4

Transferability experiments showing that images optimized for one model break others.
Benchmark	Metric	Baseline	This Paper	Δ
GPT-4o	Hallucination Rate (%)	0	66.5	+66.5
Gemini-1.5-Flash	Hallucination Rate (%)	0	57.3	+57.3

Image quality metrics showing GHOST preserves semantic fidelity better than baselines.
Benchmark	Metric	Baseline	This Paper	Δ
COCO (Qwen2.5-VL)	FID (vs Initial Image, lower is better)	36.63	29.58	-7.05

Experiment Figures

Bar charts displaying the number of hallucination-inducing images generated per object class for Qwen2.5-VL and LLaVA-v1.6.

Qualitative example of a banana/knife hallucination.

Main Takeaways

GHOST is highly effective at inducing hallucinations, achieving ~29% success rates where prior methods achieved <1%.
Vulnerabilities are transferable: Adversarial images created for open-source models (Qwen) successfully attack closed-source SOTA models (GPT-4o, Gemini).
Fine-tuning on GHOST-generated images improves model robustness (mitigation), indicating that the method is also useful as a corrective tool.
The method generalizes to reasoning models (GLM-4.1V-Thinking), although success rates are slightly lower compared to standard MLLMs.

📚 Prerequisite Knowledge

Prerequisites

Multimodal Large Language Models (MLLMs)
CLIP (Contrastive Language-Image Pre-training)
Latent Diffusion Models
Adversarial Attacks / Red-teaming

Key Terms

MLLM: Multimodal Large Language Model, an AI system that can process and reason about both text and images.

CLIP: Contrastive Language-Image Pre-training, a model that learns joint representations for images and text, often used as the vision encoder in MLLMs.

Diffusion Model: A generative model that creates images by gradually denoising random noise, often conditioned on text or embeddings.

FID: Fréchet Inception Distance, a metric that assesses generated images by comparing the distribution of their features to that of a reference set of images; lower is better.

SSIM: Structural Similarity Index Measure, a metric for measuring the similarity between two images.

OWLv2: An open-vocabulary object detector, used here to verify that the target object was not actually inserted into the generated image.

unCLIP: A variation of Stable Diffusion that conditions image generation on CLIP image embeddings rather than just text.

Mapper: A simple Multi-Layer Perceptron (MLP) trained to align CLIP embeddings with the MLLM's vision encoder space.
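
For reference, the FID listed among the key terms has the following standard definition, where (\mu_r, \Sigma_r) and (\mu_g, \Sigma_g) are the mean and covariance of Inception features for the reference and generated image sets. In this summary the reference set is the original images.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
+ \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,\big(\Sigma_r \Sigma_g\big)^{1/2} \right)
```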