Beyond Superficial Unlearning: Sharpness-Aware Robust Erasure of Hallucinations in Multimodal LLMs

📝 Paper Summary

Machine Unlearning in MLLMs, Hallucination Mitigation, Adversarial Robustness

SARE frames unlearning as a min-max optimization problem using Targeted-SAM to flatten the loss landscape, ensuring hallucinations are robustly erased and do not resurface under weight perturbations.

Core Problem

Standard unlearning methods for MLLMs achieve only superficial suppression, trapping models in sharp minima where hallucinations catastrophically resurge after lightweight relearning or parameter perturbations.

Why it matters:

Current unlearning is structurally fragile; 'forgotten' knowledge is merely suppressed at a sharp minimum rather than truly erased
Models quickly revert to hallucination-prone behavior after exposure to just tens of relearning samples, undermining safety in real-world deployment
Existing solutions like EFUF focus on data curation but neglect the geometric stability of the optimization process

Concrete Example: After unlearning, a model might correctly caption an image without hallucinations. However, if exposed to just 140 relearning samples, the hallucination rate (Human_S) of a baseline unlearned model spikes from ~20 to 29.0, effectively undoing the safety alignment.

Key Novelty

Sharpness-Aware Robust Erasure (SARE)

Reformulates unlearning as a min-max game: an inner loop finds the weight perturbation that maximally revives hallucinations, and an outer loop minimizes loss under this worst-case scenario
Uses a Targeted-SAM mechanism to explicitly flatten the loss geometry around the unlearned state, making the erasure invariant to small weight shifts or fine-tuning
Integrates automated data curation (negative targets for erasure, positive anchors for grounding) with this robust optimization to balance erasure with capability preservation

Architecture

The SARE framework pipeline, illustrating data curation and the Targeted-SAM optimization process.

Evaluation Highlights

Reduces object hallucination (Chair_S) on mPLUG-Owl from 69.6 (Vanilla) to 37.3, a further 6.3 points below the EFUF baseline (43.6)
Maintains robustness against relearning attacks: under 140 relearning samples, SARE limits Human_S rebound to 21.0 on LLaVA, while EFUF degrades to 29.0
Preserves generation quality: achieves 18.9 Bleu-4 on LLaVA (vs. EFUF's 18.2) and improves perplexity to 0.101 (vs. EFUF's 0.113)

Breakthrough Assessment

8/10

Identifies a critical robustness failure in existing MLLM unlearning (sharp minima) and successfully applies SAM principles to fix it. Strong empirical gains against relearning attacks.

⚙️ Technical Details

Problem Definition

Setting: Multimodal unlearning to erase object hallucinations while preserving general capabilities

Inputs: Image v, text prompt x

Outputs: Generated caption y (free of specific hallucinated objects)

Pipeline Flow

Data Curation (Automated Pipeline)
Targeted Sharpness Tuning (Optimization)

System Modules

Data Curation

Identify hallucinated vs. grounded objects using CLIP scores

Model or implementation: CLIP-based alignment

Targeted-SAM Optimizer

Update model weights to minimize loss in a flat region

Model or implementation: Standard MLLM (e.g., LLaVA, mPLUG-Owl)

Novel Architectural Elements

Integration of Targeted-SAM into the unlearning objective function (min-max formulation specifically for hallucination suppression)

Modeling

Base Model: Evaluated on mPLUG-Owl-7B and LLaVA-v1.5-7B

Training Method: Targeted Sharpness-Aware Minimization (Targeted-SAM) on curated unlearning data

Objective Functions:

Purpose: Simulate worst-case attack (Inner Maximization).

Formally: epsilon* = argmax_epsilon L_neg(theta + epsilon) s.t. ||epsilon|| <= rho

Purpose: Minimize unlearning loss under worst-case perturbation (Outer Minimization).

Formally: min_theta [ L_neg(theta + epsilon*) + L_pos(theta) + L_sent(theta) ]
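
The two objectives above can be sketched as a single two-pass update. This is a minimal illustration on toy quadratic losses, not the paper's implementation: it assumes the inner maximization is solved with the standard first-order SAM approximation (epsilon* is approximately rho times the normalized gradient of L_neg), and the names targeted_sam_step and quadratic_grad, the learning rate, and the toy loss centers are all hypothetical.

```python
import math


def quadratic_grad(center, curvature):
    """Return the gradient function of 0.5 * sum(c * (t - m)^2), a toy stand-in for a real loss."""
    def grad(theta):
        return [c * (t - m) for t, m, c in zip(theta, center, curvature)]
    return grad


def targeted_sam_step(theta, grad_neg, grad_pos, grad_sent, rho=0.05, lr=0.1):
    """One update: perturb along the L_neg gradient only, then descend on the combined objective."""
    # Pass 1 (inner maximization): gradient of the hallucination loss at the current weights.
    g = grad_neg(theta)
    norm = math.sqrt(sum(x * x for x in g))
    # Assumed first-order approximation of epsilon*: a step of length rho along the unit gradient.
    eps = [rho * x / norm for x in g] if norm > 0 else [0.0] * len(theta)
    theta_adv = [t + e for t, e in zip(theta, eps)]
    # Pass 2 (outer minimization): L_neg is differentiated at theta + epsilon*,
    # while L_pos and L_sent are differentiated at the unperturbed theta.
    total = [a + b + c for a, b, c in zip(grad_neg(theta_adv), grad_pos(theta), grad_sent(theta))]
    # The combined gradient is applied to the original weights, not the perturbed ones.
    new_theta = [t - lr * g_i for t, g_i in zip(theta, total)]
    return new_theta, eps


# Hypothetical toy losses: each one pulls theta toward a different point.
grad_neg = quadratic_grad(center=[0.0, 0.0], curvature=[4.0, 1.0])
grad_pos = quadratic_grad(center=[0.5, -1.0], curvature=[1.0, 1.0])
grad_sent = quadratic_grad(center=[0.5, -1.0], curvature=[0.5, 0.5])

theta, eps = targeted_sam_step([1.0, -2.0], grad_neg, grad_pos, grad_sent)
print("epsilon* norm:", math.sqrt(sum(e * e for e in eps)))
print("updated theta:", theta)
```

Only L_neg drives the perturbation, which is what makes the step "targeted" in this sketch; the two gradient evaluations of L_neg per step are also where the extra compute cost comes from.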

Adaptation: Full fine-tuning (not stated explicitly; inferred from the method updating the model parameters theta directly)

Trainable Parameters: Model parameters theta

Training Data:

Derived from MSCOCO dataset
~30,000 triplets for unlearning (negative caption, positive caption, sentence preservation sample)

Key Hyperparameters:

rho: Neighborhood radius for the weight perturbation (a hyperparameter; its value is not given in the text)
T0: High-confidence grounding threshold (from EFUF)
T1: Hallucination threshold (from EFUF)
T2: Sentence-level reliability threshold
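
How the three thresholds might partition CLIP-scored text can be sketched as follows. This is an illustration under assumptions, not the paper's pipeline: the comparison directions (score >= T0 is grounded, score <= T1 is hallucinated, sentence score >= T2 is reliable), the requirement that T1 < T0, the ScoredSpan type, the curate function, and all numeric values are hypothetical, and the CLIP scores are taken as already computed.

```python
from dataclasses import dataclass


@dataclass
class ScoredSpan:
    text: str
    clip_score: float  # image-text alignment score, assumed to be computed elsewhere


def curate(objects, sentences, t0, t1, t2):
    """Split scored spans into negative targets, positive anchors, and sentence samples."""
    if t1 >= t0:
        raise ValueError("this sketch assumes the hallucination threshold T1 is below T0")
    # Low-scoring object mentions become negative targets for erasure.
    negatives = [o.text for o in objects if o.clip_score <= t1]
    # High-scoring object mentions become positive anchors for grounding.
    positives = [o.text for o in objects if o.clip_score >= t0]
    # Mentions scoring between T1 and T0 are ambiguous and are dropped.
    # Reliable whole sentences are kept as sentence-preservation samples.
    sentence_samples = [s.text for s in sentences if s.clip_score >= t2]
    return negatives, positives, sentence_samples


# Hypothetical scores and thresholds, for illustration only.
objects = [ScoredSpan("a dog", 0.31), ScoredSpan("a frisbee", 0.12), ScoredSpan("a bench", 0.22)]
sentences = [ScoredSpan("A dog runs on grass.", 0.29), ScoredSpan("A frisbee flies by.", 0.15)]
negatives, positives, sentence_samples = curate(objects, sentences, t0=0.28, t1=0.18, t2=0.25)
print(negatives, positives, sentence_samples)
```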

Compute: Not reported in the paper

Comparison to Prior Work

vs. EFUF: SARE adds geometric regularization (SAM) to flatten the loss landscape, preventing relearning attacks where EFUF fails
vs. Gradient Ascent: SARE balances forgetting with preservation and robustness, whereas GA often destroys general capabilities
vs. SCRUB/SS-SE [not cited in paper]: SARE specifically targets multimodal object hallucinations via automated fine-grained data curation, unlike general LLM unlearning methods

Limitations

Relies on automated CLIP-based scores for data curation, which may introduce their own noise
Computationally more expensive than standard unlearning due to the two-step SAM update (requires double gradient computation per step)
Experiments limited to MSCOCO dataset and two specific MLLM architectures (mPLUG-Owl, LLaVA)

Reproducibility

The paper text does not state whether code is available. The method relies on the EFUF data curation pipeline, which is cited. The MSCOCO dataset is public.

📊 Experiments & Results

Evaluation Setup

Image Captioning and Hallucination probing on MSCOCO dataset

Benchmarks:

MSCOCO (Image Captioning / Object Hallucination)

Metrics:

CHAIR (Chair_S, Chair_I) - Automated Hallucination Metric
POPE - Polling-based Object Probing Evaluation
MHumanEval (Human_S) - Human-verified hallucination rate
BLEU (Bleu-4) - Text similarity/quality
Perplexity (PPL) - Fluency
Informativeness - Semantic coverage

Statistical methodology: Not explicitly reported in the paper
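
For readers unfamiliar with the two automated hallucination metrics listed above, the sketch below shows how they are commonly computed. It is a simplified illustration, not the evaluation code used in the paper: CHAIR is reduced to set differences over already-extracted object names (the real metric also maps MSCOCO synonyms), Chair_I is taken as the per-object-instance rate and Chair_S as the per-caption rate, and the function names and toy inputs are hypothetical.

```python
def chair_scores(captions):
    """captions: list of (mentioned_objects, ground_truth_objects) pairs of sets."""
    hallucinated_mentions = 0
    total_mentions = 0
    captions_with_hallucination = 0
    for mentioned, truth in captions:
        wrong = mentioned - truth  # objects named in the caption but absent from the image
        hallucinated_mentions += len(wrong)
        total_mentions += len(mentioned)
        captions_with_hallucination += 1 if wrong else 0
    # Chair_I: share of object mentions that are hallucinated (instance level).
    chair_i = 100 * hallucinated_mentions / total_mentions if total_mentions else 0.0
    # Chair_S: share of captions containing at least one hallucinated object (sentence level).
    chair_s = 100 * captions_with_hallucination / len(captions) if captions else 0.0
    return chair_s, chair_i


def pope_metrics(answers, labels):
    """Score yes/no answers to 'Is there a <object> in the image?' against ground truth."""
    pairs = list(zip(answers, labels))
    tp = sum(a == "yes" and l == "yes" for a, l in pairs)
    fp = sum(a == "yes" and l == "no" for a, l in pairs)
    fn = sum(a == "no" and l == "yes" for a, l in pairs)
    tn = sum(a == "no" and l == "no" for a, l in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "yes_ratio": (tp + fp) / len(pairs),  # how often the model answers "yes"
    }


# Hypothetical toy data, for illustration only.
toy_captions = [
    ({"dog", "frisbee"}, {"dog", "frisbee", "person"}),
    ({"cat", "sofa", "tv"}, {"cat", "sofa"}),
    ({"car"}, {"car", "truck"}),
    ({"pizza", "fork"}, {"pizza"}),
]
print(chair_scores(toy_captions))
print(pope_metrics(["yes", "yes", "no", "yes", "no", "no"],
                   ["yes", "no", "no", "yes", "yes", "no"]))
```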

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Main hallucination mitigation performance showing SARE outperforms baselines on both architectures.
MSCOCO (mPLUG-Owl)	Chair_S (lower is better)	43.6 (EFUF)	37.3	-6.3
MSCOCO (mPLUG-Owl)	Chair_S (lower is better)	69.6 (Vanilla)	37.3	-32.3
Robustness against Relearning Attacks: SARE resists memory recovery better than EFUF.
MSCOCO (LLaVA)	Human_S (lower is better)	29.0 (EFUF)	21.0	-8.0
Robustness against LoRA Fine-tuning perturbations.
MSCOCO (LLaVA)	Chair_I (lower is better)	20.8	17.4	-3.4
Generation Quality preservation.
MSCOCO (LLaVA)	Bleu-4 (higher is better)	18.2 (EFUF)	18.9	+0.7
MSCOCO (LLaVA)	Perplexity (PPL, lower is better)	0.113 (EFUF)	0.101	-0.012
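
The Δ column is simply "This Paper" minus "Baseline". A short sketch that recomputes it from the rows above, and also derives the change relative to the baseline, is given here; the relative figures are arithmetic on the reported numbers rather than values stated in the paper, and the helper name is hypothetical.

```python
# (model, metric, baseline, this_paper) copied from the Key Results rows.
rows = [
    ("mPLUG-Owl", "Chair_S", 43.6, 37.3),
    ("mPLUG-Owl", "Chair_S", 69.6, 37.3),
    ("LLaVA", "Human_S", 29.0, 21.0),
    ("LLaVA", "Chair_I", 20.8, 17.4),
    ("LLaVA", "Bleu-4", 18.2, 18.9),
    ("LLaVA", "PPL", 0.113, 0.101),
]


def delta_and_relative(baseline, ours):
    """Return the absolute change and the change as a percentage of the baseline."""
    delta = ours - baseline
    return round(delta, 3), round(100 * delta / baseline, 1)


for model, metric, baseline, ours in rows:
    delta, relative = delta_and_relative(baseline, ours)
    print(f"{model:10s} {metric:8s} delta={delta:+} relative={relative:+}%")
```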

Experiment Figures

Comparison of hallucination rates between EFUF and SARE under relearning attacks, plus a conceptual loss landscape visualization.

Main Takeaways

SARE effectively erases hallucinations while maintaining or improving general generation quality (BLEU, Perplexity) compared to baselines.
Geometric stability is crucial for unlearning: flattening the loss landscape prevents the rapid resurgence of hallucinations observed in standard methods like EFUF.
The method is robust across different attack vectors: Relearning (data exposure), LoRA (parameter updates), and Adversarial Prompting (input perturbation).
Consistent improvements across different MLLM architectures (mPLUG-Owl and LLaVA) suggest the approach is architecture-agnostic.
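
The geometric intuition behind the second takeaway can be checked on a one-dimensional toy problem. The sketch below is an illustration only, with hypothetical curvature values and radius: two quadratic losses share the same minimum, and the same bounded weight shift raises the loss far more in the sharp one, which is the sense in which a flat minimum is harder to knock out of.

```python
def loss(t, curvature):
    """Toy loss 0.5 * curvature * t^2 with its minimum at t = 0."""
    return 0.5 * curvature * t * t


def worst_case_increase(curvature, rho, steps=200):
    """Largest loss increase over perturbations eps with |eps| <= rho around the minimum."""
    candidates = [-rho + 2 * rho * i / steps for i in range(steps + 1)]
    return max(loss(e, curvature) for e in candidates) - loss(0.0, curvature)


rho = 0.1  # hypothetical perturbation radius
sharp = worst_case_increase(curvature=50.0, rho=rho)
flat = worst_case_increase(curvature=2.0, rho=rho)
print("sharp minimum, worst-case increase:", sharp)
print("flat minimum, worst-case increase:", flat)
```

For a quadratic the worst case sits on the boundary of the ball and equals 0.5 * curvature * rho^2, so the gap between the two minima scales directly with the curvature ratio.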

📚 Prerequisite Knowledge

Prerequisites

Machine Unlearning concepts (forget set vs. retain set)
Sharpness-Aware Minimization (SAM) optimization
Multimodal LLM architectures (CLIP + LLM)
Adversarial Training principles (min-max optimization)

Key Terms

SAM: Sharpness-Aware Minimization—an optimization technique that seeks parameters in a flat region of the loss landscape to improve generalization and robustness

Relearning Attack: An adversarial evaluation method where an unlearned model is fine-tuned on a small amount of the original 'forgotten' data to see if the erased behavior resurfaces

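Chair: Captioning Hallucination Assessment with Image Relevance, a metric for quantifying object hallucinations in image captioning (Chair_S = sentence level, the share of captions containing a hallucinated object; Chair_I = instance level, the share of mentioned objects that are hallucinated)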

POPE: Polling-based Object Probing Evaluation—a method to evaluate object hallucination by asking yes/no questions about the existence of objects in an image

EFUF: Efficient Fine-grained Unlearning Framework—a baseline unlearning method for MLLMs that uses negative and positive subsentences but lacks sharpness-aware optimization

LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning technique that freezes pre-trained weights and injects trainable rank decomposition matrices

MHumanEval: A human-verified evaluation benchmark for assessing hallucinations in multimodal outputs

Targeted-SAM: The paper's specific adaptation of SAM where the inner maximization targets the hallucination loss specifically to find the worst-case relapse direction

CLIP: Contrastive Language-Image Pre-training—a model used here to score the alignment between image regions and text segments to detect hallucinations automatically