← Back to Paper List

Beyond Superficial Unlearning: Sharpness-Aware Robust Erasure of Hallucinations in Multimodal LLMs

Xianya Fang, Feiyang Ren, Xiang Chen, Yu Tian, Zhen Bi, Haiyang Yu, Sheng-Jun Huang
Nanjing University of Aeronautics and Astronautics, Institute for AI, Tsinghua University, Huzhou University, Institute of Dataspace, Hefei Comprehensive National Science Center, University of Science and Technology of China
arXiv (2026)
MM Factuality

📝 Paper Summary

Machine Unlearning in MLLMs Hallucination Mitigation Adversarial Robustness
SARE frames unlearning as a min-max optimization problem using Targeted-SAM to flatten the loss landscape, ensuring hallucinations are robustly erased and do not resurface under weight perturbations.
Core Problem
Standard unlearning methods for MLLMs achieve only superficial suppression, trapping models in sharp minima where hallucinations catastrophically resurge after lightweight relearning or parameter perturbations.
Why it matters:
  • Current unlearning is structurally fragile; 'forgotten' knowledge is merely suppressed at a sharp point rather than truly erased
  • Models quickly revert to hallucination-prone behavior after exposure to just tens of relearning samples, undermining safety in real-world deployment
  • Existing solutions like EFUF focus on data curation but neglect the geometric stability of the optimization process
Concrete Example: After unlearning, a model might correctly caption an image without hallucinations. However, if exposed to just 140 relearning samples, the hallucination rate (Human_S) of a baseline unlearned model spikes from ~20 to 29.0, effectively undoing the safety alignment.
Key Novelty
Sharpness-Aware Robust Erasure (SARE)
  • Reformulates unlearning as a min-max game: an inner loop finds the weight perturbation that maximally revives hallucinations, and an outer loop minimizes loss under this worst-case scenario
  • Uses a Targeted-SAM mechanism to explicitly flatten the loss geometry around the unlearned state, making the erasure invariant to small weight shifts or fine-tuning
  • Integrates automated data curation (negative targets for erasure, positive anchors for grounding) with this robust optimization to balance erasure with capability preservation
Architecture
Architecture Figure Figure 2
The SARE framework pipeline, illustrating data curation and the Targeted-SAM optimization process.
Evaluation Highlights
  • Reduces object hallucination (Chair_S) on mPLUG-Owl from 69.6 (Vanilla) to 37.3, significantly outperforming the EFUF baseline (43.6)
  • Maintains robustness against relearning attacks: under 140 relearning samples, SARE limits Human_S rebound to 21.0 on LLaVA, while EFUF degrades to 29.0
  • Preserves generation quality: achieves 18.9 Bleu-4 on LLaVA (vs. EFUF's 18.2) and improves perplexity to 0.101 (vs. EFUF's 0.113)
Breakthrough Assessment
8/10
Identifies a critical robustness failure in existing MLLM unlearning (sharp minima) and successfully applies SAM principles to fix it. Strong empirical gains against relearning attacks.
×