Yue Liu, Shengfang Zhai, Mingzhe Du, Yulin Chen, Tri Cao, Hongcheng Gao, Cheng Wang, Xinfeng Li, Kun Wang, Junfeng Fang, Jiaheng Zhang, Bryan Hooi
National University of Singapore, Nanyang Technological University
arXiv (2025)
MM, RL, Reasoning, Benchmark
📝 Paper Summary
Vision-Language Model Safety, AI Moderation, Reinforcement Learning from Human Feedback (RLHF)
GuardReasoner-VL improves VLM safety by training a guard model to explicitly reason before moderating inputs, incentivized via online reinforcement learning with safety-aware data augmentation and dynamic exploration strategies.
Core Problem
Existing VLM guard models perform simple classification without justification, lack interpretability, and struggle to generalize to complex or hidden harmful content due to limited offline training data.
Why it matters:
VLMs deployed in critical domains (education, finance) are vulnerable to jailbreaks and multimodal attacks
Standard safety alignment methods (training the victim model directly) impose an 'alignment tax' that degrades general capabilities like reasoning and creativity
Current guards are 'black boxes' that cannot explain why content is blocked, hindering trust and debugging
Concrete Example: When a user provides a seemingly harmless image with subtle hate symbols alongside text, a standard guard outputs only 'safe' or 'unsafe', with no explanation. GuardReasoner-VL first outputs a reasoning trace that identifies the symbol and its context, then renders a verdict, catching nuanced harm that classification-only models miss.
Key Novelty
Reason-then-moderate VLM Guard optimized via Online RL
Constructs a massive reasoning corpus (GuardReasoner-VLTrain) mixing text, image, and text-image pairs with GPT-4o generated reasoning traces
Uses 'Safety-Aware Data Concatenation' during RL to create hard samples by hiding harmful content among harmless content, forcing the model to detect subtle threats
Employs a dynamic clipping parameter in GRPO to shift from exploration to exploitation, and a length-aware reward that incentivizes deeper reasoning when the model makes errors
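The concatenation idea in the list above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `Sample` type, the function names, and in particular the labeling rule (the merged sample is harmful if either part is harmful) are hypothetical and are not taken from the released code.

```python
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Sample:
    text: Optional[str]        # None for image-only samples
    image_path: Optional[str]  # None for text-only samples
    harmful: bool


def concat_samples(a: Sample, b: Sample) -> dict:
    """Merge two samples into one harder sample (hypothetical scheme)."""
    texts = [s.text for s in (a, b) if s.text is not None]
    images = [s.image_path for s in (a, b) if s.image_path is not None]
    return {
        "text": "\n".join(texts),
        "images": images,  # assumed to be combined into one input downstream
        # Assumed labeling rule: harmful content hidden next to harmless
        # content still makes the merged sample harmful.
        "harmful": a.harmful or b.harmful,
    }


def build_hard_samples(pool: List[Sample], n: int, seed: int = 0) -> List[dict]:
    """Draw n random pairs from the pool and merge each pair."""
    rng = random.Random(seed)
    return [concat_samples(*rng.sample(pool, 2)) for _ in range(n)]
```

Under this assumed rule, a guard that reads only the benign half of a merged sample gets the label wrong, which is what makes such samples hard.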
Architecture
The training pipeline comprises three stages: data curation, model cold-start (SFT), and online RL optimization.
Evaluation Highlights
+19.27% average F1 score improvement over the runner-up baseline across evaluated benchmarks
Established a new reasoning corpus with 123K samples and 631K reasoning steps covering diverse modalities
Demonstrates superior performance in both prompt harmfulness detection and response harmfulness detection tasks
Breakthrough Assessment
8/10
Significant advance in VLM safety: it applies reasoning-based RL (as used in general-purpose reasoning models such as o1) to the specific domain of multimodal guardrails, with strong empirical gains.
⚙️ Technical Details
Problem Definition
Setting: Multimodal Moderation: Detecting harmfulness in prompts and responses
Inputs: User prompt X (text T, image I, or pair {T, I}) and victim model response S
Outputs: Reasoning process R followed by moderation label Y_hat (harmful/unharmful)
Pipeline Flow
Input Processing (Text/Image/Pair)
GuardReasoner-VL Inference
System Modules
GuardReasoner-VL
Reason about safety risks and output a classification
Model or implementation: 3B or 7B parameter VLM (base architecture not specified in this summary)
Novel Architectural Elements
Integration of explicit reasoning generation (<think> tags) directly into the VLM guardrail pipeline
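A guard that reasons before moderating needs a small post-processing step that separates the reasoning from the verdict. The sketch below assumes the reasoning sits inside `<think>...</think>` and that the label (`harmful` or `unharmful`) follows the closing tag. That output layout and the name `parse_guard_output` are assumptions for illustration, not the released model's documented format.

```python
import re
from typing import NamedTuple, Optional


class Moderation(NamedTuple):
    reasoning: str
    label: Optional[str]  # "harmful", "unharmful", or None if unparseable


_THINK = re.compile(r"<think>(.*?)</think>", re.DOTALL)
# "unharmful" is listed first; the word boundary also prevents "harmful"
# from matching inside "unharmful".
_LABEL = re.compile(r"\b(unharmful|harmful)\b", re.IGNORECASE)


def parse_guard_output(raw: str) -> Moderation:
    """Split raw model text into reasoning and verdict (assumed layout)."""
    think = _THINK.search(raw)
    if think is None:
        # No reasoning block: treat the output as malformed.
        return Moderation("", None)
    verdict = _LABEL.search(raw[think.end():])
    label = verdict.group(1).lower() if verdict else None
    return Moderation(think.group(1).strip(), label)
```

Returning `None` for malformed output lets a caller decide how to handle it, for example by failing closed and treating the content as harmful.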
Modeling
Base Model: 3B and 7B parameter VLMs (base architecture not named in this summary)
Training Method: Reasoning SFT followed by Online RL (GRPO)
Objective Functions:
Purpose: Cold-start reasoning ability via supervised learning.
Formally: Standard auto-regressive language modeling loss on reasoning corpus D.
Purpose: Optimize policy via Group Relative Policy Optimization (GRPO) without KL loss.
Formally: Maximize advantage A_i weighted by policy ratio, subject to dynamic clipping B_s.
Purpose: Encourage valid format and correct classification.
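The first two objectives above can be written out in their standard textbook forms. This is a sketch under assumptions: the notation (o for the generated sequence, G for group size, ρ_i for the policy ratio, r_i for the scalar reward) is chosen here for illustration, the ratio is shown at sequence level for brevity, and the paper's exact formulation may differ. B_s denotes the clipping parameter at training step s.

```latex
% SFT cold start: next-token loss on the reasoning corpus D,
% where the target o concatenates reasoning R and label Y.
\[
\mathcal{L}_{\mathrm{SFT}}(\theta)
  = -\,\mathbb{E}_{(X, S, R, Y) \sim \mathcal{D}}
    \sum_{t=1}^{|o|} \log \pi_\theta\!\left(o_t \mid X, S, o_{<t}\right),
  \qquad o = [R; Y]
\]

% Group-relative advantage over G sampled outputs with rewards r_1..r_G.
\[
A_i = \frac{r_i - \operatorname{mean}(r_1, \dots, r_G)}
           {\operatorname{std}(r_1, \dots, r_G)}
\]

% GRPO surrogate without a KL term, clipped with the step-dependent B_s.
\[
\mathcal{J}(\theta)
  = \mathbb{E}\!\left[ \frac{1}{G} \sum_{i=1}^{G}
      \min\!\Big( \rho_i A_i,\;
        \operatorname{clip}\!\left(\rho_i,\, 1 - B_s,\, 1 + B_s\right) A_i \Big)
    \right],
  \qquad
  \rho_i = \frac{\pi_\theta(o_i \mid X, S)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid X, S)}
\]
```

A larger B_s lets the policy move further from the sampling policy per update (exploration), and shrinking it over training constrains updates (exploitation).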
Code, data, and models (3B/7B) are publicly released at https://github.com/yueliu1999/GuardReasoner-VL/. Detailed dataset statistics and construction methods are provided.
📊 Experiments & Results
Evaluation Setup
VLM Moderation (Prompt and Response Harmfulness Detection)
Statistical methodology: Not explicitly reported in the paper
Key Results
The model demonstrates significant improvements over baselines in overall moderation performance.
Benchmark | Metric | Baseline | This Paper | Δ
Average across test sets | F1 score | Not reported in the paper | Not reported in the paper | +19.27%
Experiment Figures
Comparison of GuardReasoner-VL against baselines in terms of F1 score.
Illustration of Safety-Aware Data Concatenation.
Main Takeaways
Reasoning before moderating outperforms the compared guard models, improving average F1 by 19.27% over the runner-up baseline.
Safety-aware data concatenation effectively creates hard negatives, teaching the model to identify harmful elements hidden in harmless contexts.
Online RL with dynamic exploration (clipping) allows the model to refine its reasoning capabilities beyond the initial SFT phase.
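To make the online RL takeaway concrete, the sketch below walks through the scalar bookkeeping of one GRPO-style update: score each sampled output, normalize rewards within the group, and apply a clipped surrogate whose bound shrinks over training. All specifics are assumptions for illustration: the reward shape (including the small length-based credit on wrong verdicts), the linear decay schedule, and the constants 0.3, 0.1, 0.2, and 1024 are hypothetical and are not the paper's values or formulas.

```python
import math
from statistics import mean, pstdev
from typing import List


def reward(format_ok: bool, correct: bool, n_tokens: int,
           max_tokens: int = 1024, bonus: float = 0.2) -> float:
    """Hypothetical length-aware reward, NOT the paper's formula."""
    if not format_ok:
        return 0.0
    if correct:
        return 1.0
    # Wrong verdict: small credit that grows with reasoning length, capped.
    return bonus * min(n_tokens / max_tokens, 1.0)


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group of sampled outputs."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def clip_bound(step: int, total_steps: int,
               start: float = 0.3, end: float = 0.1) -> float:
    """Assumed schedule: linearly shrink the clip bound over training."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * frac


def clipped_surrogate(logp_new: float, logp_old: float,
                      advantage: float, bound: float) -> float:
    """PPO-style clipped objective for one output (sequence-level ratio)."""
    ratio = math.exp(logp_new - logp_old)
    clipped = min(max(ratio, 1.0 - bound), 1.0 + bound)
    return min(ratio * advantage, clipped * advantage)
```

Because advantages are computed relative to the group mean, no separate value network is needed. A group in which every output earns the same reward yields zero advantages and therefore no gradient signal.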
📚 Prerequisite Knowledge
Prerequisites
Vision-Language Models (VLMs)
Reinforcement Learning (RL)
Supervised Fine-Tuning (SFT)
Chain-of-Thought (CoT) Reasoning
Key Terms
GRPO (Group Relative Policy Optimization): a reinforcement learning algorithm that optimizes a policy by comparing a group of sampled outputs against each other rather than using a separate value function
VLM (Vision-Language Model): a model that takes images and text as input and generates text
SFT (Supervised Fine-Tuning): training a model on labeled examples to establish baseline capabilities
CoT (Chain-of-Thought): a prompting or training strategy where the model generates intermediate reasoning steps before the final answer
F1 score: A metric balancing precision and recall, used here to measure moderation accuracy
Jailbreak: Adversarial prompts designed to bypass a model's safety filters
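As a worked example of the F1 definition above, the following sketch computes F1 for a binary moderation task. Treating `harmful` as the positive class is an assumption made here, and the function name is illustrative.

```python
from typing import Sequence


def f1_score(y_true: Sequence[str], y_pred: Sequence[str],
             positive: str = "harmful") -> float:
    """F1 = harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

With one true positive, one false positive, and one false negative, precision and recall are both 0.5, so F1 is 0.5.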