Safe RLHF-V: Safe Reinforcement Learning from Multi-modal Human Feedback

📝 Paper Summary

Multimodal Safety Alignment
Reinforcement Learning from Human Feedback (RLHF)

Safe RLHF-V decouples helpfulness and safety preferences in multimodal models, using a constrained optimization framework and a new dataset to improve safety without sacrificing capability.

Core Problem

Multimodal Large Language Models (MLLMs) face unique safety risks where images induce harmful content, and existing datasets lack strong visual-text correlations, making it hard to balance helpfulness and safety.

Why it matters:

Images can implicitly induce MLLMs to generate harmful content (jailbreaks) that purely text-based safety alignment misses.
Naive refusal strategies (e.g., refusing everything) ensure safety but destroy helpfulness; balancing both requires resolving their inherent conflict.
Existing multimodal safety datasets often have weak image-text correlation, where the image doesn't actually contribute to the harmfulness, limiting training effectiveness.

Concrete Example: In current datasets like SPA-VL, an adversarial query's harmfulness is often independent of the image (ASR increases only 6.8% when adding the image vs. text-only). A model trained on this fails to learn grounded safety, whereas Safe RLHF-V targets scenarios where the visual context specifically triggers the harm.

Key Novelty

Decoupled Dual-Preference Optimization for Multimodal Safety

Separates human feedback into two distinct streams: one for helpfulness and one for safety, rather than a single 'better/worse' signal that conflates them.
Introduces a granular 7-point safety scale (from severe harm to proactive warning) to train a multi-level guardrail system rather than a simple binary safe/unsafe classifier.
Uses a Lagrangian-based constrained optimization approach (Safe RLHF) adapted for vision-language models to maximize helpfulness while strictly enforcing a safety budget.
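
The decoupling described above can be illustrated with a small sketch. It assumes each annotated comparison stores two independent labels (which response is more helpful, which is safer) and that each stream is trained with a standard Bradley-Terry pairwise loss. The field names, the `decoupled_losses` helper, and the "higher score means safer" convention are hypothetical choices for illustration, not the paper's schema or training code.

```python
import math
from dataclasses import dataclass


@dataclass
class DualPreferencePair:
    """One annotated comparison. Field names are hypothetical."""
    prompt_text: str
    image_path: str
    response_a: str
    response_b: str
    more_helpful: str  # "a" or "b": label from the helpfulness stream
    safer: str         # "a" or "b": label from the safety stream


def pairwise_loss(score_preferred: float, score_rejected: float) -> float:
    """Bradley-Terry negative log-likelihood: -log(sigmoid(preferred - rejected))."""
    margin = score_preferred - score_rejected
    # log1p(exp(-m)) equals -log(sigmoid(m)); adequate for moderate margins.
    return math.log1p(math.exp(-margin))


def decoupled_losses(pair, helpful_scores, safety_scores):
    """Return (helpfulness_loss, safety_loss) from two separate scorers.

    Each scores dict maps "a"/"b" to a scalar; higher means more helpful
    (or safer). This sign convention is an assumption of the sketch.
    """
    other = {"a": "b", "b": "a"}
    helpful_loss = pairwise_loss(
        helpful_scores[pair.more_helpful],
        helpful_scores[other[pair.more_helpful]],
    )
    safety_loss = pairwise_loss(
        safety_scores[pair.safer],
        safety_scores[other[pair.safer]],
    )
    return helpful_loss, safety_loss
```

Because the two labels are stored separately, a response can be preferred for helpfulness while being dispreferred for safety, which a single better/worse label cannot express.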

Architecture

The Safe RLHF-V framework flow: Data Construction -> Guardrail Training -> Alignment Algorithm.

Evaluation Highlights

Safe RLHF-V improves safety by 34.2% and helpfulness by 34.3% over the base model.
Beaver-Guard-V achieves 85% accuracy in detecting harmful content with multi-level meta labels, outperforming baselines.
Applying Beaver-Guard-V over 5 rounds of filtering lowers the precursor model's Attack Success Rate (ASR), improving overall safety by an average of 40.9%.

Breakthrough Assessment

8/10

First comprehensive framework (dataset + guardrail + algorithm) for multimodal safety that explicitly decouples safety/helpfulness preferences. Addresses the critical 'forgetting' issue in MLLM fine-tuning.

⚙️ Technical Details

Problem Definition

Setting: Constrained Markov Decision Process (CMDP) for multimodal generation

Inputs: Multimodal prompt x (image + text)

Outputs: Response y that maximizes reward while satisfying safety cost constraints

Pipeline Flow

Input Processing (Image + Text)
Guardrail Filtering (Beaver-Guard-V)
Response Generation (MLLM Policy)
Dual Reward Evaluation (Helpfulness RM + Safety Cost Model)
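
The four stages above can be sketched as a single function. This is a minimal illustration under stated assumptions: the guardrail, policy, reward model, and cost model are passed in as plain callables, and `Prompt`, `StepResult`, `run_pipeline`, and the refusal string are hypothetical names, not the paper's interfaces.

```python
from typing import Callable, NamedTuple, Optional


class Prompt(NamedTuple):
    image: bytes  # raw image bytes; a real system would pass pixel tensors
    text: str


class StepResult(NamedTuple):
    response: str
    reward: Optional[float]  # helpfulness score, None if the prompt was blocked
    cost: Optional[float]    # safety cost, None if the prompt was blocked
    blocked: bool


def run_pipeline(
    prompt: Prompt,
    guard_flags_unsafe: Callable[[Prompt], bool],
    policy: Callable[[Prompt], str],
    reward_model: Callable[[Prompt, str], float],
    cost_model: Callable[[Prompt, str], float],
    refusal: str = "Sorry, I can't help with that.",
) -> StepResult:
    # Stage 2: the guardrail screens the multimodal prompt before generation.
    if guard_flags_unsafe(prompt):
        return StepResult(refusal, None, None, True)
    # Stage 3: the MLLM policy produces a response.
    response = policy(prompt)
    # Stage 4: helpfulness and safety are scored by separate models.
    return StepResult(
        response,
        reward_model(prompt, response),
        cost_model(prompt, response),
        False,
    )
```

Keeping the reward and cost as two separate numbers, rather than one blended score, is what lets a constrained optimizer trade them off explicitly.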

System Modules

Beaver-Guard-V

Proactively filter unsafe queries and adversarial attacks before generation

Model or implementation: Fine-tuned MLLM (based on LLaVA/Qwen architectures)

MLLM Policy

Generate helpful and safe responses

Model or implementation: LLaVA-1.5-7B or similar base MLLM

Novel Architectural Elements

Integration of multi-level safety meta-labels (minor, moderate, severe) into the guardrail training pipeline
Dual-model preference optimization within the MLLM architecture: separate heads/models for reward (helpfulness) and cost (safety) estimation driving the policy update

Modeling

Base Model: LLaVA-1.5-7B, Qwen2-VL-7B (for various experiments)

Training Method: Safe RLHF (Lagrangian-based PPO)

Objective Functions:

Purpose: Maximize expected reward while keeping expected cost below a limit.

Formally: max_θ E[r(s,a)] s.t. E[c(s,a)] <= b, where actions are sampled from the policy π_θ and b is the cost budget

Purpose: Solve the constrained problem using the Lagrangian primal-dual method.

Formally: min_{λ>=0} max_θ L(θ, λ) = E[r(s,a)] - λ(E[c(s,a)] - b)
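
To make the primal-dual update concrete, here is a toy sketch on a one-dimensional problem. It is not the paper's PPO-based training loop: the reward r(θ) = -(θ - 1)^2, the cost c(θ) = θ, the budget 0.3, and the step size are all assumptions chosen so the answer can be checked by hand. The policy parameter takes gradient ascent steps on the Lagrangian while the multiplier λ takes projected descent steps, which raises λ whenever the cost exceeds the budget.

```python
def primal_dual(budget: float = 0.3, lr: float = 0.05, steps: int = 2000):
    """Toy primal-dual solver for: max -(theta - 1)^2  s.t.  theta <= budget.

    L(theta, lam) = -(theta - 1)^2 - lam * (theta - budget)
    """
    theta, lam = 0.0, 0.0
    for _ in range(steps):
        # Gradient of L with respect to theta (reward gradient minus lam * cost gradient).
        grad_theta = -2.0 * (theta - 1.0) - lam
        # Constraint violation E[c] - b, evaluated at the current theta.
        constraint_gap = theta - budget
        # Primal step: ascend the Lagrangian in theta.
        theta = theta + lr * grad_theta
        # Dual step: descend in lam, i.e. increase lam while the constraint
        # is violated, and project back onto lam >= 0.
        lam = max(0.0, lam + lr * constraint_gap)
    return theta, lam


theta_star, lam_star = primal_dual()
```

On this toy problem the unconstrained optimum is θ = 1, so the constraint is active and the iterates settle at θ = 0.3 with λ = 1.4, the value at which the reward gradient 2(1 - θ) is exactly balanced by the multiplier.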

Training Data:

BeaverTails-V dataset: 32k prompts, dual preference annotations, 9 primary / 20 secondary harm categories
Source images from Yandex using taxonomy-driven keywords
Responses generated by multiple MLLMs (Ovis, Phi-3.5, etc.) and filtered for diversity

Key Hyperparameters:

budget_bound: Used to stabilize optimization (the specific value is not stated in the text and is likely tuned per experiment)

Comparison to Prior Work

vs. SPA-VL: BeaverTails-V has stronger visual-text grounding and multi-level safety labels
vs. RLHF-V: Safe RLHF-V explicitly models safety as a cost constraint separate from the helpfulness reward, rather than mixing them
vs. Standard RLHF: Uses constrained optimization (Safe RLHF) adapted for MLLMs to solve the helpfulness-safety trade-off

Limitations

Training stability in multimodal constrained optimization remains a significant engineering challenge.
Reliance on powerful proprietary models (GPT-4o) for annotation and evaluation guidance.
Guardrail effectiveness depends on the granularity and quality of the safety taxonomy.

Reproducibility

Code: https://github.com/SafeRLHF-V

All datasets (BeaverTails-V), models (Beaver-Guard-V), and code are open-sourced at https://github.com/SafeRLHF-V. The paper provides detailed taxonomy and annotation guidelines in appendices.

📊 Experiments & Results

Evaluation Setup

Safety and Helpfulness evaluation using benchmarks and red-teaming

Benchmarks:

BeaverTails-V (Test Set) (Safety & Helpfulness Preference) [New]
MM-SafetyBench (Multimodal Safety Evaluation)
SPA-VL (Safety Preference Alignment)

Metrics:

Attack Success Rate (ASR)
Accuracy (for Guardrail)
False Positive Rate
Statistical methodology: Not explicitly reported in the paper
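
For reference, the three metrics above can be computed as follows. This is a generic sketch using the standard definitions; the function names and the boolean-list input format are assumptions, and the paper's own evaluation scripts may differ (for example in how a response is judged harmful).

```python
def attack_success_rate(response_is_harmful):
    """Fraction of adversarial prompts whose response was judged harmful."""
    return sum(response_is_harmful) / len(response_is_harmful)


def guard_metrics(predicted_unsafe, truly_unsafe):
    """Accuracy and false positive rate of a binary guardrail.

    A false positive is a safe item that the guardrail flagged as unsafe.
    """
    pairs = list(zip(predicted_unsafe, truly_unsafe))
    true_pos = sum(1 for p, t in pairs if p and t)
    true_neg = sum(1 for p, t in pairs if not p and not t)
    false_pos = sum(1 for p, t in pairs if p and not t)
    accuracy = (true_pos + true_neg) / len(pairs)
    safe_total = false_pos + true_neg
    # FPR is undefined without safe items; report 0.0 in that case.
    fpr = false_pos / safe_total if safe_total else 0.0
    return accuracy, fpr
```

A guardrail that refuses everything would score a false positive rate of 1.0, which is why the metric is reported alongside accuracy.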

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
BeaverTails-V (Binary Setting)	Accuracy (%)	Not reported in the paper	78	Not reported in the paper
BeaverTails-V (Multi-level Setting)	Accuracy (%)	Not reported in the paper	85	Not reported in the paper
Safe RLHF-V Evaluation	Safety Improvement (%)	0	34.2	+34.2
Safe RLHF-V Evaluation	Helpfulness Improvement (%)	0	34.3	+34.3

Experiment Figures

Performance of Beaver-Guard-V under different 'Filter of N' (FoN) rounds.

Main Takeaways

Image content in existing datasets (SPA-VL) has minimal influence on query harmfulness (ASR rises only ~6.8% with images), validating the need for BeaverTails-V.
Multi-round moderation with Beaver-Guard-V consistently lowers Attack Success Rate (ASR) across different MLLMs; Filter of 5 yields a lower ASR than any setting with fewer rounds.
Safe RLHF-V successfully improves both helpfulness and safety simultaneously, overcoming the 'safety tax' often observed where safer models become less helpful.
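
The summary does not spell out how the Filter of N rounds are implemented. One plausible reading, sketched below purely as an assumption, is a regenerate-and-recheck loop: the model produces a response, the guardrail checks it, and flagged responses are regenerated for up to N rounds before falling back to a refusal. The names `filter_of_n`, `generate`, `is_unsafe`, and the fallback text are hypothetical.

```python
def filter_of_n(prompt, generate, is_unsafe, n_rounds=5,
                fallback="Sorry, I can't help with that."):
    """Return (response, rounds_used) under a regenerate-and-recheck loop.

    generate(prompt) -> str produces a candidate response.
    is_unsafe(prompt, response) -> bool is the guardrail's verdict.
    """
    for round_index in range(1, n_rounds + 1):
        response = generate(prompt)
        if not is_unsafe(prompt, response):
            # The guardrail accepted this candidate, so stop early.
            return response, round_index
    # Every round was flagged; return a refusal instead of a flagged response.
    return fallback, n_rounds
```

Under this reading, more rounds give the guardrail more chances to reject a harmful response, which would be consistent with ASR falling as N grows.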

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning from Human Feedback (RLHF)
Constrained Optimization (Lagrangian methods)
Multimodal Large Language Models (MLLMs)

Key Terms

RLHF: Reinforcement Learning from Human Feedback—fine-tuning models using rewards derived from human preferences.

MLLM: Multimodal Large Language Model; a model that takes inputs in more than one modality (here, images and text) and generates text responses.

CMDP: Constrained Markov Decision Process—an RL framework where the agent maximizes reward subject to cost constraints (e.g., safety limits).

Lagrangian method: A mathematical technique used to solve constrained optimization problems by incorporating constraints into the objective function via multipliers.

ASR: Attack Success Rate—the percentage of adversarial prompts that successfully trigger a model to generate harmful or unsafe content.

BeaverTails-V: The dataset introduced in this paper, featuring dual preference annotations (helpfulness/safety) and graded safety labels.

Guardrail: A safety mechanism (often a separate model) that filters inputs or outputs to prevent the main model from processing or generating harmful content.