SaFeR-ToolKit: Structured Reasoning via Virtual Tool Calling for Multimodal Safety

📝 Paper Summary

Multimodal Safety Alignment · Vision-Language Model Jailbreaking · Agentic Reasoning

SaFeR-ToolKit transforms safety decision-making from an opaque final output into an explicit, auditable process by forcing models to generate structured virtual tool traces before answering.

Core Problem

Vision-language models suffer from jailbreaks and over-refusal because safety decisions are implicit and couple perception with intent, making them fragile to adversarial inputs and hard to audit.

Why it matters:

Adversarial images (e.g., prompt injections) can bypass text-only safety guards, causing models to ignore visual evidence or violate policies
Current safety tuning often leads to over-refusal, where benign requests are rejected because the model cannot explicitly separate user intent from visual context
Existing alignment methods (DPO, RLHF) optimize only the final response, leaving the reasoning process opaque and unverifiable

Concrete Example: When an image contains adversarial text (e.g., instructions to build a bomb hidden in a meme), standard models might follow the text, ignoring safety rules. Conversely, a benign query about a historical weapon might be refused. SaFeR-ToolKit forces the model to explicitly call a 'Perception' tool to describe the image and a 'Reasoning' tool to analyze intent before deciding.

Key Novelty

Protocolized Safety via Virtual Tool Traces

Formalizes safety as a checkable protocol where a 'planner' selects a persona and toolset, and a 'responder' must generate a structured trace (Perception -> Reasoning -> Decision) before answering
Uses a library of 'virtual tools' (text-based operators) that output typed records, making the intermediate reasoning steps explicit, testable, and auditable
Introduces a three-stage alignment curriculum (SFT -> DPO -> GRPO) where GRPO specifically rewards valid tool usage and reasoning depth rather than just final answer quality

Architecture

The SaFeR-ToolKit protocol showing the transformation of inputs into a structured trace (Perception -> Reasoning -> Decision) before the final answer.

Evaluation Highlights

Significant improvements in Safety/Helpfulness/Reasoning Rigor on Qwen2.5-VL-7B (53.21/52.92/19.26 -> 86.34/80.79/85.34)
Boosts Qwen2.5-VL-3B performance on safety benchmarks from 29.39 (Safety) to 84.40, while preserving general capabilities (58.67 -> 59.21)
Outperforms guard-based baselines, which often increase Safety but drastically reduce Helpfulness (a symptom of over-refusal)

Breakthrough Assessment

8/10

Strong conceptual shift from outcome-based to process-based safety alignment. The structured tool approach addresses the 'black box' safety problem effectively, with substantial empirical gains.

⚙️ Technical Details

Problem Definition

Setting: Multimodal safety alignment where input x=(Image, query) is mapped to response y via an intermediate virtual-tool trace z

Inputs: Input pair x = (I, q)

Outputs: Structured output (z, y) where z is a trace of tool calls and y is the final response

Pipeline Flow

Planner (Determines strategy based on risk)
Responder (Generates tool trace z -> final answer y)
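
The two-stage flow above can be sketched as a pair of functions. This is a minimal illustration, not the paper's implementation: the class names, field names, tool names (`describe_image`, `analyze_intent`, `decide`), and the risk-to-strategy mapping are all hypothetical, and the responder is a stub standing in for the target VLM.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Plan:
    """Planner output. Field names are illustrative assumptions."""
    risk_category: str
    persona: str
    tools: Tuple[str, ...]
    topology: str


@dataclass
class ToolCall:
    """One typed record in the trace z."""
    stage: str   # "Perception", "Reasoning", or "Decision"
    tool: str    # hypothetical tool name
    output: str


# Hypothetical mapping from tool name to protocol stage.
STAGE_OF = {
    "describe_image": "Perception",
    "analyze_intent": "Reasoning",
    "decide": "Decision",
}


def make_plan(risk_category: str) -> Plan:
    """Toy deterministic planner: map a risk label to a strategy."""
    if risk_category == "benign":
        return Plan(risk_category, "helpful_assistant",
                    ("describe_image", "analyze_intent", "decide"), "linear")
    return Plan(risk_category, "safety_analyst",
                ("describe_image", "analyze_intent", "decide"), "shield")


def respond(plan: Plan, query: str) -> Tuple[List[ToolCall], str]:
    """Stub responder: emit the trace z first, then the final answer y."""
    trace = [ToolCall(STAGE_OF[t], t, f"<{t} output>") for t in plan.tools]
    answer = f"<final response to: {query}>"
    return trace, answer


z, y = respond(make_plan("harmful"), "What does the text in this meme say?")
print([call.stage for call in z])
```

The point of the sketch is only the ordering: the trace z is produced and can be inspected before the answer y exists.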

System Modules

Planner

Predicts risk category and selects persona, tool subset, and topology constraint

Model or implementation: Deterministic logic based on a risk classifier during data generation; learned behavior in the trained model

Responder

Generates the virtual tool trace and final response following the planner's protocol

Model or implementation: Target VLM (e.g., Qwen2.5-VL)

Novel Architectural Elements

Virtual Tool abstraction: Safety reasoning steps are formalized as typed 'tool calls' rather than free-form CoT
Topology-constrained generation: The model's reasoning path is restricted by a directed graph (e.g., Shield, Loop) to enforce safety checks
Checkable Protocol: The separation of Perception, Reasoning, and Decision into distinct, typed steps allows automated auditing
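
One way to picture the "checkable" part is a validator that walks a trace against a directed graph of allowed transitions. The sketch below is an assumption-laden illustration: the edge sets for the `linear` and `loop` topologies are invented for the example and are not taken from the paper.

```python
from typing import Dict, List, Set

# Hypothetical topologies: each maps a stage to the stages allowed next.
TOPOLOGIES: Dict[str, Dict[str, Set[str]]] = {
    "linear": {
        "START": {"Perception"},
        "Perception": {"Reasoning"},
        "Reasoning": {"Decision"},
        "Decision": {"END"},
    },
    "loop": {
        "START": {"Perception"},
        "Perception": {"Reasoning"},
        # Reasoning may request another look at the image before deciding.
        "Reasoning": {"Perception", "Decision"},
        "Decision": {"END"},
    },
}


def trace_is_valid(stages: List[str], topology: str) -> bool:
    """Return True if every consecutive transition is an edge in the graph."""
    graph = TOPOLOGIES[topology]
    path = ["START", *stages, "END"]
    return all(dst in graph.get(src, set()) for src, dst in zip(path, path[1:]))


print(trace_is_valid(["Perception", "Reasoning", "Decision"], "linear"))
print(trace_is_valid(["Reasoning", "Decision"], "linear"))
```

Because the check is purely structural, it can run automatically over every generated trace without a judge model.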

Modeling

Base Model: Qwen2.5-VL (3B and 7B variants)

Training Method: Three-stage curriculum: SFT -> DPO -> GRPO

Objective Functions:

Purpose: SFT to learn schema.
Formally: Maximize log likelihood of target trace and response: log π(z, y | x)

Purpose: DPO to distinguish valid traces from degraded ones.
Formally: Optimize log ratio of chosen vs rejected trace likelihoods: log(π(chosen)/π(rejected))

Purpose: GRPO to encourage adaptive reasoning depth and correctness.
Formally: Group relative advantage optimization with KL penalty and compound reward R(x, z, y)
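
For reference, the three objectives can be written in their standard textbook forms. The summary gives them only in words, so the notation here is an assumption: the reference policy π_ref, the temperature β, the group size G, the clip range ε, the KL weight λ, and the chosen/rejected superscripts are not from the paper. The GRPO objective is also shown at sequence level for brevity.

```latex
% SFT: maximize the likelihood of the target trace and response
\mathcal{L}_{\mathrm{SFT}}(\theta)
  = -\,\mathbb{E}_{(x,z,y)}\bigl[\log \pi_\theta(z, y \mid x)\bigr]

% DPO: prefer the chosen trace (z^+, y^+) over the degraded one (z^-, y^-)
\mathcal{L}_{\mathrm{DPO}}(\theta)
  = -\,\mathbb{E}\Bigl[\log \sigma\Bigl(
      \beta \log \frac{\pi_\theta(z^+, y^+ \mid x)}{\pi_{\mathrm{ref}}(z^+, y^+ \mid x)}
    - \beta \log \frac{\pi_\theta(z^-, y^- \mid x)}{\pi_{\mathrm{ref}}(z^-, y^- \mid x)}
    \Bigr)\Bigr]

% GRPO: group-relative advantage over G samples for the same input x
A_i = \frac{R_i - \operatorname{mean}(R_1, \dots, R_G)}
           {\operatorname{std}(R_1, \dots, R_G)},
\qquad R_i = R(x, z_i, y_i)

\mathcal{J}_{\mathrm{GRPO}}(\theta)
  = \mathbb{E}\Bigl[\frac{1}{G} \sum_{i=1}^{G}
      \min\bigl(\rho_i A_i,\ \operatorname{clip}(\rho_i, 1-\epsilon, 1+\epsilon)\, A_i\bigr)\Bigr]
    - \lambda\, D_{\mathrm{KL}}\bigl(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\bigr),
\qquad
\rho_i = \frac{\pi_\theta(z_i, y_i \mid x)}{\pi_{\theta_{\mathrm{old}}}(z_i, y_i \mid x)}
```

In each stage the probability is taken over the pair (z, y), so the trace is optimized together with the answer rather than the answer alone.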

Adaptation: Full fine-tuning (implied, not explicitly restricted to LoRA)

Training Data:

SaFeR-ToolKit Dataset: 31,654 total examples
Split: 6k SFT, 18.6k DPO, 6k GRPO, 1k Eval
Sources: Synthesized from BeaverTails-V, JailBreakV-28k, and general reasoning tasks

Key Hyperparameters:

format_reward: Binary (0/1)
depth_reward_penalty: Logarithmic scaling for |z| < 3
safety_gating: Strict thresholds τ_safe, τ_task
scoring_model: Qwen3-VL-32B (used as reward model for semantic scoring)
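
To make the reward components concrete, here is a rough sketch of how a format term, a depth term, and a gated semantic term might combine. Everything beyond the component names is an assumption: the exact logarithmic form, the threshold values for τ_safe and τ_task, the averaging of safety and task scores, and the unweighted sum are illustrative choices, not the paper's formula.

```python
import math


def format_reward(trace_is_well_formed: bool) -> float:
    """Binary format reward: 1 if the trace parses, else 0."""
    return 1.0 if trace_is_well_formed else 0.0


def depth_reward(num_calls: int, min_depth: int = 3) -> float:
    """Full credit at min_depth calls or more; log-scaled credit below it.

    The specific log form is an assumption for illustration.
    """
    if num_calls >= min_depth:
        return 1.0
    return math.log(1 + num_calls) / math.log(1 + min_depth)


def semantic_reward(safety: float, task: float,
                    tau_safe: float = 0.5, tau_task: float = 0.5) -> float:
    """Gated semantic score: zero unless both scores clear their thresholds.

    safety and task are assumed to be judge-model scores in [0, 1];
    the threshold values are placeholders.
    """
    if safety < tau_safe or task < tau_task:
        return 0.0
    return 0.5 * (safety + task)


def compound_reward(trace_is_well_formed: bool, num_calls: int,
                    safety: float, task: float) -> float:
    """Hypothetical unweighted sum of the three components."""
    return (format_reward(trace_is_well_formed)
            + depth_reward(num_calls)
            + semantic_reward(safety, task))


print(compound_reward(True, 3, 1.0, 1.0))
```

The gating step is what keeps a well-formatted, deep trace from earning semantic credit when the judged safety or task score is below threshold.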

Compute: Not reported in the paper

Comparison to Prior Work

vs. Standard DPO: Optimizes the *process* (tool trace) via structural perturbations, not just the final output
vs. Post-hoc Guards: Integrates safety into the generation process ('safe by design') rather than filtering afterwards, preserving helpfulness
vs. CoT Safety (e.g., Quiet-STaR): Enforces a strict *protocol* (typed tools, topology) rather than free-form thought, enabling automated verification
vs. RLAIF [not cited in paper]: Uses a structured tool-based reward function (depth + format + semantic) specifically for safety, rather than generic AI feedback

Limitations

Relies on a deterministic planner for data generation, which might limit diverse strategy discovery
Inference latency increases due to the generation of intermediate tool traces
Semantic reward signal depends on a larger model (Qwen3-VL-32B), creating a dependency

Reproducibility

Code: https://github.com/Duebassx/SaFeR_ToolKit

A dataset of 31k examples is provided. Training requires the Qwen2.5-VL base models, and reward calculation uses Qwen3-VL-32B.

📊 Experiments & Results

Evaluation Setup

Multimodal safety and general capability assessment

Benchmarks:

Safety Benchmark (Safety/Jailbreak resistance) [New]
General Capabilities (standard VLM benchmarks; MMBench etc. implied)

Metrics:

Safety Score
Helpfulness Score
Reasoning Rigor
General Capability Score
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Performance on Qwen2.5-VL-7B shows large gains in safety and reasoning metrics compared to the base model.
Safety Benchmark (7B)	Safety Score	53.21	86.34	+33.13
Safety Benchmark (7B)	Helpfulness Score	52.92	80.79	+27.87
Safety Benchmark (7B)	Reasoning Rigor	19.26	85.34	+66.08
Performance on Qwen2.5-VL-3B mirrors the 7B results, showing the method works across model sizes.
Safety Benchmark (3B)	Safety Score	29.39	84.40	+55.01
General capability results demonstrate that the safety alignment does not degrade overall performance.
General Benchmarks (Avg)	Average Score	66.39	66.81	+0.42
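
The Δ column can be re-derived from the Baseline and This Paper columns. The short script below does that and also computes relative gains, which the table does not report; the row labels and helper names are just for this example.

```python
# (label, baseline, this_paper) copied from the table above.
rows = [
    ("7B Safety", 53.21, 86.34),
    ("7B Helpfulness", 52.92, 80.79),
    ("7B Reasoning Rigor", 19.26, 85.34),
    ("3B Safety", 29.39, 84.40),
    ("General (Avg)", 66.39, 66.81),
]


def delta(baseline: float, new: float) -> float:
    """Absolute change in points, rounded to two decimals."""
    return round(new - baseline, 2)


def relative_gain_pct(baseline: float, new: float) -> float:
    """Change as a percentage of the baseline score."""
    return round(100.0 * (new - baseline) / baseline, 1)


for label, base, new in rows:
    print(f"{label}: delta={delta(base, new):+.2f}, "
          f"relative={relative_gain_pct(base, new):+.1f}%")
```

Running it reproduces the Δ values listed in the table, and the relative figures show that the 3B Safety gain is much larger in proportion to its baseline than the 7B Safety gain.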

Experiment Figures

The three-stage training curriculum (SFT -> DPO -> GRPO) and data synthesis pipeline.

Main Takeaways

SaFeR-ToolKit achieves a simultaneous increase in Safety, Helpfulness, and Reasoning Rigor, overcoming the safety-helpfulness trade-off (over-refusal) common in baselines.
The method scales effectively from 3B to 7B models, with the 3B model showing particularly large relative gains.
General multimodal capabilities are preserved or slightly improved, unlike guard-based methods, which often degrade general performance.

📚 Prerequisite Knowledge

Prerequisites

Vision-Language Models (VLMs)
Safety Alignment (Jailbreaking, Over-refusal)
Reinforcement Learning from Human Feedback (RLHF)
Chain-of-Thought (CoT) Reasoning

Key Terms

GRPO: Group Relative Policy Optimization—an RL algorithm that optimizes a policy by comparing a group of outputs generated for the same input, using the group average as a baseline

DPO: Direct Preference Optimization—an alignment method that optimizes a policy to prefer chosen responses over rejected ones without training a separate reward model

SFT: Supervised Fine-Tuning—training a model on labeled examples (demonstrations) to teach it a specific format or behavior

Virtual Tools: Text-defined operators (not external APIs) that the model invokes to produce structured intermediate reasoning artifacts (e.g., risk assessment)

Planner: A module that determines the safety strategy (persona, tool subset, topology) based on the input's estimated risk

Responder: The agent that executes the tool trace and generates the final response based on the planner's configuration

Topology: The constrained graph structure defining allowed transitions between virtual tools (e.g., linear, tree, shield)

Over-refusal: A failure mode where a safety-aligned model refuses to answer benign or helpful requests due to over-sensitivity