Safety Under Scaffolding: How Evaluation Conditions Shape Measured Safety

📝 Paper Summary

Safety Evaluation, Agentic AI Safety

The apparent safety degradation attributed to agentic scaffolds is largely a measurement artifact: task decomposition inadvertently converts multiple-choice benchmark items into open-ended ones.

Core Problem

Safety benchmarks evaluate models in isolation using multiple-choice formats, but production systems wrap models in agentic scaffolds (reasoning loops, delegation) that restructure inputs and strip answer options.

Why it matters:

Current responsible-scaling policies rely on isolated benchmark scores which may not predict safety in actual agentic deployments
Scaffolds like Map-Reduce effectively change the evaluation format (stripping options) without the evaluator's awareness, invalidating the baseline comparison
The measurement error from format shifting (5–20pp) often exceeds the actual safety impact of the scaffold architecture itself

Concrete Example: A model scoring 83% safe on a bias benchmark (multiple-choice) scores 99% when the same questions are posed without options. When a Map-Reduce scaffold decomposes a task, it strips the options, artificially triggering this score shift.

Key Novelty

Decomposition of Scaffold vs. Format Effects

Disentangles the mechanical effect of the scaffold (reasoning structure) from the representational effect (format conversion from MC to Open-Ended)
Identifies that Map-Reduce scaffolds inadvertently convert MC tasks to Open-Ended tasks by stripping options during decomposition
Demonstrates that simply propagating answer choices to worker sub-calls recovers 40–89% of the apparent safety degradation
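
The difference between an option-stripping and an option-preserving Map-Reduce scaffold can be illustrated with a minimal sketch. Everything here is hypothetical (the `MCItem` type, the prompt wording, the `model` callable, and the choice to leave the reduce prompt without options); the summary does not describe the paper's actual scaffold code, only that worker sub-calls either lose or retain the answer choices.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class MCItem:
    """A multiple-choice benchmark item (hypothetical container)."""
    question: str
    options: List[str]


def build_worker_prompts(item: MCItem, n_workers: int, keep_options: bool) -> List[str]:
    """Map step: build one prompt per worker sub-call."""
    option_block = "\n".join(
        f"({chr(ord('A') + i)}) {text}" for i, text in enumerate(item.options)
    )
    prompts = []
    for k in range(n_workers):
        prompt = f"Sub-task {k + 1} of {n_workers}. Analyse the question:\n{item.question}"
        if keep_options:
            # Option-preserving variant: the worker still sees a multiple-choice item.
            prompt += f"\nAnswer options:\n{option_block}"
        # Without the options, the worker effectively answers an open-ended question.
        prompts.append(prompt)
    return prompts


def run_map_reduce(
    item: MCItem,
    model: Callable[[str], str],
    n_workers: int = 2,
    keep_options: bool = False,
) -> str:
    """Run the map step on each worker prompt, then aggregate with one reduce call."""
    worker_outputs = [model(p) for p in build_worker_prompts(item, n_workers, keep_options)]
    reduce_prompt = (
        "Combine the analyses below into one final answer:\n" + "\n---\n".join(worker_outputs)
    )
    return model(reduce_prompt)
```

Comparing `keep_options=False` against `keep_options=True` on identical items is one way to separate the format effect from the effect of the decomposition itself.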

Evaluation Highlights

Map-Reduce scaffolding degrades measured safety with a Risk Difference of -7.3pp (NNH=14), while ReAct and Multi-Agent scaffolds show negligible effects (<2pp).
Switching evaluation format from Multiple-Choice to Open-Ended on identical items shifts safety scores by 5–20 percentage points, an effect size larger than the scaffold architecture itself.
Generalizability across benchmarks is effectively zero (G=0.000), meaning model safety rankings reverse so completely across tasks that no composite safety index is reliable.
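
The relationship between the reported Risk Difference and NNH can be checked with a short calculation. This sketch assumes RD is the scaffolded safe rate minus the isolated safe rate (so a negative value means degradation, matching the sign reported above) and that NNH is the reciprocal of the absolute risk difference, rounded up. The function names are illustrative.

```python
import math


def risk_difference_pp(safe_rate_scaffold: float, safe_rate_isolated: float) -> float:
    """Risk difference in percentage points; negative means the scaffold looks less safe."""
    return 100.0 * (safe_rate_scaffold - safe_rate_isolated)


def number_needed_to_harm(rd_pp: float) -> int:
    """Queries needed, on average, for one additional safety failure (1 / |RD|, rounded up)."""
    if rd_pp >= 0:
        raise ValueError("NNH is only defined here for a degradation (negative RD)")
    return math.ceil(100.0 / abs(rd_pp))


# With the reported Map-Reduce RD of -7.3pp: 100 / 7.3 is about 13.7, which rounds up to 14.
nnh_map_reduce = number_needed_to_harm(-7.3)
```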

Breakthrough Assessment

9/10

Fundamentally challenges the validity of current safety certification for agentic AI. Presents evidence that the widely observed 'agentic misalignment' is, at least for option-stripping Map-Reduce scaffolds, largely a measurement artifact of format shifting.

⚙️ Technical Details

Problem Definition

Setting: Evaluation of Large Language Models wrapped in agentic scaffolding on safety benchmarks

Inputs: Safety benchmark prompts (typically Multiple-Choice)

Outputs: Scaffolded model responses (Open-Ended or derived selections)

Pipeline Flow

Input Prompt -> Scaffold Architecture -> [Decomposition/Reasoning] -> Base Model -> [Aggregation] -> Final Response

System Modules

Scaffold Controller

Intercepts user prompt and restructures execution based on architectural pattern

Model or implementation: Scripted Logic / Python Wrapper

Base Model

Generates text/reasoning based on scaffold instructions

Model or implementation: Claude Opus 4.6 (primary), Llama, GPT, Gemini variants

Novel Architectural Elements

Comparison of four distinct deployment configurations (Isolated, Map-Reduce, ReAct, Multi-Agent) on identical safety inputs
Use of 'Option-Preserving Map-Reduce' variant to isolate format effects from reasoning effects

Modeling

Base Model: Claude Opus 4.6 (primary tested model), plus 5 other frontier models

Compute: Not reported in the paper

Comparison to Prior Work

vs. Agent-SafetyBench: Controls for specific scaffold architectures (ReAct vs Map-Reduce) rather than treating 'agents' as a monolith
vs. AgentHarm: Identifies format conversion (MC -> OE) as the primary driver of score changes, rather than 'agentic misalignment'
vs. General Safety Audits: Uses equivalence testing (TOST) to show that certain scaffolds preserve safety within a pre-specified margin, rather than merely failing to find harm [not cited in paper]

Limitations

Study restricted to proxy safety properties (bias, sycophancy, truthfulness) rather than catastrophic risks
Relies on LLM-as-a-judge for open-ended scoring (XSTest), though validated with blinding and multiple models
Findings on Map-Reduce degradation are specific to implementations that strip answer options during decomposition

📊 Experiments & Results

Evaluation Setup

Comparative evaluation of 6 models across 4 deployment configurations on 4 safety benchmarks

Benchmarks:

BBQ (Bias evaluation)
TruthfulQA (Truthfulness evaluation)
XSTest (Over-refusal / Safety refusal)
Sycophancy (Agreeableness with user views)

Metrics:

Risk Difference (RD)
Number Needed to Harm (NNH)
Generalizability Coefficient (G)

Statistical methodology: Pre-registered equivalence testing (TOST) with ±2pp bounds; specification curve analysis (384 specifications)
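
A TOST equivalence check with ±2pp bounds can be sketched as follows. This is a generic large-sample (normal approximation) version for two independent proportions, written with the standard library only; the paper's exact test statistic, pooling, and any clustering adjustments are not given in this summary, so treat the details as assumptions.

```python
import math


def normal_cdf(z: float) -> float:
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def tost_two_proportions(safe_a: int, n_a: int, safe_b: int, n_b: int, margin: float = 0.02):
    """Return (difference, TOST p-value) for H1: |p_a - p_b| < margin."""
    p_a, p_b = safe_a / n_a, safe_b / n_b
    diff = p_a - p_b
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    if se == 0.0:
        raise ValueError("standard error is zero; the normal approximation does not apply")
    # One-sided test 1: reject H0 (diff <= -margin) when diff is well above -margin.
    p_lower = 1.0 - normal_cdf((diff + margin) / se)
    # One-sided test 2: reject H0 (diff >= +margin) when diff is well below +margin.
    p_upper = normal_cdf((diff - margin) / se)
    # Equivalence is claimed only if both one-sided tests reject, hence the max.
    return diff, max(p_lower, p_upper)
```

A small p-value supports equivalence within the margin. A large one means equivalence was not shown, which is different from showing a difference.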

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Main comparison of scaffold effects on safety relative to isolated models.
Benchmark Mix (Pooled)	Risk Difference (RD)	0.0	-7.3	-7.3
Benchmark Mix (Pooled)	Risk Difference (RD)	0.0	-0.7	-0.7
Benchmark Mix (Pooled)	Risk Difference (RD)	0.0	-0.6	-0.6
Format dependence and sycophancy specific results.
Safety Benchmarks (Pooled)	Safety Score Shift	0	20	20
Sycophancy	Safety Improvement	0.0	2.5	+2.5
All Benchmarks	Generalizability (G)	1.0	0.000	-1.000
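
How complete rank reversals drive a generalizability coefficient to zero can be shown with a textbook one-facet crossed design (models × benchmarks). This sketch uses the standard relative G coefficient, the model variance divided by the model variance plus the residual variance over the number of benchmarks, with negative variance estimates truncated to zero. The summary does not state how the paper computed G, so this is an illustration rather than a reproduction, and the score matrices in the comments are hypothetical.

```python
from typing import List


def g_coefficient(scores: List[List[float]]) -> float:
    """Relative G coefficient for a models-by-benchmarks score matrix (rows are models)."""
    n_p, n_i = len(scores), len(scores[0])
    grand = sum(sum(row) for row in scores) / (n_p * n_i)
    row_means = [sum(row) / n_i for row in scores]
    col_means = [sum(scores[p][i] for p in range(n_p)) / n_p for i in range(n_i)]

    # Mean squares from a two-way layout with one observation per cell.
    ms_model = n_i * sum((m - grand) ** 2 for m in row_means) / (n_p - 1)
    ss_res = sum(
        (scores[p][i] - row_means[p] - col_means[i] + grand) ** 2
        for p in range(n_p)
        for i in range(n_i)
    )
    ms_res = ss_res / ((n_p - 1) * (n_i - 1))

    # Variance components; a negative model-variance estimate is truncated to zero.
    var_model = max(0.0, (ms_model - ms_res) / n_i)
    denom = var_model + ms_res / n_i
    # If all scores are identical there is no variance to generalize; return 0.0.
    return var_model / denom if denom > 0 else 0.0


# Hypothetical scores: two models whose ranking flips between two benchmarks.
g_reversed = g_coefficient([[0.9, 0.1], [0.1, 0.9]])
# Hypothetical scores: the same model is ahead on both benchmarks.
g_consistent = g_coefficient([[0.9, 0.8], [0.5, 0.4]])
```

In the reversed case the two models have identical mean scores, so the estimated model variance is zero and so is G. In the consistent case G is close to 1.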

Main Takeaways

Map-Reduce degradation is primarily a format artifact: stripping MC options accounts for 40–89% of the observed safety drop.
ReAct and Multi-Agent scaffolds are statistically equivalent to isolated models for safety (within ±2pp), countering the narrative that agentic loops inherently degrade safety.
Sycophancy is the most volatile property: it has the lowest baseline safety (31%) and shows massive model-scaffold interaction variance (Opus -16.8pp vs Llama +18.8pp).
Composite 'safety scores' are statistically invalid (G=0.000) because model rankings reverse completely between different safety benchmarks.
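
The "share of the drop explained by option stripping" in the first takeaway can be read as a simple ratio of risk differences. The sketch below assumes that reading; the -2.0pp value for the option-preserving variant is a hypothetical number chosen for illustration, not a figure from the paper.

```python
def fraction_recovered(rd_stripped_pp: float, rd_preserved_pp: float) -> float:
    """Share of the degradation that disappears when answer options are propagated."""
    if rd_stripped_pp >= 0:
        raise ValueError("expected a degradation (negative RD) for the option-stripping scaffold")
    return 1.0 - rd_preserved_pp / rd_stripped_pp


# Reported pooled RD for Map-Reduce (-7.3pp) versus a hypothetical -2.0pp with options kept.
example_recovered = fraction_recovered(-7.3, -2.0)  # about 0.73, inside the reported 40-89% range
```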

📚 Prerequisite Knowledge

Prerequisites

Understanding of LLM safety benchmarks (bias, truthfulness, sycophancy)
Familiarity with agentic patterns (Chain-of-Thought, ReAct, Map-Reduce)
Basic statistical hypothesis testing

Key Terms

Scaffold: A software wrapper around an LLM that structures its execution (e.g., adding reasoning traces, decomposing tasks, or managing multi-agent interactions)

Map-Reduce: A scaffold pattern that decomposes a complex prompt into sub-tasks (Map), processes them in parallel, and aggregates the results (Reduce)

ReAct: Reason+Act—a scaffold where the model interleaves reasoning traces ('Thought:') with action execution

NNH: Number Needed to Harm, the reciprocal of the absolute risk difference; it indicates how many queries must be processed to produce one additional safety failure compared to baseline (lower is worse)

TOST: Two One-Sided Tests, a statistical method that tests for equivalence (showing that two conditions differ by less than a pre-specified margin) rather than for a difference

Sycophancy: The tendency of a model to agree with the user's stated or implied views, regardless of truth or safety

Risk Difference (RD): The difference, in percentage points, between the safety rates of the scaffolded system and the isolated model; as reported here, negative values indicate lower measured safety under the scaffold

Specification Curve Analysis: An analytical method that runs all defensible variations of data processing and scoring to ensure findings are robust to researcher choices
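
A specification curve can be sketched as a loop over the Cartesian product of analysis choices. The choice names, their values, and the toy estimator below are all hypothetical; the summary reports only that 384 specifications were run, not which dimensions they span.

```python
from itertools import product
from typing import Callable, Dict, List, Tuple


def specification_curve(
    estimate: Callable[[Dict[str, object]], float],
    choices: Dict[str, List[object]],
) -> List[Tuple[float, Dict[str, object]]]:
    """Run `estimate` under every combination of choices; return results sorted by estimate."""
    names = list(choices)
    results = []
    for combo in product(*(choices[name] for name in names)):
        spec = dict(zip(names, combo))
        results.append((estimate(spec), spec))
    results.sort(key=lambda pair: pair[0])
    return results


# Hypothetical analysis choices (2 x 2 x 2 = 8 specifications).
toy_choices = {
    "judge": ["judge_a", "judge_b"],
    "parse_rule": ["strict", "lenient"],
    "exclude_refusals": [False, True],
}


def toy_estimate(spec: Dict[str, object]) -> float:
    """Stand-in for re-running the analysis; returns a made-up RD in percentage points."""
    rd = -7.3
    if spec["parse_rule"] == "lenient":
        rd += 0.5
    if spec["exclude_refusals"]:
        rd -= 0.3
    return rd


curve = specification_curve(toy_estimate, toy_choices)
share_negative = sum(1 for est, _ in curve if est < 0) / len(curve)
```

A finding is considered robust when the sign and rough size of the estimate hold across most of the curve, rather than under one hand-picked specification.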