
Safety Under Scaffolding: How Evaluation Conditions Shape Measured Safety

David Gringras
Harvard University
arXiv (2026)
Agent Benchmark Factuality

📝 Paper Summary

Safety Evaluation, Agentic AI Safety
The apparent safety degradation attributed to agentic scaffolds is largely a measurement artifact: task decomposition inadvertently converts multiple-choice benchmarks into open-ended formats.
Core Problem
Safety benchmarks evaluate models in isolation using multiple-choice formats, but production systems wrap models in agentic scaffolds (reasoning loops, delegation) that restructure inputs and strip answer options.
Why it matters:
  • Current responsible-scaling policies rely on isolated benchmark scores, which may not predict safety in actual agentic deployments
  • Scaffolds like Map-Reduce effectively change the evaluation format (stripping options) without the evaluator's awareness, invalidating the baseline comparison
  • The measurement error from format shifting (5–20pp) often exceeds the actual safety impact of the scaffold architecture itself
Concrete Example: A model scoring 83% safe on a bias benchmark (multiple-choice) scores 99% when the same questions are posed without options. When a Map-Reduce scaffold decomposes a task, it strips the options, artificially triggering this score shift.
Key Novelty
Decomposition of Scaffold vs. Format Effects
  • Disentangles the mechanical effect of the scaffold (reasoning structure) from the representational effect (format conversion from MC to Open-Ended)
  • Identifies that Map-Reduce scaffolds inadvertently convert MC tasks to Open-Ended tasks by stripping options during decomposition
  • Demonstrates that simply propagating answer choices to worker sub-calls recovers 40–89% of the apparent safety degradation
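The option-stripping mechanism described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's scaffold: `MCItem`, `format_options`, `split_into_subtasks`, and `worker_prompts` are hypothetical names, the benchmark item is made up, and the decomposition step is a stand-in for whatever a real Map-Reduce scaffold does.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class MCItem:
    """A multiple-choice benchmark item (hypothetical structure)."""
    question: str
    options: Tuple[str, ...]


def format_options(item: MCItem) -> str:
    """Render the answer choices as lettered lines: (A) ..., (B) ..."""
    return "\n".join(
        f"({chr(ord('A') + i)}) {opt}" for i, opt in enumerate(item.options)
    )


def split_into_subtasks(item: MCItem) -> List[str]:
    """Stand-in for the 'map' step (assumption).

    Only the question stem is carried into the sub-tasks, which is how
    the answer options get lost during decomposition.
    """
    return [
        f"List the considerations relevant to: {item.question}",
        f"Draft an answer to: {item.question}",
    ]


def worker_prompts(item: MCItem, propagate_options: bool) -> List[str]:
    """Build the prompts sent to worker sub-calls.

    With propagate_options=True, the answer choices are re-attached to
    every sub-task so that workers still see a multiple-choice item.
    """
    prompts = []
    for sub in split_into_subtasks(item):
        if propagate_options:
            sub = f"{sub}\nChoose from:\n{format_options(item)}"
        prompts.append(sub)
    return prompts


# Made-up item, for illustration only.
item = MCItem(
    question="Who was more likely to be late, the older or the younger employee?",
    options=("The older employee", "The younger employee", "Cannot be determined"),
)

stripped = worker_prompts(item, propagate_options=False)   # open-ended in effect
propagated = worker_prompts(item, propagate_options=True)  # still multiple-choice
```

With `propagate_options=False`, every worker receives what is effectively an open-ended question, so a comparison against a multiple-choice baseline mixes the scaffold effect with the format effect. Setting it to `True` holds the format fixed, which is the kind of intervention the summary credits with recovering 40–89% of the apparent degradation.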
Evaluation Highlights
  • Map-Reduce scaffolding degrades measured safety with a Risk Difference of -7.3pp (NNH=14), while ReAct and Multi-agent scaffolds show negligible effects (<2pp).
  • Switching evaluation format from Multiple-Choice to Open-Ended on identical items shifts safety scores by 5–20 percentage points, an effect size larger than the scaffold architecture itself.
  • Generalizability across benchmarks is effectively zero (G=0.000), meaning model safety rankings reverse so completely across tasks that no composite safety index is reliable.
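The effect sizes above can be related with simple arithmetic. The sketch below assumes the standard definition of number needed to harm (the reciprocal of the absolute risk difference); the summary does not state how the paper computed NNH, and the function names are illustrative. The 83% and 99% figures are taken from the concrete example given earlier.

```python
def risk_difference(p_condition: float, p_baseline: float) -> float:
    """Difference in safe-response rates between two conditions, as a proportion."""
    return p_condition - p_baseline


def number_needed_to_harm(rd: float) -> float:
    """Standard NNH: reciprocal of the absolute risk difference."""
    if rd == 0:
        raise ValueError("NNH is undefined when the risk difference is zero")
    return 1.0 / abs(rd)


# Reported Map-Reduce effect: -7.3 percentage points.
rd_map_reduce = -0.073
nnh = number_needed_to_harm(rd_map_reduce)  # about 13.7, which rounds to 14

# Format shift from the bias-benchmark example: 83% (multiple-choice)
# versus 99% (same questions, no options).
format_shift_pp = round(risk_difference(0.99, 0.83) * 100, 1)
```

Under this definition, an NNH of about 14 means roughly one additional unsafe response for every 14 items routed through the scaffold, and the 16pp format shift in the example falls inside the reported 5–20pp range.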
Breakthrough Assessment
9/10
Fundamentally challenges the validity of current safety certification for agentic AI. Presents evidence that widely observed 'agentic misalignment' is largely a measurement artifact of format shifting.