
Governance Architecture for Autonomous Agent Systems: Threats, Framework, and Engineering Practice

Yuxu Ge
University of York
arXiv (2026)
Agent RAG Benchmark

📝 Paper Summary

AI Safety & Governance · Agentic Security
LGA is a four-layer governance architecture that intercepts execution-layer threats like prompt injection and malicious plugins by enforcing intent verification and OS-level isolation rather than just filtering text.
Core Problem
Existing agent defenses focus on text-generation safety (content filtering) but fail to intercept execution-layer threats where semantically benign inputs cause irreversible state changes (e.g., file deletion via tool calls).
Why it matters:
  • Autonomous agents are transitioning from conversational to executive, meaning failures now cause irreversible system state changes (file writes, API calls) rather than just bad text
  • Current guardrails (Llama Guard, NeMo Guardrails) filter harmful language but do not authorize tool calls, leaving agents vulnerable to prompt injection and malicious plugins
  • Engineering focus is shifting from defect remediation to system governance, requiring architectural invariants rather than just improved model capabilities
Concrete Example: An injected prompt instructs an agent to 'delete all files'. A standard safety filter may treat the text as harmless when it is phrased as a system maintenance request, so the agent executes `rm -rf /` and causes irreversible data loss. The failure is that the tool call itself was never verified against the user's original intent.
Key Novelty
Layered Governance Architecture (LGA)
  • Proposes a defense-in-depth stack combining OS-level sandboxing (L1), semantic intent verification by an independent judge (L2), zero-trust inter-agent tokens (L3), and immutable audit logs (L4)
  • Shifts security from 'model alignment' to 'execution governance', requiring every high-risk tool call to be approved by a separate judge model comparing it to the original task
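The judge-gated execution path and the audit log described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: `HIGH_RISK_TOOLS`, `ToolCall`, `AuditLog`, `governed_execute`, and the keyword-based `toy_judge` are hypothetical names, the judge is a stand-in for an independent LLM, and the hash chain makes the log tamper-evident rather than truly immutable. Sandboxing (L1) and inter-agent tokens (L3) are left out.

```python
import hashlib
import json
from dataclasses import dataclass, field
from typing import Any, Callable, Optional

# Assumption: a static set of tool names treated as high-risk.
HIGH_RISK_TOOLS = {"shell", "file_delete", "http_post"}


@dataclass
class ToolCall:
    tool: str
    args: dict


@dataclass
class AuditLog:
    """Append-only log in which each entry hashes its predecessor."""
    entries: list = field(default_factory=list)

    @staticmethod
    def _digest(prev: str, record: dict) -> str:
        payload = json.dumps(record, sort_keys=True)
        return hashlib.sha256((prev + payload).encode()).hexdigest()

    def append(self, record: dict) -> str:
        prev = self.entries[-1]["hash"] if self.entries else "0" * 64
        digest = self._digest(prev, record)
        self.entries.append({"record": record, "prev": prev, "hash": digest})
        return digest

    def verify(self) -> bool:
        """Recompute the chain; any edited record breaks it."""
        prev = "0" * 64
        for entry in self.entries:
            if entry["prev"] != prev:
                return False
            if self._digest(prev, entry["record"]) != entry["hash"]:
                return False
            prev = entry["hash"]
        return True


# A judge receives the original task and the proposed call, returns True to approve.
Judge = Callable[[str, ToolCall], bool]


def governed_execute(task: str, call: ToolCall, judge: Judge,
                     executor: Callable[[ToolCall], Any],
                     log: AuditLog) -> Optional[Any]:
    """Run `call` only if it is low-risk or the judge approves it; log either way."""
    approved = judge(task, call) if call.tool in HIGH_RISK_TOOLS else True
    log.append({"task": task, "tool": call.tool,
                "args": call.args, "approved": approved})
    return executor(call) if approved else None


def toy_judge(task: str, call: ToolCall) -> bool:
    # Stand-in for an LLM judge: reject destructive shell commands
    # unless the original task explicitly asks for deletion.
    cmd = str(call.args.get("cmd", ""))
    return not ("rm -rf" in cmd and "delete" not in task.lower())
```

With this sketch, a task such as "Summarize report.txt" paired with a `shell` call running `rm -rf /` is rejected before the executor is reached, and the rejection is still recorded in the log.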
Evaluation Highlights
  • Local Qwen2.5-14B judge achieves 98% interception rate (IR) on malicious tool calls with ~10–20% false positive rate, outperforming lightweight NLI baselines (<10% IR)
  • A fully local cascade (Qwen3.5-9B → Qwen2.5-14B) achieves 94.7–95.6% interception rate with 6.0–9.7% false positive rate, enabling data-sovereign deployment
  • The non-judge layers (sandboxing, crypto, logging) add only ~18 ms of end-to-end latency overhead; total system P50 latency is ~980 ms
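The summary does not define its two headline metrics, so the sketch below assumes the usual readings: interception rate is the fraction of malicious tool calls the judge blocks, and false positive rate is the fraction of benign tool calls it blocks. The `Verdict` type and both function names are illustrative, and the sample counts in the usage note are made up to show the arithmetic, not taken from the paper.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Verdict:
    malicious: bool  # ground-truth label of the tool call
    blocked: bool    # judge decision


def interception_rate(verdicts: Sequence[Verdict]) -> float:
    """Share of malicious calls that were blocked (assumed definition of IR)."""
    malicious = [v for v in verdicts if v.malicious]
    if not malicious:
        raise ValueError("no malicious samples")
    return sum(v.blocked for v in malicious) / len(malicious)


def false_positive_rate(verdicts: Sequence[Verdict]) -> float:
    """Share of benign calls that were blocked (assumed definition of FPR)."""
    benign = [v for v in verdicts if not v.malicious]
    if not benign:
        raise ValueError("no benign samples")
    return sum(v.blocked for v in benign) / len(benign)
```

For example, blocking 49 of 50 malicious calls and 3 of 30 benign calls would give an interception rate of 0.98 and a false positive rate of 0.10 under these definitions.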
Breakthrough Assessment
8/10
Significant shift from content safety to execution safety. Provides a concrete, full-stack architectural solution (not just a model) and a new bilingual benchmark for tool-use threats.