Governance Architecture for Autonomous Agent Systems: Threats, Framework, and Engineering Practice

📝 Paper Summary

AI Safety & Governance, Agentic Security

LGA is a four-layer governance architecture that intercepts execution-layer threats like prompt injection and malicious plugins by enforcing intent verification and OS-level isolation rather than just filtering text.

Core Problem

Existing agent defenses focus on text-generation safety (content filtering) but fail to intercept execution-layer threats where semantically benign inputs cause irreversible state changes (e.g., file deletion via tool calls).

Why it matters:

Autonomous agents are transitioning from conversational to executive, meaning failures now cause irreversible system state changes (file writes, API calls) rather than just bad text
Current guardrails (Llama Guard, NeMo) filter harmful language but lack tool-call authorization, leaving agents vulnerable to prompt injection and malicious plugins
Engineering focus is shifting from defect remediation to system governance, requiring architectural invariants rather than just improved model capabilities

Concrete Example: An input carrying a prompt injection instructs an agent to 'delete all files'. A standard safety filter may pass the text as harmless when it is phrased as a system maintenance request, and the agent then executes `rm -rf /`. The result is irreversible data loss, because the tool call itself was never verified against the user's original intent.

Key Novelty

Layered Governance Architecture (LGA)

Proposes a defense-in-depth stack combining OS-level sandboxing (L1), semantic intent verification by an independent judge (L2), zero-trust inter-agent tokens (L3), and immutable audit logs (L4)
Shifts security from 'model alignment' to 'execution governance', requiring every high-risk tool call to be approved by a separate judge model comparing it to the original task

Evaluation Highlights

Local Qwen2.5-14B judge achieves 98% interception rate (IR) on malicious tool calls with ~10-20% false positive rate, outperforming lightweight NLI baselines (<10% IR)
A fully local cascade (Qwen3.5-9B → Qwen2.5-14B) achieves 94.7–95.6% interception rate with 6.0–9.7% false positive rate, enabling data-sovereign deployment
End-to-end latency overhead for the non-judge layers (sandboxing, crypto, logging) is only ~18ms, with total system P50 latency ~980ms

Breakthrough Assessment

8/10

Significant shift from content safety to execution safety. Provides a concrete, full-stack architectural solution (not just a model) and a new bilingual benchmark for tool-use threats.

⚙️ Technical Details

Problem Definition

Setting: Multi-agent system S = <A, K, T, E> where agents A map inputs X and state Sigma to tool sequences T+. The goal is to prevent adversarial inputs from triggering unauthorized tool calls.

Inputs: Agent input x containing potential prompt injection or poisoned content

Outputs: Verified tool execution sequence or blocked attempt with audit log entry

Pipeline Flow

Layer 1: Execution Sandboxing (OS isolation)
Layer 2: Intent Verification (LLM Judge)
Layer 3: Zero-Trust Protocol (Inter-agent auth)
Layer 4: Immutable Audit Log (Forensics)

System Modules

Layer 1: Sandbox

Confines agent execution to prevent unauthorized system access

Model or implementation: OS-level isolation (unshare/namespaces)
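
As a rough illustration of namespace-based confinement, the sketch below wraps a tool command in the util-linux `unshare` command. The function names `build_sandbox_argv` and `run_sandboxed` and the particular flag selection are assumptions made for this example; the summary does not state which namespaces or additional restrictions the paper's Layer 1 actually applies.

```python
import shutil
import subprocess
from typing import List


def build_sandbox_argv(tool_argv: List[str], allow_network: bool = False) -> List[str]:
    """Wrap a tool command so it starts inside fresh Linux namespaces (hypothetical helper)."""
    argv = [
        "unshare",
        "--user", "--map-root-user",         # new user namespace; caller mapped to root inside it
        "--mount",                           # private mount namespace
        "--pid", "--fork", "--mount-proc",   # new PID namespace with its own /proc
    ]
    if not allow_network:
        argv.append("--net")                 # new, empty network namespace
    return argv + ["--"] + list(tool_argv)


def run_sandboxed(tool_argv: List[str], timeout_s: float = 10.0) -> subprocess.CompletedProcess:
    """Run the wrapped command; Linux-only, and the host must permit user namespaces."""
    if shutil.which("unshare") is None:
        raise RuntimeError("unshare not found on PATH")
    return subprocess.run(
        build_sandbox_argv(tool_argv),
        capture_output=True, text=True, timeout=timeout_s,
    )
```

Namespaces alone do not hide the host filesystem, so a real deployment would likely combine this with a restricted root directory and system-call filtering.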

Layer 2: Judge

Verifies whether tool call a is consistent with task T via prompt template rho

Model or implementation: LLM Judge (e.g., Qwen2.5-14B, GPT-4o-mini)
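
The judge step can be pictured as a single function that fills a prompt template with the original task and the proposed tool call, then parses the model's verdict. The sketch below is a minimal version under stated assumptions: `JUDGE_TEMPLATE`, `ToolCall`, `judge_tool_call`, and the ALLOW/BLOCK answer format are all hypothetical, and `stub_llm` is a stand-in for a real model client such as a local Qwen2.5-14B endpoint.

```python
import json
from dataclasses import dataclass
from typing import Callable

# Hypothetical template; the paper's actual prompt wording is not given in this summary.
JUDGE_TEMPLATE = (
    "You are a security judge.\n"
    "Original task: {task}\n"
    "Proposed tool call: {call}\n"
    "Answer ALLOW if the call is consistent with the task, otherwise BLOCK."
)


@dataclass(frozen=True)
class ToolCall:
    name: str
    arguments: dict


def judge_tool_call(task: str, call: ToolCall, llm: Callable[[str], str]) -> bool:
    """Return True only when the judge explicitly answers ALLOW (fails closed otherwise)."""
    call_text = json.dumps({"name": call.name, "arguments": call.arguments}, sort_keys=True)
    verdict = llm(JUDGE_TEMPLATE.format(task=task, call=call_text))
    return verdict.strip().upper().startswith("ALLOW")


def stub_llm(prompt: str) -> str:
    """Stand-in for a real judge model: blocks anything containing 'rm -rf'."""
    return "BLOCK" if "rm -rf" in prompt else "ALLOW"


if __name__ == "__main__":
    task = "Summarize report.txt"
    print(judge_tool_call(task, ToolCall("read_file", {"path": "report.txt"}), stub_llm))
    print(judge_tool_call(task, ToolCall("shell", {"cmd": "rm -rf /"}), stub_llm))
```

Treating any unparseable answer as a block is a design choice of this sketch: it trades some false positives for the guarantee that a malformed judge response never authorizes a call.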

Layer 3: Zero-Trust

Enforces minimum-privilege access between agents using capability tokens

Model or implementation: Cryptographic Protocol (HMAC-SHA256)
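
One way to read "capability tokens" backed by HMAC-SHA256 is a signed, expiring claim that a given agent may invoke a given tool. The sketch below assumes that reading; the token layout, the claim names (`agent`, `tool`, `exp`), and the functions `issue_token` and `verify_token` are illustrative and not taken from the paper.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def issue_token(key: bytes, agent_id: str, tool: str, ttl_s: int,
                now: Optional[float] = None) -> str:
    """Sign a claim that `agent_id` may call `tool` until the TTL expires."""
    issued = time.time() if now is None else now
    claims = {"agent": agent_id, "tool": tool, "exp": issued + ttl_s}
    payload = base64.urlsafe_b64encode(json.dumps(claims, sort_keys=True).encode())
    tag = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return payload.decode() + "." + tag


def verify_token(key: bytes, token: str, agent_id: str, tool: str,
                 now: Optional[float] = None) -> bool:
    """Check the signature first, then the agent, tool, and expiry claims."""
    current = time.time() if now is None else now
    try:
        payload_s, tag = token.split(".", 1)
    except ValueError:
        return False  # no separator: not a token produced by issue_token
    expected = hmac.new(key, payload_s.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected.encode(), tag.encode()):
        return False  # forged or tampered token
    claims = json.loads(base64.urlsafe_b64decode(payload_s))
    return (claims["agent"] == agent_id
            and claims["tool"] == tool
            and current < claims["exp"])
```

Binding the token to one tool name is what gives the minimum-privilege property here: a token issued for `read_file` does not verify for any other tool.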

Layer 4: Audit

Records all invocations for forensics

Model or implementation: Append-only storage
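
The summary says only that Layer 4 uses append-only storage. A common way to make such a log tamper-evident is to chain entries by hash, and the sketch below assumes that technique; the `AuditLog` class and its entry layout are hypothetical and may differ from what the paper implements.

```python
import hashlib
import json
from typing import List

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


class AuditLog:
    """In-memory hash-chained log: each entry commits to the one before it."""

    def __init__(self) -> None:
        self._entries: List[dict] = []

    @staticmethod
    def _digest(prev: str, event: dict) -> str:
        body = json.dumps({"prev": prev, "event": event}, sort_keys=True)
        return hashlib.sha256(body.encode()).hexdigest()

    def append(self, event: dict) -> str:
        prev = self._entries[-1]["hash"] if self._entries else GENESIS
        digest = self._digest(prev, event)
        self._entries.append({"prev": prev, "event": event, "hash": digest})
        return digest

    def verify(self) -> bool:
        """Recompute the chain; any edited or reordered entry breaks it."""
        prev = GENESIS
        for entry in self._entries:
            if entry["prev"] != prev or self._digest(prev, entry["event"]) != entry["hash"]:
                return False
            prev = entry["hash"]
        return True
```

A chain like this detects modification but does not prevent it, so the latest hash would still need to be stored somewhere the agent cannot write.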

Novel Architectural Elements

Four-layer unified governance stack specifically for executive agents (vs. conversational guardrails)
Integration of OS-level isolation (L1) directly with semantic intent verification (L2)
Cascade judge architecture (Small Local → Large Local/Cloud) to balance latency and security
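
The cascade idea can be sketched as two judges called in sequence. The escalation rule used below (the small judge clears calls it considers safe, and only calls it flags are sent to the large judge for a final decision) is an assumption chosen for illustration; the summary does not describe how the paper's cascade decides when to escalate. The names `Judge`, `cascade_allows`, and both stub judges are hypothetical.

```python
from typing import Callable

# A judge takes (task, tool_call_text) and returns True if the call is allowed.
Judge = Callable[[str, str], bool]


def cascade_allows(task: str, call: str, small: Judge, large: Judge) -> bool:
    """Fast path: the small judge allows. Slow path: the large judge rules on flagged calls."""
    if small(task, call):
        return True
    return large(task, call)


def small_stub(task: str, call: str) -> bool:
    """Cheap, over-cautious stand-in: flags anything that deletes."""
    return "delete" not in call


def large_stub(task: str, call: str) -> bool:
    """Stronger stand-in: allows a delete only when the task asks for cleanup."""
    return "clean up" in task


if __name__ == "__main__":
    print(cascade_allows("clean up temp files", "delete /tmp/a.txt", small_stub, large_stub))
    print(cascade_allows("summarize notes", "delete /home/user", small_stub, large_stub))
```

Under this rule the large model runs only on flagged calls, which limits added latency, but an attack the small judge misses is never re-examined.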

Modeling

Base Model: Evaluated Judges: Qwen3.5-4B, Llama-3.1-8B, Qwen3.5-9B, Qwen2.5-14B, GPT-4o-mini

Training Method: Inference-only evaluation (Prompting)

Compute: End-to-end non-judge latency ~18ms; Total P50 latency ~980ms (dominated by judge)

Comparison to Prior Work

vs. Llama Guard: LGA targets execution-layer tool calls, not just text generation safety
vs. NeMo Guardrails: LGA enforces OS-level isolation and tool authorization, whereas NeMo focuses on dialogue flows
vs. ToolEmu: LGA is a deployable defense architecture, whereas ToolEmu is an evaluation environment
vs. InjecAgent [not cited in paper]: InjecAgent is a benchmark showing vulnerabilities; LGA provides the architectural mitigation for those specific threats
vs. AutoGen/LangChain: These are frameworks without native execution-layer isolation; LGA adds the missing governance layers

Limitations

TC3 (malicious plugins) remains hard to detect (75-94% IR) compared to other threats, requiring future work
Judge models can introduce latency (up to ~1s), which may be a bottleneck for real-time systems
PPV is low (22.7%) at low attack prevalence (1%), implying a high rate of false alarms in safe environments
The English portion of the bilingual benchmark relies on machine translation rather than natively written samples
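
The low PPV at low prevalence follows from Bayes' rule: when attacks are rare, even a modest false positive rate produces more false alarms than true detections. The sketch below works through that arithmetic. The function name `ppv` is an assumption, and the inputs are illustrative values taken from the cascade's reported upper-end figures (95.6% IR, 9.7% FPR); the summary does not say which configuration produced the 22.7% figure, so this example is not expected to reproduce it.

```python
def ppv(interception_rate: float, false_positive_rate: float, prevalence: float) -> float:
    """Positive predictive value: P(call is malicious | call was blocked)."""
    true_positives = interception_rate * prevalence
    false_positives = false_positive_rate * (1.0 - prevalence)
    return true_positives / (true_positives + false_positives)


if __name__ == "__main__":
    # Illustrative inputs only; see the note above about which figures these are.
    for prevalence in (0.01, 0.10, 0.50):
        print(f"prevalence={prevalence:.2f}  PPV={ppv(0.956, 0.097, prevalence):.3f}")
```

With these inputs, roughly 9 in 100 blocked calls are real attacks at 1% prevalence, and PPV rises quickly as attacks become more common.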

Reproducibility

Code: https://github.com/openclaw/openclaw

The OpenClaw framework code is publicly available at the repository above; commit f014e25 was used for analysis. The paper mentions a constructed bilingual benchmark of 1,081 tool-call samples and describes the judge prompts as templates.

📊 Experiments & Results

Evaluation Setup

Benchmarking intent verification judges against three threat classes using a custom dataset applied to the OpenClaw framework.

Benchmarks:

Custom Bilingual Benchmark (tool-call authorization, binary classification) [New]
InjecAgent (Indirect prompt injection)

Metrics:

Interception Rate (IR)
False Positive Rate (FPR)
Latency (ms)
Statistical methodology: Not explicitly reported in the paper
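
Both headline metrics are simple ratios over labeled tool calls: IR is the share of malicious calls that were blocked, and FPR is the share of benign calls that were blocked. The sketch below computes them from parallel lists of labels and decisions; the function name `interception_metrics` and the toy data are assumptions for illustration.

```python
from typing import Sequence, Tuple


def interception_metrics(is_malicious: Sequence[bool],
                         blocked: Sequence[bool]) -> Tuple[float, float]:
    """Return (interception rate, false positive rate) as fractions in [0, 1]."""
    if len(is_malicious) != len(blocked):
        raise ValueError("labels and decisions must have the same length")
    attacks = sum(1 for m in is_malicious if m)
    benign = len(is_malicious) - attacks
    if attacks == 0 or benign == 0:
        raise ValueError("need at least one malicious and one benign sample")
    caught = sum(1 for m, b in zip(is_malicious, blocked) if m and b)
    false_alarms = sum(1 for m, b in zip(is_malicious, blocked) if not m and b)
    return caught / attacks, false_alarms / benign


if __name__ == "__main__":
    labels = [True, True, True, True, False, False, False, False, False]
    decisions = [True, True, True, False, True, False, False, False, False]
    ir, fpr = interception_metrics(labels, decisions)
    print(f"IR={ir:.2f} FPR={fpr:.2f}")
```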

Key Results

Benchmark	Metric	Baseline (%)	This Paper (%)	Δ (pp)
Judge performance on the custom bilingual benchmark across different model sizes and threat types.
Custom Bilingual Benchmark	Interception Rate (TC1/TC2)	10	93.0	+83.0
Custom Bilingual Benchmark	Interception Rate (Overall)	91.9	98.0	+6.1
Cascade architectures offer trade-offs between local execution and cloud-based accuracy.
Custom Bilingual Benchmark	Interception Rate (IR)	91.9	95.6	+3.7
Custom Bilingual Benchmark	False Positive Rate (FPR)	20.0	9.7	-10.3
Generalization to external benchmarks confirms the efficacy of the proposed judges.
InjecAgent	Interception Rate	0	100	+100

Main Takeaways

LLM-based judges significantly outperform lightweight NLI baselines for intent verification, making them necessary despite the cost
Malicious skill plugins (TC3) are the hardest threat to detect (75-94% IR), suggesting that L2 alone is insufficient and needs L1/L3 support
Cascade architectures (Small → Large) provide a viable path for balancing high security (95%+ IR) with acceptable latency and cost
The full four-layer stack adds little overhead (~18ms) outside of the judge inference time, which supports its feasibility for production use

📚 Prerequisite Knowledge

Prerequisites

Understanding of LLM tool use (function calling)
Basic cybersecurity concepts (sandboxing, zero-trust, prompt injection)
Familiarity with agent frameworks (e.g., AutoGen, LangChain)

Key Terms

LGA: Layered Governance Architecture—the four-layer defense framework proposed in this paper

prompt injection: An attack where a user inputs malicious instructions that override the model's original system instructions

RAG poisoning: Retrieval-Augmented Generation poisoning—inserting malicious data into a knowledge base so the agent retrieves and executes it

malicious skill plugins: Third-party extensions that perform unauthorized operations (e.g., data exfiltration) while executing legitimate functions, similar to supply-chain attacks

intent verification: The process of checking if a proposed tool call semantically aligns with the user's original stated intent

seccomp: Secure Computing Mode—a Linux kernel feature that restricts the system calls a process can make

HMAC-SHA256: Hash-Based Message Authentication Code using SHA-256—a cryptographic method to verify data integrity and authenticity

NLI: Natural Language Inference—a task determining if a hypothesis logically follows from a premise, used here as a baseline for intent verification

RAG: Retrieval-Augmented Generation—enhancing model outputs by retrieving relevant documents from a knowledge base

IR: Interception Rate—the percentage of malicious attacks successfully blocked by the defense

FPR: False Positive Rate—the percentage of benign/legitimate tool calls incorrectly blocked by the defense

TTL: Time To Live, a limit on how long (in time or number of hops) a packet or token remains valid

PPV: Positive Predictive Value—the probability that a positive result (flagged attack) is truly a malicious attack

unshare: A Linux command used to run a program with some namespaces unshared from the parent, creating isolation (basis for containers)