
Arbiter: Detecting Interference in LLM Agent System Prompts

Tony Mason
University of British Columbia, Georgia Institute of Technology
arXiv (2026)
Memory Agent Benchmark

📝 Paper Summary

Agentic AI, Memory Organization, Prompt Engineering
Arbiter treats system prompts as software artifacts, using formal rules and a multi-model LLM swarm to detect internal contradictions and memory failures that single models silently ignore.
Core Problem
System prompts for coding agents are complex software artifacts (up to 1,490 lines) that ship without test suites; when they contain internal contradictions, the LLM resolves them silently through probabilistic heuristics instead of raising an error.
Why it matters:
  • Silent resolution of contradictions makes agent behavior unpredictable and dependent on model weighting rather than explicit logic
  • The LLM that resolves a conflict cannot reliably audit its own instructions, because its 'judgment' smooths over the contradiction instead of reporting it
  • Monolithic prompts accumulate subsystem contradictions that are invisible to standard evaluations
Concrete Example: In Claude Code, a task management section mandates 'ALWAYS use TodoWrite', while the Commit workflow section simultaneously mandates 'NEVER use TodoWrite'. The model silently violates one instruction based on context weight, causing erratic behavior during commits without warning.
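A contradiction of this kind can be caught mechanically before any model sees the prompt. The sketch below is a minimal illustration, not Arbiter's implementation: the `DIRECTIVE` regex, the `find_conflicts` helper, and the section texts are all hypothetical stand-ins (the real Claude Code prompt is not reproduced here). It flags any tool that one section mandates and another prohibits.

```python
import re
from collections import defaultdict

# Hypothetical pattern: matches directives such as "ALWAYS use TodoWrite".
DIRECTIVE = re.compile(r"\b(ALWAYS|NEVER) use (\w+)")

def find_conflicts(sections):
    """Map each tool that is both mandated and prohibited to the
    sections holding the ALWAYS and NEVER directives."""
    seen = defaultdict(lambda: defaultdict(list))
    for name, text in sections.items():
        for polarity, tool in DIRECTIVE.findall(text):
            seen[tool][polarity].append(name)
    return {
        tool: {polarity: list(names) for polarity, names in by_polarity.items()}
        for tool, by_polarity in seen.items()
        if "ALWAYS" in by_polarity and "NEVER" in by_polarity
    }

# Illustrative stand-in text, not the real Claude Code prompt.
sections = {
    "Task management": "ALWAYS use TodoWrite to plan and track tasks.",
    "Commit workflow": "NEVER use TodoWrite while creating a commit.",
    "Search": "ALWAYS use Grep for content search.",
}

conflicts = find_conflicts(sections)
print(conflicts)
```

A literal pattern match like this only catches directives that share exact wording; paraphrased or scope-dependent conflicts would need the richer analysis the paper describes.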
Key Novelty
Arbiter: Hybrid Directed/Undirected Prompt Testing Framework
  • Treats system prompts as code by parsing them into an Abstract Syntax Tree (AST) to enable formal static analysis of scope overlaps and logic conflicts
  • Uses 'Undirected Scouring' where diverse LLMs sequentially explore the prompt, passing their findings to the next model to ensure coverage of new vulnerability classes
  • Establishes a taxonomy mapping software architectures (monolithic, flat, modular) to specific prompt failure modes (growth bugs, simplicity trade-offs, composition seams)
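The sequential hand-off in 'Undirected Scouring' can be pictured as a simple accumulation loop. This is a rough sketch under stated assumptions, not the paper's code: the `Scout` type, the `scour` function, and both stub scouts are hypothetical, and each stub stands in for what would be a call to a different LLM.

```python
from typing import Callable, List

# A scout takes the prompt plus the findings so far and returns additional
# findings. In practice it would wrap an LLM call; these are deterministic
# stubs (an assumption made purely for illustration).
Scout = Callable[[str, List[str]], List[str]]

def scour(prompt: str, scouts: List[Scout]) -> List[str]:
    """Run scouts in order, giving each one the findings gathered so far."""
    findings: List[str] = []
    for scout in scouts:
        for finding in scout(prompt, list(findings)):
            if finding not in findings:  # skip duplicates of earlier reports
                findings.append(finding)
    return findings

def contradiction_scout(prompt: str, prior: List[str]) -> List[str]:
    if "ALWAYS use TodoWrite" in prompt and "NEVER use TodoWrite" in prompt:
        return ["TodoWrite is both mandated and prohibited"]
    return []

def precedence_scout(prompt: str, prior: List[str]) -> List[str]:
    # Builds on an earlier finding instead of re-reporting it.
    if any("TodoWrite" in finding for finding in prior):
        return ["No precedence rule says which TodoWrite directive wins"]
    return []

prompt = "ALWAYS use TodoWrite. During commits, NEVER use TodoWrite."
report = scour(prompt, [contradiction_scout, precedence_scout])
print(report)
```

Because each scout sees earlier findings, the order of the chain matters: in this toy setup, running `precedence_scout` first yields nothing from it, since there is no prior finding for it to extend.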
Evaluation Highlights
  • Detected 152 findings across three major vendors (Claude Code, Codex CLI, Gemini CLI) and 21 hand-labeled interference patterns in Claude Code alone
  • Identified a critical 'structural data loss' bug in Gemini CLI's memory system, where compression schemas failed to include saved user preferences (independently confirmed by a Google patch)
  • Total analysis cost was $0.27 USD, demonstrating that comprehensive cross-vendor auditing is economically negligible compared to manual review
Breakthrough Assessment
9/10
Pioneering work treating prompts strictly as software artifacts. The discovery of a major memory data-loss bug in a production Google product validates the methodology. The taxonomy of prompt architecture failures is a significant theoretical contribution.