Arbiter: Detecting Interference in LLM Agent System Prompts

📝 Paper Summary

Agentic AI, Memory Organization, Prompt Engineering

Arbiter treats system prompts as software artifacts, using formal rules and a multi-model LLM swarm to detect internal contradictions and memory failures that single models silently ignore.

Core Problem

System prompts for coding agents are complex software artifacts (up to 1,490 lines) that ship without test suites; when they contain internal contradictions, the LLM resolves them silently through probabilistic heuristics instead of raising an error.

Why it matters:

Silent resolution of contradictions makes agent behavior unpredictable and dependent on model weighting rather than explicit logic
The LLM that resolves a conflict cannot reliably audit its own instructions, because its inherent 'judgment' smooths over the contradiction
Monolithic prompts accumulate subsystem contradictions that are invisible to standard evaluations

Concrete Example: In Claude Code, a task management section mandates 'ALWAYS use TodoWrite', while the Commit workflow section simultaneously mandates 'NEVER use TodoWrite'. The model silently violates one instruction based on context weight, causing erratic behavior during commits without warning.

Key Novelty

Arbiter: Hybrid Directed/Undirected Prompt Testing Framework

Treats system prompts as code by parsing them into an Abstract Syntax Tree (AST) to enable formal static analysis of scope overlaps and logic conflicts
Uses 'Undirected Scouring' where diverse LLMs sequentially explore the prompt, passing their findings to the next model to ensure coverage of new vulnerability classes
Establishes a taxonomy mapping software architectures (monolithic, flat, modular) to specific prompt failure modes (growth bugs, simplicity trade-offs, composition seams)

Evaluation Highlights

Detected 152 findings across three major vendors (Claude Code, Codex CLI, Gemini CLI) and 21 hand-labeled interference patterns in Claude Code alone
Identified a critical 'structural data loss' bug in Gemini CLI's memory system where compression schemas failed to include saved user preferences (independently confirmed by a Google patch)
Total analysis cost was $0.27 USD, showing that comprehensive cross-vendor auditing costs a negligible amount compared with manual review

Breakthrough Assessment

9/10

Pioneering work treating prompts strictly as software artifacts. The discovery of a major memory data-loss bug in a production Google product validates the methodology. The taxonomy of prompt architecture failures is a significant theoretical contribution.

⚙️ Technical Details

Problem Definition

Setting: Auditing natural language system prompts P for internal interference I (contradictions, ambiguities, dependencies)

Inputs: Raw system prompt text (e.g., 245–1,490 lines of Markdown/text)

Outputs: List of classified findings F, categorized by severity (curious to alarming) and failure type
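
One way to picture the output F is as a list of small records. The sketch below is illustrative only: the class names, field names, and example finding are assumptions, and the four severity levels are the ones named in the severity distribution metric (curious, notable, concerning, alarming).

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    # Ordered so that findings can be sorted from least to most severe.
    CURIOUS = 1
    NOTABLE = 2
    CONCERNING = 3
    ALARMING = 4


@dataclass(frozen=True)
class Finding:
    # All field names are hypothetical; the summary does not give a schema.
    failure_type: str      # e.g. "mandate-prohibition conflict"
    severity: Severity
    locations: tuple       # section titles involved in the interference
    note: str = ""


def severity_distribution(findings):
    """Count findings per severity level, including levels with zero hits."""
    counts = {level.name.lower(): 0 for level in Severity}
    for finding in findings:
        counts[finding.severity.name.lower()] += 1
    return counts


example = [
    Finding("mandate-prohibition conflict", Severity.CONCERNING,
            ("Task management", "Commit workflow")),
]
distribution = severity_distribution(example)
```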

Pipeline Flow

Prompt Parser (Decomposition to AST)
Directed Evaluator (Rule-based static analysis)
Undirected Scourer (Multi-model dynamic analysis)
Convergent Termination (Stop when 3 models find nothing new)

System Modules

Prompt Parser

Decomposes raw prompt text into typed nodes (Section, Directive, Metadata) and assigns semantic roles

Model or implementation: Python-based Two-Layer Parser (Deterministic)
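
The summary does not spell out the two layers of the parser, so the following is a minimal sketch under stated assumptions: a structural pass that splits Markdown-style `#` headings into Section nodes and other lines into Directive nodes, followed by a keyword-based role assignment. The `Node` class, the role names, and the keyword lists are hypothetical.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Node:
    kind: str                 # "Section" or "Directive"
    text: str
    role: str = "none"        # hypothetical semantic role
    children: list = field(default_factory=list)


# Assumed keyword heuristics; the real parser's rules are not given in the summary.
PROHIBITION = re.compile(r"\b(NEVER|MUST NOT|DO NOT)\b")
MANDATE = re.compile(r"\b(ALWAYS|MUST)\b")


def assign_role(text):
    """Second layer: tag a directive as a prohibition, a mandate, or guidance."""
    if PROHIBITION.search(text):      # checked first so "MUST NOT" is a prohibition
        return "prohibition"
    if MANDATE.search(text):
        return "mandate"
    return "guidance"


def parse_prompt(raw):
    """First layer: build a flat tree of sections, each holding its directives."""
    root = Node("Section", "<root>")
    current = root
    for line in raw.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            current = Node("Section", line.lstrip("#").strip())
            root.children.append(current)
        else:
            current.children.append(Node("Directive", line, assign_role(line)))
    return root


tree = parse_prompt(
    "# Task management\nALWAYS use TodoWrite.\n\n"
    "# Commit workflow\nNEVER use TodoWrite.\nKeep messages short."
)
```

Because the sketch is deterministic, parsing the same prompt twice yields the same tree, which is the property that makes rule-based checks over the nodes repeatable.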

Directed Evaluator

Checks block pairs against formal interference rules (e.g., priority markers, verbatim duplication)

Model or implementation: Python predicates + Template-based LLM calls
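
As an illustration of what a pairwise rule might look like, the sketch below implements two simple predicates: a mandate-prohibition conflict check and a verbatim duplication check. It is not the paper's rule set; the `Directive` class, the regular expression, and the function names are assumptions, and the sample directives paraphrase the TodoWrite example from the summary.

```python
import re
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Directive:
    section: str
    text: str


# Assumed pattern: only catches the literal form "ALWAYS use X" / "NEVER use X".
MODALITY = re.compile(r"\b(ALWAYS|NEVER) use (\w+)")


def modality_and_target(directive):
    match = MODALITY.search(directive.text)
    return (match.group(1), match.group(2)) if match else None


def mandate_prohibition_conflicts(directives):
    """Flag pairs where one block mandates a tool that another block prohibits."""
    findings = []
    for a, b in combinations(directives, 2):
        ma, mb = modality_and_target(a), modality_and_target(b)
        if ma and mb and ma[1] == mb[1] and ma[0] != mb[0]:
            findings.append((a.section, b.section, ma[1]))
    return findings


def verbatim_duplicates(directives):
    """Flag directives repeated word for word in a different section."""
    first_seen = {}
    duplicates = []
    for d in directives:
        key = " ".join(d.text.split()).lower()
        if key in first_seen and first_seen[key] != d.section:
            duplicates.append((first_seen[key], d.section, d.text))
        first_seen.setdefault(key, d.section)
    return duplicates


directives = [
    Directive("Task management", "ALWAYS use TodoWrite to track tasks."),
    Directive("Commit workflow", "NEVER use TodoWrite while committing."),
    Directive("Tone", "Be concise."),
    Directive("Output", "Be concise."),
]
conflicts = mandate_prohibition_conflicts(directives)
duplicates = verbatim_duplicates(directives)
```

Pure predicates like these are cheap and repeatable, but they only fire on patterns someone has already anticipated.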

Undirected Scourer

Explores the prompt for unknown vulnerability classes using a swarm of diverse LLMs

Model or implementation: Chain of: Claude Opus 4.6, Gemini 2.0 Flash, Kimi K2.5, DeepSeek V3.2, etc.
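
The control flow of scouring can be sketched without any API calls. In the sketch below each "model" is a stub callable standing in for an LLM, the running list of findings plays the role of the shared map, and the loop stops once three consecutive models report nothing new. The function names, the `patience` and `max_passes` parameters, and the reading of the stopping rule as "three consecutive quiet passes" are assumptions.

```python
from itertools import cycle


def scour(prompt, models, patience=3, max_passes=50):
    """Run models in turn, passing each one the findings accumulated so far."""
    found = []      # the shared "map" handed to every model
    quiet = 0       # consecutive passes that added nothing new
    passes = 0
    for model in cycle(models):
        if quiet >= patience or passes >= max_passes:
            break
        passes += 1
        new = []
        for finding in model(prompt, tuple(found)):
            if finding not in found and finding not in new:
                new.append(finding)
        found.extend(new)
        quiet = 0 if new else quiet + 1
    return found, passes


def make_stub(candidates):
    """Stub standing in for an LLM: reports its first finding not already on the map."""
    def model(prompt, seen):
        for candidate in candidates:
            if candidate not in seen:
                return [candidate]
        return []
    return model


models = [make_stub(["a", "b"]), make_stub(["b", "c"])]
findings, passes = scour("system prompt text", models)
```

With real models, `model(prompt, seen)` would wrap a chat-completion call whose instructions include the serialized map; that wrapper is deliberately left out here.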

Novel Architectural Elements

Map-passing composition in scouring: Each LLM receives the 'map' of what previous models found, forcing it to explore new territory
AST-based differ for prompts: Uses structural hashes to track prompt evolution and clone detection across versions without line-diff noise
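
To show why structural hashes avoid line-diff noise, here is a minimal sketch that hashes each directive after whitespace and case normalization, then compares two prompt versions by hash rather than by line. It assumes a prompt has already been parsed into a mapping from section title to directive strings; the function names and the normalization rule are hypothetical, not the paper's.

```python
import hashlib


def block_hash(text):
    """Hash a directive after collapsing whitespace and lowercasing."""
    normalized = " ".join(text.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:12]


def index_blocks(sections):
    # Maps hash -> (section title, text). If a block is cloned, the last copy wins.
    return {block_hash(d): (title, d)
            for title, directives in sections.items() for d in directives}


def diff_versions(old, new):
    """Return (added, removed, moved) blocks between two parsed prompt versions."""
    old_idx, new_idx = index_blocks(old), index_blocks(new)
    added = [new_idx[h] for h in new_idx if h not in old_idx]
    removed = [old_idx[h] for h in old_idx if h not in new_idx]
    moved = [(old_idx[h][0], new_idx[h][0], new_idx[h][1])
             for h in new_idx if h in old_idx and old_idx[h][0] != new_idx[h][0]]
    return added, removed, moved


def find_clones(sections):
    """Return hashes that occur in more than one place, with the sections involved."""
    by_hash = {}
    for title, directives in sections.items():
        for d in directives:
            by_hash.setdefault(block_hash(d), []).append(title)
    return {h: titles for h, titles in by_hash.items() if len(titles) > 1}


old = {"Tools": ["Use rg for search."], "Git": ["Never force push."]}
new = {"Tools": ["use rg   for search."], "Safety": ["Never force push."],
       "Git": ["Sign commits."]}
added, removed, moved = diff_versions(old, new)
```

In this example the reworded-whitespace directive in "Tools" produces no diff entry at all, while the directive relocated from "Git" to "Safety" is reported as a move rather than as one deletion plus one addition.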

Modeling

Base Model: Multi-model swarm used for scouring (Claude Opus 4.6, Gemini 2.0 Flash, DeepSeek V3.2, etc.)

Training Method: Inference-only analysis (No training involved)

Compute: Total cost $0.27 USD via OpenRouter API for analyzing 3 vendors. Claude Code required 10 passes to converge; smaller prompts required 2-3 passes.

Comparison to Prior Work

vs. Gloague et al.: Arbiter analyzes the structural conditions that produce errors, while Gloague et al. measure runtime impact.
vs. Prompt Injection: Arbiter finds internal contradictions (logic bugs), while injection research focuses on external attacks.
vs. Unit Testing [not cited in paper]: Traditional software testing checks code logic; Arbiter adapts this to natural language instructions which lack compilers.

Limitations

Directed analysis is limited to the set of pre-defined rules; it cannot find novel bug classes on its own
Scouring relies on the capability of the LLMs used; weaker models may miss subtle logical inconsistencies
Analysis is static/structural; it does not measure the runtime probability of a specific contradiction causing a failure in practice

Reproducibility

The methodology is fully described. The prompts analyzed come from public sources (the npm package for Claude Code, open-source repositories for Codex and Gemini). The Arbiter code repository is not provided in the text. Scouring used specific model versions (e.g., Claude Opus 4.6, Grok 4.1) relevant to the paper's 2026 timeframe.

📊 Experiments & Results

Evaluation Setup

Cross-vendor analysis of production system prompts for coding agents

Benchmarks:

Claude Code v2.1.50 System Prompt (Monolithic Coding Agent)
Codex CLI System Prompt (Flat/Simple Coding Tool)
Gemini CLI System Prompt (Modular Coding Agent) [New]

Metrics:

Count of interference patterns
Severity distribution (curious/notable/concerning/alarming)
Passes to convergence
Statistical methodology: Not explicitly reported in the paper

Key Results

Results show the scale of findings across directed and undirected phases.

Benchmark	Metric	Baseline	This Paper	Δ
All Vendors	Total Findings	0	152	+152
Claude Code	Hand-labeled Patterns	0	21	+21
Cross-vendor Analysis	Total API Cost (USD)	Not reported in the paper	0.27	Not reported in the paper

Main Takeaways

Prompt architecture correlates with failure mode: Monoliths (Claude) have growth/boundary bugs; Flat prompts (Codex) have capability trade-offs; Modular prompts (Gemini) have integration bugs at seams.
Multi-model evaluation is critical: 95% of Claude Code's patterns were statically detectable, but the 'plan-mode dead zone' required semantic reasoning and was found only by scouring.
Scouring discovered a 'structural data loss' vulnerability in Gemini CLI where memory compression deleted user preferences—a bug class invisible to directed rules because the contract between modules was never written.
Convergence is logarithmic: Larger prompts (Claude) required 10 passes to converge; smaller ones (Codex/Gemini) required 2-3 passes.
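
The 'structural data loss' finding can be pictured as a missing contract between two modules: one writes a field, the other compresses state using a schema that never mentions it. The sketch below makes that contract explicit as a set difference. The field names and the dictionary-based state are hypothetical and are not Gemini CLI's actual schema.

```python
def dropped_fields(written_fields, compression_schema):
    """Fields some module writes that the compression schema would not preserve."""
    return sorted(set(written_fields) - set(compression_schema))


def compress(state, compression_schema):
    """Keep only the fields the schema knows about; everything else is lost."""
    return {key: value for key, value in state.items() if key in compression_schema}


# Hypothetical field names, chosen only to mirror the bug described above.
state = {
    "task_summary": "refactor the parser",
    "open_files": ["parser.py"],
    "saved_preferences": ["prefer tabs"],
}
schema = ["task_summary", "open_files"]

lost = dropped_fields(state.keys(), schema)
compressed = compress(state, schema)
```

A directed rule could run `dropped_fields` only if the list of written fields were declared somewhere; in the reported case that contract was never written down, which is why the bug surfaced through scouring instead.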

📚 Prerequisite Knowledge

Prerequisites

Understanding of LLM System Prompts (instructions governing agent behavior)
Software Engineering concepts (Linting, Static Analysis, Regression Testing)
Basic knowledge of LLM agent architectures (tools, memory, planning)

Key Terms

System Prompt: The 'constitution' or initial set of instructions that defines an AI agent's behavior, tools, and constraints

Undirected Scouring: A multi-model evaluation process where different LLMs critique a prompt sequentially, each building on the previous model's findings to discover unknown failure modes

Directed Evaluation: Analysis using formal rules (predicates) to check for specific, known failure patterns like 'mandate-prohibition conflicts' or 'scope overlap'

AST: Abstract Syntax Tree—a hierarchical representation of the prompt's structure (sections, directives, lists) used for static analysis

Composition Seams: The boundaries between modular components (e.g., different sub-prompts) where integration bugs often occur due to undefined contracts

Monolithic Prompt: A single, large document containing all instructions (e.g., Claude Code), prone to contradictions as new features are added

Modular Prompt: A prompt assembled at runtime from smaller, independent pieces (e.g., Gemini CLI), prone to bugs in the interaction between modules

Feature Flags: Switches that toggle specific sections of the prompt on or off at runtime
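
Feature flags matter for auditing because a contradiction may exist only in some assembled variants of a prompt. As a rough sketch, assuming a prompt is built by concatenating flagged modules, one can enumerate flag combinations and check each assembled text. The module names, module texts, and the string-matching check are all hypothetical.

```python
from itertools import product

# Hypothetical modules; each is included only when its flag is on.
MODULES = {
    "core": "ALWAYS use TodoWrite for task tracking.",
    "commit": "NEVER use TodoWrite during commits.",
    "memory": "Save user preferences when asked.",
}


def assemble(flags):
    """Build the runtime prompt from the modules whose flags are enabled."""
    return "\n".join(text for name, text in MODULES.items() if flags.get(name))


def flag_combinations(names):
    """Yield every on/off assignment for the given flag names."""
    for values in product([False, True], repeat=len(names)):
        yield dict(zip(names, values))


def conflicting_combinations():
    """Return the flag settings under which both conflicting directives co-occur."""
    hits = []
    for flags in flag_combinations(list(MODULES)):
        prompt = assemble(flags)
        if "ALWAYS use TodoWrite" in prompt and "NEVER use TodoWrite" in prompt:
            hits.append(flags)
    return hits


hits = conflicting_combinations()
```

Exhaustive enumeration doubles with every flag, so this only works for a handful of flags; larger flag sets would need sampling or pairwise coverage.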