TrajAD: Trajectory Anomaly Detection for Trustworthy LLM Agents

📝 Paper Summary

Agentic AI Safety & Robustness

TrajAD is a generative verifier that detects and precisely localizes anomalies in agent execution trajectories, enabling efficient rollback-and-retry recovery instead of full restarts.

Core Problem

Current agent safety measures focus on static input/output filtering or capability enhancement, failing to detect runtime process anomalies like infinite loops, redundant actions, or intermediate reasoning errors.

Why it matters:

Unverified intermediate steps can trigger irreversible state changes (e.g., database corruption) even if the final output seems plausible
Blindly restarting failed tasks wastes computational resources; agents need to know exactly *where* they failed in order to roll back efficiently
General-purpose LLMs struggle to distinguish between complex reasoning and redundant loops without specific process supervision

Concrete Example: An agent might enter an infinite loop or execute redundant actions that are locally plausible but globally inefficient. Standard outcome-based evaluations miss this if the task eventually completes, while static guardrails fail to catch the temporal inefficiency.

Key Novelty

Generative Trajectory Verifier with Step-Level Localization

Formulates anomaly detection as a conditional generation task where the model outputs both a verdict (Normal/Anomaly) and the exact index of the first error step
Synthesizes a large-scale dataset (TrajBench) using a 'Perturb-and-Complete' strategy to create paired normal/anomalous trajectories with precise ground-truth error labels
Enables a 'Check-and-Act' runtime monitor that interrupts faulty execution and triggers targeted rollbacks

Architecture

The TrajAD framework workflow, illustrating the runtime monitoring process.

Evaluation Highlights

Outperforms strong zero-shot baselines (Qwen3-8B) by 11.38 percentage points in Macro-F1 for anomaly detection (70.43 to 81.81)
Improves Joint Exact Match (JEM) for error localization by 48.21 percentage points over baselines (5.54 to 53.75), which often fail to pinpoint the error step
Demonstrates strong transferability to unseen domains (e.g., Embodied AI), improving Macro-F1 from 70.89% (zero-shot) to 83.09%

Breakthrough Assessment

7/10

Significant step forward in process supervision for agents. The shift from outcome-based to process-based verification with precise localization is critical for practical deployment.

⚙️ Technical Details

Problem Definition

Setting: Supervised auditing task mapping an execution trajectory T to a tuple (c, l)

Inputs: Trajectory T consisting of a sequence of (thought, action, observation) triplets

Outputs: Verdict c ∈ {Normal, Anomaly} and First Error Step index l ∈ {1, ..., n}
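
The input and output types above can be written down directly. The sketch below is illustrative only: the class names, the `parse_report` helper, and the "Verdict: ... / First Error Step: ..." text format are all hypothetical, since the summary does not specify how the generated report is serialized. Returning no index for a Normal verdict is likewise an assumption.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Step:
    thought: str
    action: str
    observation: str


Trajectory = List[Step]


@dataclass
class Report:
    verdict: str                      # "Normal" or "Anomaly"
    first_error_step: Optional[int]   # 1-based index; None when Normal (assumption)


def parse_report(text: str, n_steps: int) -> Report:
    """Parse a hypothetical 'Verdict: ... / First Error Step: ...' report."""
    verdict_match = re.search(r"Verdict:\s*(Normal|Anomaly)", text)
    if verdict_match is None:
        raise ValueError("no verdict found in verifier output")
    verdict = verdict_match.group(1)
    if verdict == "Normal":
        return Report(verdict, None)
    step_match = re.search(r"First Error Step:\s*(\d+)", text)
    if step_match is None:
        raise ValueError("anomaly reported without a step index")
    index = int(step_match.group(1))
    if not 1 <= index <= n_steps:
        # The index must fall inside the trajectory, matching l in {1, ..., n}.
        raise ValueError(f"step index {index} outside 1..{n_steps}")
    return Report(verdict, index)
```

Validating the index against the trajectory length guards against a generative verifier emitting a step number that does not exist.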

Pipeline Flow

Agent Execution Step
Trajectory History Accumulation
TrajAD Verification (Generative)
Decision (Continue or Rollback)
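
The four stages above can be sketched as a small monitoring loop. This is a minimal illustration under stated assumptions, not the paper's implementation: `next_step`, `verify`, `max_steps`, and `max_rollbacks` are hypothetical names, the verifier is any callable returning a (verdict, 1-based index) pair, and rollback here only truncates the recorded history.

```python
from typing import Callable, List, Optional, Tuple

Step = Tuple[str, str, str]  # (thought, action, observation)
# Hypothetical verifier interface: returns (verdict, first_error_step), index 1-based.
Verifier = Callable[[List[Step]], Tuple[str, Optional[int]]]


def run_with_monitor(
    next_step: Callable[[List[Step]], Optional[Step]],
    verify: Verifier,
    max_steps: int = 20,
    max_rollbacks: int = 3,
) -> List[Step]:
    """Run an agent step by step, rolling back to just before the first flagged step."""
    history: List[Step] = []
    rollbacks = 0
    executed = 0
    while executed < max_steps:
        step = next_step(history)          # 1. agent execution step
        if step is None:                   # agent signals task completion
            break
        history.append(step)               # 2. trajectory history accumulation
        executed += 1
        verdict, first_error = verify(history)  # 3. verification
        if verdict == "Anomaly" and first_error is not None:  # 4. decision
            if rollbacks >= max_rollbacks:
                raise RuntimeError("rollback budget exhausted")
            # Keep only the steps before the first error, then retry from there.
            history = history[: first_error - 1]
            rollbacks += 1
    return history
```

In a real deployment the environment state would also have to be restored to match the truncated history; the sketch leaves that out.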

System Modules

Agent

Executes task steps (Thought → Action → Observation)

Model or implementation: Standard LLM Agent (e.g., Qwen/GPT)

TrajAD Verifier

Audits the current trajectory to detect anomalies and localize errors

Model or implementation: Qwen3-4B-Instruct fine-tuned with LoRA

Controller

Interrupts execution if anomaly detected and triggers rollback

Model or implementation: Rule-based logic

Novel Architectural Elements

Integration of a generative verifier that outputs structured diagnostic reports (verdict + index) directly into the agent's execution loop for runtime monitoring

Modeling

Base Model: Qwen3-4B

Training Method: Supervised Fine-Tuning (SFT) on TrajBench

Objective Functions:

Purpose: Minimize the negative log-likelihood of the target diagnostic report (verdict and location).

Formally: Standard autoregressive language modeling loss.
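
As a rough illustration of this objective, the sketch below computes the negative log-likelihood over the report tokens only. The function name and inputs are hypothetical; masking out the trajectory prompt and averaging per token are assumptions of this sketch rather than details stated in the summary.

```python
import math
from typing import Sequence


def report_nll(token_probs: Sequence[float], is_target: Sequence[bool]) -> float:
    """Mean negative log-likelihood over the tokens of the diagnostic report.

    token_probs[i] is the probability the model assigned to the i-th ground-truth
    token; is_target[i] marks tokens that belong to the report rather than the
    trajectory prompt (masking the prompt is an assumption of this sketch).
    """
    losses = [-math.log(p) for p, keep in zip(token_probs, is_target) if keep]
    if not losses:
        raise ValueError("no target tokens to score")
    return sum(losses) / len(losses)
```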

Adaptation: LoRA (rank=8, alpha=16) on all linear layers

Trainable Parameters: 1.8% of total parameters
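
To give a feel for why the trainable fraction is small, the sketch below counts the parameters that standard LoRA adds to a single linear layer and computes the alpha/rank scaling for the reported settings. The 2048 x 2048 layer shape is purely illustrative and is not taken from the Qwen3-4B architecture, so the printed ratio is not expected to reproduce the 1.8% figure.

```python
def lora_added_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters LoRA adds to one linear layer: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)


def lora_scaling(alpha: float, rank: int) -> float:
    """Factor applied to the low-rank update B @ A in the standard LoRA formulation."""
    return alpha / rank


# Hypothetical 2048 x 2048 projection, chosen only for illustration.
full = 2048 * 2048
added = lora_added_params(2048, 2048, rank=8)
print(added, f"{added / full:.2%}", lora_scaling(16, 8))
```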

Training Data:

TrajBench dataset (63,484 samples)
Balanced 1:1 ratio of Normal to Anomalous trajectories
13 tasks across 5 domains (Reasoning, Math, Coding, Web, Embodied AI)
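
The paired normal/anomalous samples come from the Perturb-and-Complete strategy. A minimal sketch of one way such a pair could be built follows; `perturb` and `complete` are hypothetical callables standing in for the error-injection rule and the LLM that continues the trajectory, and the sample dictionary layout is an assumption.

```python
import random
from typing import Callable, Dict, List, Tuple

Step = Tuple[str, str, str]  # (thought, action, observation)


def perturb_and_complete(
    golden: List[Step],
    perturb: Callable[[Step], Step],
    complete: Callable[[List[Step]], List[Step]],
    rng: random.Random,
) -> Tuple[Dict, Dict]:
    """Build one (normal, anomalous) pair from a golden trajectory."""
    k = rng.randrange(len(golden))  # 0-based position of the injected error
    prefix = golden[:k] + [perturb(golden[k])]
    anomalous = prefix + complete(prefix)  # an LLM would continue from the error
    normal_sample = {"trajectory": golden, "verdict": "Normal", "first_error_step": None}
    anomalous_sample = {
        "trajectory": anomalous,
        "verdict": "Anomaly",
        "first_error_step": k + 1,  # 1-based, matching l in {1, ..., n}
    }
    return normal_sample, anomalous_sample
```

Because the injection point is chosen by the synthesis code, the ground-truth first error step is known exactly, and emitting one anomalous sample per golden trajectory keeps the classes balanced 1:1.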

Key Hyperparameters:

learning_rate: 2e-5
optimizer: Paged AdamW 8-bit
warmup: 10%
lora_rank: 8
lora_alpha: 16

Compute: Single NVIDIA A100 (80GB) GPU

Comparison to Prior Work

vs. PRM: TrajAD monitors individual task instances at runtime rather than only supplying a signal for optimizing policy parameters, and it provides explicit error localization.
vs. Safety Guardrails: TrajAD has temporal awareness to detect global anomalies like loops or inefficiency, whereas guardrails typically check atomic actions in isolation.
vs. LLM-as-a-Judge: TrajAD is fine-tuned on specific anomaly data, enabling it to detect subtle procedural errors that zero-shot judges miss.

Limitations

Localization performance in cross-domain transfer (OOD) lags behind fully supervised settings.
Requires a high-quality seed dataset (golden trajectories) to synthesize negative samples.
Inference cost increases as the trajectory length grows due to full context processing.
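
The last limitation can be made concrete with a back-of-the-envelope count. The sketch assumes the verifier is invoked after every step, re-reads the whole history each time with no prefix caching, and that every step has the same token length; none of these assumptions is stated in the summary, and the function name is hypothetical.

```python
def total_verifier_tokens(n_steps: int, tokens_per_step: int) -> int:
    """Tokens read by the verifier if it re-reads the full history after every step."""
    return sum(k * tokens_per_step for k in range(1, n_steps + 1))


# Doubling the trajectory length roughly quadruples the tokens processed.
short_run = total_verifier_tokens(10, 200)
long_run = total_verifier_tokens(20, 200)
print(short_run, long_run)
```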

Reproducibility

Data synthesis pipeline and evaluation metrics are described in detail. Code availability is not explicitly provided in the paper text. TrajBench construction uses AgentBank as a seed.

📊 Experiments & Results

Evaluation Setup

Trajectory auditing on TrajBench. Models must predict anomaly status and precise error step.

Benchmarks:

TrajBench (Trajectory Anomaly Detection & Localization) [New]

Metrics:

Precision
Recall
Macro-F1
Joint Exact Match (JEM)
Statistical methodology: Not explicitly reported in the paper
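
For readers who want to see how these metrics behave, here is a small sketch. Macro-F1 is the unweighted mean of the per-class F1 scores. The joint check compares only the (verdict, index) tuple over all samples, which is a simplifying assumption and may differ from how the paper computes JEM. All function names are hypothetical.

```python
from typing import List, Optional, Tuple

Label = Tuple[str, Optional[int]]  # (verdict, first_error_step)


def f1_for(cls: str, gold: List[str], pred: List[str]) -> float:
    """F1 score for one class, computed from true/false positives and negatives."""
    tp = sum(g == cls and p == cls for g, p in zip(gold, pred))
    fp = sum(g != cls and p == cls for g, p in zip(gold, pred))
    fn = sum(g == cls and p != cls for g, p in zip(gold, pred))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def macro_f1(gold: List[str], pred: List[str]) -> float:
    """Unweighted mean of the per-class F1 scores for Normal and Anomaly."""
    return (f1_for("Normal", gold, pred) + f1_for("Anomaly", gold, pred)) / 2


def joint_exact_match(gold: List[Label], pred: List[Label]) -> float:
    """Share of samples whose verdict and step index both match (simplified)."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)
```

Under this simplified reading, a model can score well on Macro-F1 while scoring poorly on the joint metric, because a correct Anomaly verdict with the wrong step index still counts as a miss.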

Key Results

In-Distribution (ID) performance comparison shows TrajAD significantly outperforming general-purpose zero-shot baselines, particularly in localization.

Benchmark	Metric	Baseline (%)	This Paper (%)	Δ (points)
TrajBench	Macro-F1	70.43	81.81	+11.38
TrajBench	Joint Exact Match (JEM)	5.54	53.75	+48.21
TrajBench	Recall	28.46	88.16	+59.70

Out-of-Distribution (OOD) transfer experiments where Embodied AI is held out during training.

Benchmark	Metric	Baseline (%)	This Paper (%)	Δ (points)
TrajBench (Target: Embodied AI)	Macro-F1	70.89	83.09	+12.20
TrajBench (Target: Embodied AI)	Joint Exact Match (JEM)	11.48	38.25	+26.77

Experiment Figures

Radar charts comparing TrajAD against baselines across five specific domains (Math, Reasoning, Coding, Web, Embodied AI).

Ablation study on cross-domain transfer (a) and data scaling (b).

Main Takeaways

General-purpose LLMs exhibit a conservative bias: they tend to assume agent actions are valid and miss subtle anomalies, producing many false negatives.
Specialized fine-tuning (TrajAD) is essential for localization; zero-shot models fail to pinpoint error steps even when they suspect an anomaly.
The method demonstrates strong generalization to unseen domains for detection, though precise localization is more sensitive to domain shifts.
Performance improves as the training set grows, which supports the quality of the synthesized TrajBench dataset.

📚 Prerequisite Knowledge

Prerequisites

Language Model Agents (Reasoning, Tool Use)
Process Reward Models / Process Supervision
Supervised Fine-Tuning (SFT)
LoRA (Low-Rank Adaptation)

Key Terms

Perturb-and-Complete: A data synthesis strategy where a valid trajectory is interrupted at a specific step, a perturbation (error) is injected, and an LLM completes the trajectory to create a negative sample

Rollback-and-Retry: A recovery mechanism where an agent reverts to a state prior to a detected error and attempts to proceed again, saving resources compared to a full restart

Trajectory Anomaly Detection: The task of auditing an agent's entire execution history to classify it as normal or anomalous and localize the specific step where the error occurred

Joint Exact Match (JEM): A strict metric requiring the model to correctly predict both the error step index and the semantic content of the error explanation

Process Reward Models (PRM): Models trained to score intermediate steps of reasoning rather than just the final outcome

LoRA: Low-Rank Adaptation, a parameter-efficient fine-tuning technique that freezes the base model weights and trains only small low-rank update matrices added to selected layers