Temporal Attack Pattern Detection in Multi-Agent AI Workflows: An Open Framework for Training Trace-Based Security Models

📝 Paper Summary

Agentic AI Security Trace-based Analysis

This paper introduces a reproducible framework for training small language models to detect malicious multi-step agent behaviors by fine-tuning on synthetic OpenTelemetry traces and curated cybersecurity datasets.

Core Problem

Existing AI safety mechanisms focus on single-turn text generation (like prompt injection) but fail to detect multi-step attack patterns that emerge across agent workflows, such as stealth privilege escalation or multi-agent coordination attacks.

Why it matters:

Commercial vendors use trace-based monitoring but keep their methodologies closed, preventing practitioners from building custom security models adapted to specific threat landscapes
Benign individual actions (e.g., 'list directory') can be malicious in aggregate (e.g., reconnaissance), requiring temporal context that single-prompt safety filters miss
Current benchmarks focus on harmful task completion rather than the trace-based behavioral analysis needed for operational security monitoring

Concrete Example: A workflow like 'read_file(/etc/passwd) → http_request(attacker.com)' represents data exfiltration. While 'read_file' might be benign in isolation, the sequence reveals malicious intent. Standard safety filters examining only the 'read_file' prompt would miss the broader attack context.

Key Novelty

Trace-Based Temporal Pattern Detection via Fine-Tuning

Treats security monitoring as a sequence modeling problem by fine-tuning LLMs on OpenTelemetry traces, allowing the model to analyze timestamps, agent IDs, and tool outputs collectively
Generates synthetic 'attack traces' using templates to simulate complex multi-agent scenarios (coordination attacks, regulatory violations) that are scarce in public datasets

Architecture

Conceptual flow of the security monitoring pipeline: Raw OpenTelemetry Traces → Trace Parser → Prompt Construction → Fine-Tuned LLM → Security Verdict

Evaluation Highlights

+31.4% accuracy improvement (42.86% → 74.29%) on a custom cybersecurity benchmark after iterative fine-tuning
Achieved statistically significant gains (p < 0.001) using only 0.148 epochs of training on ARM64 hardware
Demonstrated that adding just 30 targeted adversarial examples yielded a +7.2 point gain in the final refinement stage

Breakthrough Assessment

7/10

First open methodology for trace-based agentic security with strong educational value. However, the model suffers from severe false positives (66.7%) in practice, limiting autonomous deployment.

⚙️ Technical Details

Problem Definition

Setting: Binary classification of agentic workflow traces as BENIGN or MALICIOUS (with reasoning)

Inputs: OpenTelemetry-formatted workflow traces containing timestamps, agent identifiers, tool invocations, parameters, and status codes

Outputs: Classification label (BENIGN/SUSPICIOUS/MALICIOUS) and natural language reasoning

Pipeline Flow

Trace Collection (OpenTelemetry logs)
Prompt Formatting (Standardized BENIGN/SUSPICIOUS/MALICIOUS prompt)
Inference (Foundation-Sec-8B model)
Classification Output (Label + Reasoning)

System Modules

Trace Analyzer

Analyze the sequential log data to identify potential attack patterns

Model or implementation: Foundation-Sec-1.1-8B-Instruct (Llama 3.1 based) with QLoRA adapters

Novel Architectural Elements

Integration of OpenTelemetry trace structures directly into the LLM training corpus to treat operational logs as natural language sequences

Modeling

Base Model: Foundation-Sec-1.1-8B-Instruct (Llama 3.1, 8.03B params)

Training Method: Supervised Fine-Tuning (SFT) via QLoRA

Adaptation: QLoRA (4-bit NF4 quantization, rank 16, alpha 32)

Trainable Parameters: LoRA adapters (exact count not reported, but rank 16 applied)

Training Data:

Total: 80,851 examples
Synthetic Traces: 35,026 (Claude Sonnet 4.5 generated)
Public Datasets: 45,825 (from 18 sources including AgentHarm, Agent-SafetyBench)
Continuation V3: +111 OWASP examples
Continuation V4: +30 adversarial examples

Key Hyperparameters:

learning_rate: 2e-4 (V2), 1e-4 (V3/V4)
batch_size: 8
optimizer: AdamW 8-bit
+ 4 more
precision: BF16
quantization: 4-bit NF4
lora_rank: 16
steps: 1500 (V2), 500 (V3), 500 (V4)

Compute: NVIDIA DGX Spark (Blackwell ARM64, 128GB memory). V2 training took 6h 43m. V3/V4 took 30m each.

Comparison to Prior Work

vs. SIEM: Uses LLM for semantic reasoning over logs rather than regex/rules
vs. SafetyBench: Evaluates dynamic multi-step traces rather than static input/output pairs
vs. General LLM Safety: Focuses on temporal aggregation of benign actions into malicious patterns
+ 1 more
vs. Llama Guard [not cited in paper]: Llama Guard classifies single turns; this framework aggregates context across timestamps and agents

Limitations

High False Positive Rate (66.7%) on benign workflows due to attack-skewed training data
Zero Specificity (0% TNR) in practical trace validation; model tends to flag everything as suspicious/malicious
Limited validation sample size (30 traces) for the practical evaluation
Requires human-in-the-loop; not suitable for automated blocking

Reproducibility

Highly reproducible. Datasets (training_data_v3_synthetic.jsonl, etc.), training scripts, and benchmarks are released on HuggingFace. The paper documents ARM64-specific workarounds. Synthetic generation templates are described.

📊 Experiments & Results

Evaluation Setup

Dual evaluation: (1) Knowledge retention via Multiple Choice QA, (2) Practical classification of OpenTelemetry traces

Benchmarks:

MMLU Computer Security (Knowledge QA)
Custom Agentic Security Benchmark (Knowledge QA (OWASP/Microsoft Taxonomy)) [New]
Practical Trace Validation (Binary Classification of Logs) [New]

Metrics:

Accuracy
True Positive Rate (Sensitivity)
True Negative Rate (Specificity)
False Positive Rate
Statistical methodology: McNemar’s test for significance (p < 0.001), Cohen’s h for effect size, 95% Confidence Intervals reported

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Iterative fine-tuning led to significant improvements in knowledge benchmarks, particularly for agentic security concepts.
Custom Agentic Security Benchmark	Accuracy	42.86	74.29	+31.43
Custom Agentic Security Benchmark	Accuracy	61.4	67.1	+5.7
Custom Agentic Security Benchmark	Accuracy	67.1	74.3	+7.2
Practical evaluation on real traces revealed severe over-sensitivity despite high benchmark accuracy.
Practical Trace Validation (30 traces)	True Positive Rate (Sensitivity)	Not reported in the paper	60.0	Not reported in the paper
Practical Trace Validation (30 traces)	True Negative Rate (Specificity)	Not reported in the paper	0.0	Not reported in the paper

Main Takeaways

Dataset composition determines behavior: A 90% attack-focused dataset resulted in a hyper-aggressive model with 66.7% False Positive Rate, proving that indiscriminate scaling of malicious examples harms practical utility
Targeted iterative refinement works: Adding just 30 adversarial examples (V4) yielded greater accuracy gains (+7.2) than larger initial batches, validating a 'knowledge gap' driven approach
Prompt engineering cannot fix training bias: Enhanced prompting with few-shot examples failed to reduce false positives (stayed at 66.7%), indicating that learned representations from imbalanced data persist regardless of inference instructions
Strong disconnect between MCQA and Practice: The model achieved high scores on multiple-choice benchmarks (74.29%) while failing to correctly classify any benign workflows in practice (0% specificity)

📚 Prerequisite Knowledge

Prerequisites

Understanding of Large Language Models (LLMs) and fine-tuning
Familiarity with OpenTelemetry and system logging
Basic cybersecurity concepts (OWASP Top 10, MITRE ATT&CK)

Key Terms

OpenTelemetry: An open-source observability framework for generating and collecting telemetry data (logs, metrics, traces) from software

QLoRA: Quantized Low-Rank Adaptation—a technique to fine-tune large models efficiently by freezing most parameters and training small adapters in low precision

FPR: False Positive Rate—the percentage of benign (safe) items incorrectly flagged as malicious

TPR: True Positive Rate (Sensitivity)—the percentage of malicious items correctly identified as malicious

TNR: True Negative Rate (Specificity)—the percentage of benign items correctly identified as safe

Agentic AI: AI systems capable of autonomous planning, reasoning, and tool use to achieve high-level goals

Trace-based analysis: Security monitoring that looks at the chronological sequence of system events (traces) rather than just isolated inputs

RAG: Retrieval-Augmented Generation—providing an LLM with external knowledge (documents) during inference to improve accuracy

NF4: NormalFloat 4-bit—a specific data type for quantization that optimizes the dynamic range for neural network weights

BF16: BFloat16—a floating-point format that retains the dynamic range of 32-bit floats with reduced precision, commonly used in ML training