IntellAgent: A Multi-Agent Framework for Evaluating Conversational AI Systems

📝 Paper Summary

Conversational AI Evaluation | Synthetic Data Generation | Multi-Agent Simulation

IntellAgent automates the evaluation of conversational AI by using a policy-driven graph to generate diverse, synthetic multi-turn scenarios that test agents against complex policy constraints and tool usage.

Core Problem

Evaluating conversational agents is challenging because they must navigate complex multi-turn dialogues and strict policies, yet existing benchmarks are static, small-scale, and manually curated.

Why it matters:

Manual benchmarks like tau-bench are expensive to scale (tau-bench contains only 50 Airline and 115 Retail samples), limiting the ability to test edge cases
Standard evaluation metrics are often coarse-grained (pass/fail), failing to diagnose specific policy violations or tool misuse
Real-world deployment requires high reliability in enforcing business rules (e.g., refunds, auth), which current static datasets cannot adequately stress-test

Concrete Example: A user might request a flight modification that triggers two conflicting rules: 'cannot add insurance after booking' and 'must verify ID first.' A simple benchmark might miss whether the agent correctly prioritizes ID verification before addressing the insurance policy, whereas IntellAgent explicitly tests this interaction.

Key Novelty

Graph-based Policy Modeling for Event Generation

Models domain policies as a graph where nodes are policies and edges represent the likelihood of co-occurrence, derived via LLM scoring
Generates scenarios by performing random walks on this graph, allowing precise control over interaction complexity (sum of policy weights) and diversity
Uses a symbolic entity generator to create self-consistent synthetic databases (e.g., users, reservations) that support the generated scenarios
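
The random-walk idea above can be illustrated with a minimal sketch. The policy names, node weights, and edge scores below are hypothetical placeholders, and the stopping rule (walk until the summed weights reach a target) is an assumption about how complexity might be controlled, not the paper's exact procedure.

```python
import random

# Hypothetical policy graph: node weights model per-policy complexity,
# edge scores (1-10) model how naturally two policies co-occur.
POLICY_WEIGHTS = {
    "verify_id": 1,
    "no_insurance_after_booking": 2,
    "refund_within_24h": 3,
    "user_consent": 2,
}
EDGE_SCORES = {
    ("verify_id", "no_insurance_after_booking"): 8,
    ("verify_id", "refund_within_24h"): 6,
    ("verify_id", "user_consent"): 5,
    ("no_insurance_after_booking", "refund_within_24h"): 3,
    ("no_insurance_after_booking", "user_consent"): 7,
    ("refund_within_24h", "user_consent"): 4,
}


def edge_score(a, b):
    """Look up the undirected co-occurrence score between two policies."""
    return EDGE_SCORES.get((a, b)) or EDGE_SCORES.get((b, a), 0)


def sample_policy_path(target_complexity, rng):
    """Weighted random walk that stops once the summed weights reach the target."""
    current = rng.choice(sorted(POLICY_WEIGHTS))
    path, total = [current], POLICY_WEIGHTS[current]
    while total < target_complexity:
        # Only unvisited neighbours with a positive score are eligible.
        candidates = [
            p for p in sorted(POLICY_WEIGHTS)
            if p not in path and edge_score(current, p) > 0
        ]
        if not candidates:
            break
        scores = [edge_score(current, p) for p in candidates]
        current = rng.choices(candidates, weights=scores, k=1)[0]
        path.append(current)
        total += POLICY_WEIGHTS[current]
    return path, total


path, total = sample_policy_path(target_complexity=5, rng=random.Random(0))
print(path, total)
```

Raising `target_complexity` yields longer policy paths, which is one way the "sum of policy weights" could act as a difficulty dial.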

Architecture

Overview of the IntellAgent pipeline from input schema to final report

Evaluation Highlights

Strong correlation with human-curated tau-bench: Pearson coefficients of 0.98 (Airline) and 0.92 (Retail), validating the synthetic approach
Scalability: Generated 1,000 diverse events per domain compared to tau-bench's 50 (Airline) and 115 (Retail), enabling fine-grained complexity analysis
Identifies a specific weakness: All tested models (including GPT-4o) struggle significantly with 'user consent' policies, a category not covered by existing manual benchmarks

Breakthrough Assessment

8/10

Highly practical framework that solves the data bottleneck in agent evaluation. The strong correlation with manual benchmarks suggests it can replace expensive human curation for stress-testing complex agents.

⚙️ Technical Details

Problem Definition

Setting: Evaluation of task-oriented conversational agents involving database interactions and policy constraints

Inputs: Database schema, chatbot system prompt (or policy document)

Outputs: Comprehensive performance report including success rates, policy adherence analysis, and failure diagnostics

Pipeline Flow

Policy Graph Construction (extracts policies, builds weighted graph)
Event Generation (samples policy paths, generates scenarios + DB states)
Simulation (User Agent interacts with Candidate Agent)
Critique (Analyzes logs for success/policy adherence)
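
As a rough sketch of how the last two stages might fit together, the snippet below runs already-generated events through a simulator and a critic and summarizes the outcome. The `Event` and `Verdict` types, `run_pipeline`, and the stand-in stage functions are all hypothetical names chosen for illustration; in the framework itself these stages are LLM-driven.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    policies: list      # policies sampled from the graph
    user_request: str   # scenario the simulated user pursues
    db_state: dict      # initial database rows backing the scenario


@dataclass
class Verdict:
    goal_met: bool
    violated_policies: list = field(default_factory=list)


def run_pipeline(events, simulate, critique):
    """Simulate each event, critique the dialog, and summarize pass/fail."""
    verdicts = [critique(event, simulate(event)) for event in events]
    passed = sum(v.goal_met and not v.violated_policies for v in verdicts)
    rate = passed / len(verdicts) if verdicts else 0.0
    return {"events": len(verdicts), "success_rate": rate}


# Stand-in stages (assumptions): fixed strings replace the LLM calls.
def fake_simulate(event):
    return [("user", event.user_request), ("agent", "Please verify your ID first.")]


def fake_critique(event, dialog):
    asked_for_id = any("verify your ID" in text for _, text in dialog)
    violated = [] if asked_for_id else ["verify_id"]
    return Verdict(goal_met=True, violated_policies=violated)


events = [Event(["verify_id"], "Add insurance to my booking", {"users": []})]
report = run_pipeline(events, fake_simulate, fake_critique)
print(report)
```

Passing the stages in as functions keeps the candidate agent swappable, which matches the framework's goal of evaluating different chatbots against the same events.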

System Modules

Policy Graph Constructor

Parses system prompts to identify policies and builds a graph where edge weights (1-10) reflect co-occurrence likelihood

Model or implementation: GPT-4o (used for extraction and scoring)

Event Generator Agent

Samples a path from the policy graph and generates a coherent user request and corresponding initial database state

Model or implementation: GPT-4o

User Agent (Simulator) (Evaluation Phase)

Simulates the end-user interacting with the chatbot, driven by the generated scenario goals

Model or implementation: GPT-4o

Dialog Critique (Evaluation Phase)

Analyzes the completed dialog to verify if the goal was met and if all policies were respected

Model or implementation: GPT-4o
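
One way critique outputs could be rolled up into a per-policy diagnostic is sketched below. The input shape (pairs of tested and violated policy names) is an assumption made for illustration, and the sample records are placeholders rather than results from the paper.

```python
from collections import defaultdict


def policy_failure_rates(results):
    """Return, per policy, the fraction of dialogs testing it that violated it.

    results: iterable of (tested_policies, violated_policies) pairs,
    one pair per critiqued dialog (hypothetical format).
    """
    tested = defaultdict(int)
    violated = defaultdict(int)
    for tested_policies, violated_policies in results:
        for policy in set(tested_policies):
            tested[policy] += 1
        for policy in set(violated_policies):
            violated[policy] += 1
    # Violations of policies that were never tested are ignored.
    return {policy: violated[policy] / tested[policy] for policy in tested}


# Placeholder records, not data from the paper.
sample = [
    (["verify_id", "user_consent"], ["user_consent"]),
    (["user_consent"], ["user_consent"]),
    (["verify_id"], []),
]
print(policy_failure_rates(sample))
```

A breakdown like this is what makes it possible to single out a weak category, such as consent-related policies, instead of reporting only an overall pass rate.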

Novel Architectural Elements

Graph-based sampling for complexity control: unlike random sampling, this ensures realistic policy combinations via weighted random walks
Symbolic entity instantiation: decouples entity logic from raw database schema to handle complex constraints automatically
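
The symbolic step can be pictured as follows: entities and their relations are declared by name first, and concrete rows are produced afterwards so that foreign keys always agree. The table names, column names, and ID format here are hypothetical, and this is only a sketch of the general idea rather than the framework's implementation.

```python
import itertools

# Symbolic description: named entities and relations, no concrete values yet.
SYMBOLIC_EVENT = {
    "entities": {
        "User A": "users",
        "Flight B": "flights",
        "Reservation C": "reservations",
    },
    "relations": [
        ("Reservation C", "user_id", "User A"),
        ("Reservation C", "flight_id", "Flight B"),
    ],
}


def instantiate(symbolic):
    """Turn a symbolic event into database rows with consistent references."""
    counter = itertools.count(1)
    # Assign every symbolic entity a concrete ID up front.
    ids = {
        name: f"{table}_{next(counter)}"
        for name, table in symbolic["entities"].items()
    }
    db = {table: [] for table in symbolic["entities"].values()}
    rows = {}
    for name, table in symbolic["entities"].items():
        rows[name] = {"id": ids[name]}
        db[table].append(rows[name])
    # Resolve relations through the shared ID map so references cannot drift.
    for source, column, target in symbolic["relations"]:
        rows[source][column] = ids[target]
    return db


db = instantiate(SYMBOLIC_EVENT)
print(db["reservations"])
```

Because every reference is resolved through one ID map, a reservation cannot point at a user or flight that was never created.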

Modeling

Base Model: GPT-4o (used as the engine for all internal IntellAgent components: generation, simulation, critique)

Compute: Not reported in the paper (Framework is inference-based; depends on the model being tested)

Comparison to Prior Work

vs. tau-bench: Fully automated synthetic generation (scales to 1000+ events) vs. manual curation (50 to 115 events per domain); fine-grained complexity control via graph modeling
vs. ALMITA: Generalizes to any domain via schema/policy inputs vs. focus on customer support workflows
vs. RAGAS: Evaluates multi-turn tool-use agents and policy adherence vs. single-turn retrieval/generation quality

Limitations

Relies on the quality of the LLM (GPT-4o) used for simulation and critique; biases in the generator model could propagate to evaluation
Synthetic data might not perfectly capture the 'long tail' of messy human irrationality found in real logs
The graph construction depends on the clarity of the input policy documents; ambiguous policies may lead to poor graph structures

Reproducibility

Code: https://github.com/plurai-ai/intellagent

The code is publicly available. The framework relies on GPT-4o for generation and simulation. Benchmark environments (Airline, Retail) are adapted from tau-bench.

📊 Experiments & Results

Evaluation Setup

Task-oriented dialogues in Retail and Airline domains, involving database reading/writing and policy enforcement

Benchmarks:

IntellAgent Benchmark (Synthetic) (Conversational Agent Evaluation) [New]
tau-bench (Conversational Agent Evaluation)

Metrics:

Success Rate (Pass/Fail based on goal completion and policy adherence)
Pearson Correlation (between IntellAgent and tau-bench scores)
Statistical methodology: Pearson correlation coefficient reported to validate alignment with tau-bench
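
For reference, the Pearson coefficient between two sets of per-model scores can be computed as in the sketch below. The input lists are placeholder numbers chosen to be perfectly linear, not results from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (spread_x * spread_y)


# Placeholder success rates for three hypothetical models on each benchmark.
synthetic_scores = [0.2, 0.4, 0.6]
manual_scores = [0.3, 0.5, 0.7]
print(pearson(synthetic_scores, manual_scores))
```

A value near 1 means the two benchmarks rank and space the models similarly, even if the absolute success rates differ.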

Key Results

Validation results showing that IntellAgent's synthetic evaluation strongly correlates with the manually curated tau-bench.

Benchmark	Metric	Baseline	This Paper	Δ
tau-bench (Airline)	Pearson Correlation	1.0	0.98	0.02
tau-bench (Retail)	Pearson Correlation	1.0	0.92	0.08

Main Takeaways

Model performance declines consistently as scenario complexity (sum of policy weights) increases, but the rate of decline varies by model (e.g., Gemini-1.5-pro maintains performance longer than GPT-4o-mini)
High correlation with manual benchmarks (0.92-0.98) indicates that purely synthetic, graph-driven evaluation can serve as a reliable proxy for human-curated tests
Policy-specific analysis reveals hidden gaps: nearly all models fail on 'User Consent' policies, a blind spot in previous benchmarks like tau-bench
Weighted probability sampling in the policy graph balances diversity and realism better than uniform or max-weight sampling
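
The contrast between the three sampling strategies in the last takeaway can be sketched as a choice of the next policy given edge scores from the current node. The function name, strategy labels, and scores are hypothetical; the sketch only illustrates why weighted sampling sits between the other two.

```python
import random


def next_policy(scores, strategy, rng):
    """Pick the next policy given edge scores from the current node.

    scores: dict mapping candidate policy -> co-occurrence score (hypothetical).
    """
    candidates = sorted(scores)
    if strategy == "uniform":
        # Ignores scores: diverse, but may pair unrelated policies.
        return rng.choice(candidates)
    if strategy == "max_weight":
        # Always the most likely neighbour: realistic, but repetitive.
        return max(candidates, key=lambda p: scores[p])
    if strategy == "weighted":
        # Probability proportional to score: favours likely pairs, keeps variety.
        weights = [scores[p] for p in candidates]
        return rng.choices(candidates, weights=weights, k=1)[0]
    raise ValueError(f"unknown strategy: {strategy}")


scores = {"user_consent": 9, "refund_within_24h": 1}
rng = random.Random(0)
draws = [next_policy(scores, "weighted", rng) for _ in range(1000)]
print(draws.count("user_consent"), draws.count("refund_within_24h"))
```

With weighted sampling the low-score neighbour still appears occasionally, whereas max-weight sampling would never select it and uniform sampling would select it about half the time.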

📚 Prerequisite Knowledge

Prerequisites

Understanding of task-oriented dialogue systems
Familiarity with tool-use agents (function calling)
Basic graph theory (nodes, edges, random walks)

Key Terms

Policy Graph: A graph where nodes represent business policies and edges represent the likelihood/naturalness of them appearing together in a single conversation

tau-bench: A manually curated benchmark for conversational agents used as a baseline to validate IntellAgent's synthetic results

Symbolic Representation: An intermediate abstraction used by the Event Generator to define entities (e.g., 'User A', 'Flight B') before populating specific database rows, ensuring consistency

Random Walk: A method of traversing the policy graph to sample a sequence of policies that forms the basis of a synthetic user scenario

LangGraph: The framework used to implement the multi-agent orchestration in IntellAgent