El Agente Gráfico: Structured Execution Graphs for Scientific Agents

📝 Paper Summary

Scientific Agents, Structured Execution, Agentic Workflows

El Agente Gráfico replaces unstructured text-based agent contexts with typed execution graphs and a persistent knowledge graph to enable scalable, auditable, and cost-efficient scientific workflows.

Core Problem

Current scientific agents rely on unstructured text to manage context, which creates overwhelming information volume, obscures decision provenance, and causes misconfiguration when handling heterogeneous scientific tools.

Why it matters:

Numerical correctness and state fidelity are critical in science but are often lost in conversational LLM contexts
Large scientific data artifacts (e.g., electronic densities) cannot be efficiently serialized into LLM context windows
Splitting work across multiple agents to manage context load introduces coordination failures and prohibitive token costs (e.g., >$4 per run in prior baselines)

Concrete Example: In a pKa prediction task, a 'bare' LLM agent (equipped only with code execution and web search) failed to check for imaginary frequencies and confused solvation models, yielding a chemically implausible pKa ≈ -5.0.

Key Novelty

Type-Safe Execution Graphs backed by Knowledge Graph Persistence

Embeds LLM decision-making within structured 'execution graphs' where nodes represent validated state transformations (e.g., DFT calculations) rather than free-form text
Uses an Object-Graph Mapper (OGM) to serialize Python objects into a persistent Knowledge Graph, allowing 'heavy' scientific data to be referenced by symbolic identifiers (IRIs) instead of raw text
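The idea of a node as a validated state transformation can be sketched in a few lines. This is a minimal illustration that uses only the standard library, not the paper's pydantic-graph implementation; the class names `FrequencyResult` and `ValidatedMinimum`, the function `validate_minimum`, and the `ex:` identifiers are hypothetical. It also assumes the common convention that imaginary modes are reported as negative wavenumbers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FrequencyResult:
    """Hypothetical typed state emitted by a frequency-calculation node."""
    molecule_iri: str                    # symbolic reference, not raw data
    frequencies_cm1: tuple[float, ...]   # vibrational modes in cm^-1


@dataclass(frozen=True)
class ValidatedMinimum:
    """Hypothetical typed state: a structure confirmed to be a minimum."""
    molecule_iri: str
    lowest_mode_cm1: float


def validate_minimum(result: FrequencyResult) -> ValidatedMinimum:
    """Node body: transform one typed state into the next, or fail loudly."""
    if not result.frequencies_cm1:
        raise ValueError("no frequencies supplied")
    # Assumption: imaginary modes appear as negative values.
    imaginary = [f for f in result.frequencies_cm1 if f < 0.0]
    if imaginary:
        raise ValueError(f"{len(imaginary)} imaginary mode(s) found")
    return ValidatedMinimum(result.molecule_iri, min(result.frequencies_cm1))


ok = validate_minimum(FrequencyResult("ex:water", (1595.0, 3657.0, 3756.0)))

try:
    validate_minimum(FrequencyResult("ex:saddle", (-512.3, 800.0)))
    rejected = False
except ValueError:
    rejected = True
```

Because the check lives inside the node rather than in the prompt, a downstream step cannot receive a structure with imaginary frequencies, which is the kind of omission described in the pKa example.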

Evaluation Highlights

Reduces operating cost by ~96% compared to the multi-agent 'El Agente Q' baseline ($4.67 → $0.17 per run with gpt-5)
Achieves >6x speedup in wall-clock time (1,827s → 228s) by eliminating inter-agent communication overhead and enabling parallel execution
Outperforms El Agente Q on numerical correctness, achieving 98.88% accuracy with gpt-5 compared to the baseline's 88.25%

Breakthrough Assessment

9/10

Drastically reduces the cost and complexity of scientific agents while improving accuracy. The shift from text-based context to typed execution graphs addresses the fundamental bottleneck of LLM context limits in data-heavy domains.

⚙️ Technical Details

Problem Definition

Setting: Automated execution of multi-step, heterogeneous scientific workflows (specifically quantum chemistry and materials science)

Inputs: Natural language queries or molecular structure files (e.g., xyz)

Outputs: Verified scientific properties (energies, spectra, MOF structures) and persistent knowledge graph entries

Pipeline Flow

User Request → Main Agent → Execution Graph Tool → Router Agent → Node Execution → OGM Persistence

System Modules

Main Agent

Interprets user intent, manages high-level tool selection (Graph vs. Search vs. Code), and synthesizes final answers

Model or implementation: gpt-5 / gpt-5.x / sonnet-4.5 (configurable)

Execution Graph Tool

Encapsulates specific scientific workflows (e.g., PySCF optimization) as structured directed graphs

Model or implementation: Python logic (pydantic-graph)

Router Agent

Selects the next node in the execution graph when multiple transitions are possible

Model or implementation: gpt-4o-mini (fixed for benchmarks)
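A rough sketch of the routing step follows, assuming the router is consulted only when a node has more than one admissible successor. The workflow table, node names, and `rule_based_router` are hypothetical stand-ins; in the described system an LLM makes this choice, and the graph library is pydantic-graph rather than a plain dictionary.

```python
from typing import Callable, Optional

# Hypothetical workflow: node name -> admissible next nodes.
TRANSITIONS: dict[str, list[str]] = {
    "optimize": ["frequencies"],
    "frequencies": ["reoptimize", "single_point"],
    "reoptimize": ["frequencies"],
    "single_point": [],
}

Router = Callable[[str, list[str], dict], str]


def rule_based_router(current: str, options: list[str], state: dict) -> str:
    """Stand-in for the LLM router: re-optimize if an imaginary mode remains."""
    return "reoptimize" if state.get("has_imaginary_mode") else "single_point"


def next_node(current: str, state: dict, router: Router) -> Optional[str]:
    """Return the next node, asking the router only when there is a choice."""
    options = TRANSITIONS[current]
    if not options:
        return None              # terminal node
    if len(options) == 1:
        return options[0]        # deterministic edge, no model call needed
    choice = router(current, options, state)
    if choice not in options:
        # The graph, not the router, defines which transitions are legal.
        raise ValueError(f"router chose inadmissible node: {choice!r}")
    return choice


step_a = next_node("optimize", {}, rule_based_router)
step_b = next_node("frequencies", {"has_imaginary_mode": True}, rule_based_router)
step_c = next_node("single_point", {}, rule_based_router)
```

Restricting the router to a closed set of admissible nodes keeps its prompt small, which may be part of why a lightweight model can serve in this role.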

Object-Graph Mapper (OGM)

Serializes/deserializes Python objects to/from the Knowledge Graph

Model or implementation: Customized 'twa' package
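To make the object-to-identifier mapping concrete, here is a toy in-memory sketch. It does not use or imitate the 'twa' package API; `ToyObjectStore`, `ElectronDensity`, `push`, `pull`, and the `https://example.org/kg/` base are all hypothetical names chosen for illustration, and a dictionary stands in for the knowledge graph.

```python
import uuid
from dataclasses import dataclass


@dataclass
class ElectronDensity:
    """Hypothetical 'heavy' artifact: a flattened density grid."""
    grid_values: list[float]


class ToyObjectStore:
    """In-memory stand-in for a knowledge graph keyed by IRI."""

    def __init__(self, base_iri: str = "https://example.org/kg/") -> None:
        self._base = base_iri
        self._objects: dict[str, object] = {}

    def push(self, obj: object) -> str:
        """Store the object and return a fresh identifier for it."""
        iri = f"{self._base}{type(obj).__name__}/{uuid.uuid4()}"
        self._objects[iri] = obj
        return iri

    def pull(self, iri: str) -> object:
        """Fetch the stored object by its identifier."""
        return self._objects[iri]


store = ToyObjectStore()
density = ElectronDensity(grid_values=[0.0] * 100_000)
iri = store.push(density)

# The agent context carries only the short identifier, not the grid itself.
context_message = f"Density stored at {iri}"
```

The message passed to the model stays a fixed, short string regardless of how large the stored grid is, while any tool that needs the data can resolve the identifier.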

Novel Architectural Elements

Structured Execution Graphs as Tools: Workflows are not generated code but pre-defined, typed graph structures that the agent navigates
State decoupling via OGM: Execution state is decoupled from the LLM context window using typed IRIs, allowing massive data to persist without re-serialization overhead

Modeling

Base Models (multiple evaluated): gpt-5, gpt-5.1, gpt-5.2, gpt-4.1, sonnet-3.7, sonnet-4.5, minimax-m2, qwen3-max

Compute: Evaluation performed on a compute node with 4 H100 GPUs

Comparison to Prior Work

vs. El Agente Q: Gráfico is single-agent with structured graph execution, reducing cost by ~96% and running more than 6x faster while improving accuracy
vs. DREAMS: Gráfico uses an OGM for deep semantic persistence of objects rather than just tracking file paths/status (DREAMS is mentioned in the paper's introduction but is not used as a direct baseline)

Limitations

Current design assumes a single-session context; extending to distributed/long-horizon tasks requires synchronization of shared Knowledge Graphs
Workflows are coupled through global GPU contexts, so parallel execution requires careful configuration
Building the underlying ontologies and execution graphs is labor-intensive compared to prompt-only agents

📊 Experiments & Results

Evaluation Setup

Benchmark of 6 university-level quantum chemistry exercises (2 difficulty levels each), repeated 10 times (120 runs total per model).
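The run count follows directly from the three factors in the setup. A short check, with variable names chosen here for illustration:

```python
exercises = 6            # university-level quantum chemistry exercises
difficulty_levels = 2    # levels per exercise
repetitions = 10         # repeats of each exercise/level pair

runs_per_model = exercises * difficulty_levels * repetitions
```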

Benchmarks:

Quantum Chemistry Exercises (Scientific Calculation & Reasoning)

Metrics:

Numerical Evaluation Score (%)
LLM Judge Evaluation Score (%)
Trace Tokens
Token Cost (USD)
Task Duration (s)

Statistical methodology: Not explicitly reported in the paper

Key Results

Performance comparison showing Gráfico's efficiency gains over the multi-agent baseline (El Agente Q) and differences between various LLM backbones.
Benchmark	Metric	Baseline	This Paper	Δ
Quantum Chemistry Exercises	Token Cost (USD)	4.67	0.17	-4.50
Quantum Chemistry Exercises	Task Duration (s)	1827	228	-1599
Quantum Chemistry Exercises	Numerical eval. (%)	88.25	98.88	+10.63
Quantum Chemistry Exercises	Trace tokens	1649616	83613	-1566003
Quantum Chemistry Exercises	Numerical eval. (%)	93.71	98.88	+5.17
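The relative figures quoted elsewhere in this summary can be re-derived from the cost, duration, and trace-token rows of the table. The sketch below only performs that arithmetic; the dictionary keys and helper names are illustrative, and the input numbers are copied from the rows as reported.

```python
# Values copied from the table (El Agente Q baseline vs. Gráfico).
baseline = {"cost_usd": 4.67, "duration_s": 1827, "trace_tokens": 1_649_616}
grafico = {"cost_usd": 0.17, "duration_s": 228, "trace_tokens": 83_613}


def percent_reduction(before: float, after: float) -> float:
    """Relative decrease from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before


def speedup(before: float, after: float) -> float:
    """How many times faster `after` is than `before`."""
    return before / after


cost_reduction = percent_reduction(baseline["cost_usd"], grafico["cost_usd"])
time_speedup = speedup(baseline["duration_s"], grafico["duration_s"])
token_reduction = percent_reduction(
    baseline["trace_tokens"], grafico["trace_tokens"]
)

print(f"cost reduction:  {cost_reduction:.1f}%")
print(f"speedup:         {time_speedup:.1f}x")
print(f"token reduction: {token_reduction:.1f}%")
```

With these inputs the cost reduction comes out at about 96.4%, the duration ratio at about 8.0x (consistent with a claim of more than 6x), and the trace-token reduction at about 94.9%.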

Main Takeaways

Single-agent architectures with structured execution graphs can outperform multi-agent systems in scientific tasks by eliminating coordination overhead.
Cost of intelligence decreases across GPT generations; gpt-5 achieves higher accuracy with fewer tokens than gpt-4.1.
Distinct interaction patterns exist between model families: GPT models tend to batch tool calls efficiently, while Claude (Sonnet) models prefer interleaving reasoning and serial tool use, increasing costs.

📚 Prerequisite Knowledge

Prerequisites

Agentic AI architectures (tool use, routing)
Knowledge Graphs and Ontologies
Computational Chemistry (DFT, geometry optimization)
Python type systems (Pydantic)

Key Terms

OGM: Object-Graph Mapper—a layer that translates between in-memory Python objects and their representation in a graph database

IRI: Internationalized Resource Identifier—a unique symbolic identifier used to reference objects in the knowledge graph without reloading their full data

Execution Graph: A directed graph where nodes represent specific computational steps (e.g., 'Optimization', 'Frequency Calc') and edges define admissible data flows

ConceptualAtoms: A unified in-memory abstraction class for representing molecular and periodic systems across different software packages

pass@k: A metric measuring the probability that at least one correct solution is generated within k attempts

GPU4PySCF: A GPU-accelerated version of the PySCF quantum chemistry package

DFT: Density Functional Theory—a computational quantum mechanical modelling method used to investigate the electronic structure of systems

MOF: Metal-Organic Framework—a class of compounds consisting of metal ions or clusters coordinated to organic ligands

CIF: Crystallographic Information File—a standard text file format for representing crystallographic information
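For the pass@k entry above, one widely used way to estimate the metric from n sampled attempts with c correct ones is the unbiased combinatorial estimator 1 - C(n-c, k) / C(n, k). This summary does not state which estimator the paper uses, so the sketch below is an assumption offered for orientation only.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n attempts of which c were correct.

    Computes 1 - C(n - c, k) / C(n, k): one minus the probability that a
    random size-k subset of the n attempts contains no correct attempt.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset must include a correct attempt
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example: 10 attempts, 3 correct.
p1 = pass_at_k(n=10, c=3, k=1)
p5 = pass_at_k(n=10, c=3, k=5)
```

For k = 1 the estimator reduces to the plain success fraction c / n, which is a quick sanity check on any implementation.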