Multi-View Encoders for Performance Prediction in LLM-Based Agentic Workflows

📝 Paper Summary

Automated Generation of Agentic Workflows
Performance Prediction for Agents

Agentic Predictor accelerates the design of multi-agent systems by using a lightweight neural network to estimate the performance of candidate workflows without running expensive LLM-based evaluations.

Core Problem

Finding optimal configurations for agentic workflows (e.g., topology, prompts, tools) is computationally prohibitive because evaluating each candidate requires expensive, slow, and repeated execution of Large Language Models.

Why it matters:

Developing effective agentic systems currently relies on trial-and-error manual engineering or costly search algorithms that waste computational resources validating poor candidates.
Existing automated methods (like GPTSwarm or ADAS) incur massive API costs by fully executing every candidate workflow during the search process.
Labeled data for agentic workflows (successful vs. failed runs) is extremely scarce, making it difficult to train standard supervised predictors.

Concrete Example: When designing a coding agent, a search algorithm might generate thousands of variations in communication patterns (e.g., 'Debate' vs. 'Code-Review'). Evaluating all of them requires running GPT-4 on a benchmark for every single variation, costing hundreds of dollars. Agentic Predictor instead estimates the success rate of each variation with a lightweight neural network, without executing the workflow.

Key Novelty

Agentic Predictor (Multi-View Encoder + Cross-Domain Pretraining)

Encodes workflows using three complementary views: the 'Graph View' for agent topology, the 'Code View' for logic/control flow, and the 'Prompt View' for semantic instructions.
Uses cross-domain unsupervised pretraining to learn generalizable workflow representations from unlabeled data, allowing the predictor to work effectively even with very few ground-truth performance labels.

Architecture

The Agentic Predictor framework, illustrating the Multi-View Encoder, Pretraining phase, and Predictor phase.

Evaluation Highlights

Improves prediction accuracy by up to 6.90% over strong graph-based baselines (averaged across three domains).
Increases workflow utility (ranking quality) by up to 5.87% compared to baselines.
Outperforms GNN-based baselines like Graph Transformer and GIN in predicting the success of unseen agentic workflows.

Breakthrough Assessment

7/10

Novel application of Neural Architecture Search (NAS) principles to Agentic Workflows. The multi-view encoding and pretraining strategy effectively addresses the unique heterogeneity and data scarcity of agent systems.

⚙️ Technical Details

Problem Definition

Setting: Performance prediction for agentic workflows represented as Directed Acyclic Graphs (DAGs)

Inputs: A candidate workflow W = {V, E, P, C} (agents, edges, prompts, code) and a task description T

Outputs: Estimated performance score ê (e.g., success probability)
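
As a rough illustration of the input described above, the sketch below models W = {V, E, P, C} as a small Python container and checks that the edges form a DAG. The class name, field names, and example agents are hypothetical and not taken from the paper; the cycle check uses the standard-library `graphlib` module (Python 3.9+).

```python
from dataclasses import dataclass
from graphlib import CycleError, TopologicalSorter


@dataclass
class Workflow:
    """Hypothetical container for W = {V, E, P, C}; names are illustrative only."""
    agents: list[str]              # V: agent (node) identifiers
    edges: list[tuple[str, str]]   # E: directed (source, target) pairs
    prompts: dict[str, str]        # P: prompt text per agent
    code: str                      # C: program text of the workflow

    def is_dag(self) -> bool:
        """Return True when the edges contain no directed cycle."""
        sorter = TopologicalSorter({agent: set() for agent in self.agents})
        for source, target in self.edges:
            sorter.add(target, source)   # target depends on source
        try:
            sorter.prepare()             # raises CycleError if a cycle exists
        except CycleError:
            return False
        return True


# Toy example: a three-agent chain plus a task description T.
review_chain = Workflow(
    agents=["planner", "coder", "reviewer"],
    edges=[("planner", "coder"), ("coder", "reviewer")],
    prompts={"planner": "Plan the solution.",
             "coder": "Write the code.",
             "reviewer": "Review the code."},
    code="plan = planner(task); draft = coder(plan); final = reviewer(draft)",
)
task = "Write a function that reverses a string."

# Adding a back edge from reviewer to planner creates a cycle.
cyclic = Workflow(
    agents=review_chain.agents,
    edges=review_chain.edges + [("reviewer", "planner")],
    prompts=review_chain.prompts,
    code=review_chain.code,
)
```

A predictor would receive a pair such as `(review_chain, task)` and return the score ê; rejecting cyclic candidates up front keeps inputs inside the DAG setting stated above.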

Pipeline Flow

Multi-View Encoding: Graph Encoder → Code Encoder → Prompt Encoder
Aggregation: Concatenate views → MLP Fusion
Task Integration: Task Description → Task Encoder
Prediction: Fused Workflow + Task Embedding → MLP Predictor
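
The flow above can be sketched end to end in a few lines. This is a minimal illustration, not the paper's implementation: it assumes each view encoder and the task text embedding have already produced fixed-size vectors, uses single randomly initialised layers in place of trained MLPs, and picks an arbitrary embedding width. All function and parameter names are hypothetical.

```python
import math
import random


def make_linear(n_in, n_out, rng):
    """Create a randomly initialised (weights, bias) pair for one dense layer."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    bias = [0.0] * n_out
    return weights, bias


def linear(x, layer):
    """Apply a dense layer: one dot product per output unit, plus bias."""
    weights, bias = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def predict(graph_emb, code_emb, prompt_emb, task_emb, params):
    # Aggregation: concatenate the three view embeddings, then fuse them.
    fused = relu(linear(graph_emb + code_emb + prompt_emb, params["fusion"]))
    # Task integration: encode the task description embedding.
    task = relu(linear(task_emb, params["task"]))
    # Prediction: fused workflow + task embedding -> logit -> probability.
    logit = linear(fused + task, params["head"])[0]
    return 1.0 / (1.0 + math.exp(-logit))


rng = random.Random(0)
dim = 4  # hypothetical embedding width
params = {
    "fusion": make_linear(3 * dim, dim, rng),
    "task": make_linear(dim, dim, rng),
    "head": make_linear(2 * dim, 1, rng),
}
# Stand-in embeddings; in practice these come from the view and task encoders.
graph_emb, code_emb, prompt_emb, task_emb = (
    [rng.random() for _ in range(dim)] for _ in range(4)
)
score = predict(graph_emb, code_emb, prompt_emb, task_emb, params)
```

With untrained weights the score is meaningless; the point is only the data flow from three view embeddings and one task embedding to a single value in (0, 1).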

System Modules

Graph Encoder (Workflow Encoding)

Encodes structural dependencies and agent interactions

Model or implementation: GNN with Cross-View Self-Attention

Code Encoder (Workflow Encoding)

Encodes program-level semantics and control flow

Model or implementation: L-layer MLP

Prompt Encoder (Workflow Encoding)

Encodes semantic instructions and agent roles

Model or implementation: L-layer MLP

Performance Predictor

Estimates the performance score of the workflow

Model or implementation: MLP (Multi-Layer Perceptron)

Novel Architectural Elements

Multi-view integration of Graph (topology), Code (logic), and Prompt (semantics) for agentic workflows
Cross-View Self-Attention block within the Graph Encoder to mix features from prompt, code, and operator graphs at the node level
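
To make the node-level mixing concrete, here is a simplified sketch of self-attention over the view features of a single node. It assumes each node carries one feature vector per view (prompt, code, operator) of equal width, and it omits the learned query/key/value projections that a real attention block would have; the paper's exact formulation is not given in this summary, so treat this as generic scaled dot-product attention.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_view_attention(view_tokens):
    """Scaled dot-product self-attention over the view tokens of one node.

    view_tokens: equal-length feature vectors, one per view.
    Query/key/value projections are identity here (a simplification).
    Returns (mixed_tokens, attention_weights).
    """
    dim = len(view_tokens[0])
    scale = math.sqrt(dim)
    weights = []
    for query in view_tokens:
        scores = [sum(q * k for q, k in zip(query, key)) / scale
                  for key in view_tokens]
        weights.append(softmax(scores))
    # Each output token is a weighted average of all view tokens.
    mixed = [
        [sum(w * token[j] for w, token in zip(row, view_tokens)) for j in range(dim)]
        for row in weights
    ]
    return mixed, weights


# Toy features for one node: prompt, code, and operator views (made-up values).
node_views = [
    [1.0, 0.0, 0.5, 0.2],   # prompt view
    [0.9, 0.1, 0.4, 0.3],   # code view
    [0.0, 1.0, 0.0, 0.7],   # operator view
]
mixed, weights = cross_view_attention(node_views)
```

Each row of `weights` shows how much one view attends to the others for that node, which is the sense in which features are mixed across views before graph message passing.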

Modeling

Base Model: Custom Multi-View Encoder architecture (GNN + MLPs)

Training Method: Two-stage training: (1) Unsupervised Pretraining, (2) Supervised Fine-tuning

Objective Functions:

Reconstruction loss (L_rec)

Purpose: Reconstruct input modalities from the latent representation.

Formally: L_rec = Mean Squared Error between input (G, C, P) and reconstructed (Ĝ, Ĉ, P̂)

Contrastive loss (L_con)

Purpose: Align representations of the same workflow across different views (e.g., Graph vs. Code).

Formally: L_con = Contrastive Loss (InfoNCE) maximizing similarity of positive pairs and minimizing similarity of negative pairs

Prediction loss (L_pred)

Purpose: Minimize error between predicted and ground-truth performance.

Formally: L_pred = Cross-Entropy (for binary labels) or MSE (for regression)
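
The three objectives can be written out in plain Python as a sanity check on what each one measures. This is a sketch under assumptions: the InfoNCE variant uses cosine similarity with in-batch negatives and a temperature of 0.1, none of which is confirmed by the summary, and how the terms are weighted or combined is not specified here.

```python
import math


def mse(x, x_hat):
    """Mean squared error between an input vector and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def info_nce(view_a, view_b, temperature=0.1):
    """InfoNCE over a batch: row i of view_a and row i of view_b are the positive pair.

    All other rows of view_b act as negatives for row i (an assumed setup).
    """
    total = 0.0
    for i, anchor in enumerate(view_a):
        logits = [cosine(anchor, other) / temperature for other in view_b]
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]   # -log softmax of the positive pair
    return total / len(view_a)


def binary_cross_entropy(p, y, eps=1e-12):
    """Cross-entropy for one predicted probability p and binary label y."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


# Toy embeddings of two workflows under two views (made-up values).
graph_view = [[1.0, 0.0], [0.0, 1.0]]
code_view_aligned = [[1.0, 0.0], [0.0, 1.0]]
code_view_shuffled = [[0.0, 1.0], [1.0, 0.0]]

aligned_loss = info_nce(graph_view, code_view_aligned)
shuffled_loss = info_nce(graph_view, code_view_shuffled)
```

When the two views of the same workflow agree, the contrastive loss is close to zero; when the pairing is shuffled, it is large, which is the alignment pressure the pretraining stage relies on.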

Key Hyperparameters:

learning_objective: Reconstruction + Contrastive + Prediction

Compute: Not reported in the paper

Comparison to Prior Work

vs. FLORA-Bench/GNNs: Agentic Predictor uses Multi-View Encoding (Code + Prompt + Graph) rather than just a single graph view
vs. Standard Predictors: Agentic Predictor uses Cross-Domain Unsupervised Pretraining to handle label scarcity
vs. NAS Predictors (e.g., BRP-NAS): Specifically adapted for agentic workflows by including prompt and code semantics (this comparison is not cited in the paper; it is implied context)

Limitations

Relies on the availability of unlabeled workflow data for pretraining
Evaluation is limited to three specific domains in the benchmark
Does not propose a new search algorithm, only the predictor component
Computational cost of pretraining is not explicitly analyzed

Reproducibility

No code URL provided in the paper. The method relies on datasets and benchmarks (like FLORA-Bench) but does not link to a specific repository for the Agentic Predictor implementation.

📊 Experiments & Results

Evaluation Setup

Performance prediction on a benchmark of agentic workflows

Benchmarks:

Benchmark spanning three domains (Agentic Workflow Execution) [New]

Metrics:

Kendall's Tau (Rank Correlation)
Pearson Correlation
RMSE (Root Mean Square Error)

Statistical methodology: Not explicitly reported in the paper
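
For readers unfamiliar with the three metrics, the sketch below computes them directly from a list of predicted scores and a list of true scores. The Kendall's Tau shown is the simple tau-a form without tie correction; which variant the paper uses is not stated in this summary, and the example scores are made up.

```python
import math
from itertools import combinations


def kendall_tau(pred, true):
    """Kendall's tau-a: (concordant - discordant) / number of pairs.

    Tied pairs count as neither concordant nor discordant.
    """
    concordant = discordant = 0
    for i, j in combinations(range(len(pred)), 2):
        sign = (pred[i] - pred[j]) * (true[i] - true[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(pred) * (len(pred) - 1) // 2
    return (concordant - discordant) / n_pairs


def pearson(pred, true):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(pred)
    mean_p, mean_t = sum(pred) / n, sum(true) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, true))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_t = sum((t - mean_t) ** 2 for t in true)
    return cov / math.sqrt(var_p * var_t)


def rmse(pred, true):
    """Root mean square error between predictions and ground truth."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)) / len(pred))


# Made-up predicted vs. true success rates for four workflows.
predicted = [0.1, 0.4, 0.3, 0.9]
observed = [0.2, 0.5, 0.6, 0.8]

tau = kendall_tau(predicted, observed)   # 5 concordant pairs, 1 discordant, 6 total
```

Kendall's Tau only looks at pairwise ordering, so it rewards a predictor that ranks workflows correctly even when its absolute scores are off, whereas RMSE penalises the absolute error.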

Key Results

Averaged prediction accuracy results across three domains demonstrate the superiority of the Multi-View approach.

Benchmark	Metric	Baseline	This Paper	Δ
Average across 3 domains	Prediction Accuracy (Correlation)	Not reported as single aggregate	Not reported as single aggregate	-
Average across 3 domains	Workflow Utility	Not reported as single aggregate	Not reported as single aggregate	-

Main Takeaways

Multi-view encoding (Graph + Code + Prompt) consistently outperforms single-view graph baselines.
Cross-domain unsupervised pretraining significantly improves performance when labeled data is scarce.
The predictor effectively ranks workflows, enabling efficient search without exhaustive evaluation.
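
One way a ranking predictor can cut evaluation cost is to score every candidate cheaply and run the expensive LLM-based evaluation only on a shortlist. The paper does not propose a search algorithm, so the function below is a generic illustration rather than the authors' procedure; the workflow names and all scores are invented for the example.

```python
def predictor_guided_search(candidates, predict_score, evaluate, budget):
    """Rank candidates by predicted score; evaluate only the top `budget` of them.

    predict_score: cheap callable returning an estimated score.
    evaluate: expensive callable returning the true score.
    Returns (best_candidate, evaluated_scores).
    """
    ranked = sorted(candidates, key=predict_score, reverse=True)
    shortlist = ranked[:budget]
    results = {candidate: evaluate(candidate) for candidate in shortlist}
    best = max(results, key=results.get)
    return best, results


# Made-up predicted and true success rates for four hypothetical workflows.
predicted_scores = {"debate": 0.62, "code_review": 0.81, "chain": 0.35, "ensemble": 0.74}
true_scores = {"debate": 0.60, "code_review": 0.78, "chain": 0.40, "ensemble": 0.80}
evaluation_calls = []


def costly_evaluation(name):
    """Stand-in for running the workflow on a benchmark with an LLM."""
    evaluation_calls.append(name)
    return true_scores[name]


best, results = predictor_guided_search(
    list(predicted_scores), predicted_scores.get, costly_evaluation, budget=2
)
```

In this toy run only two of the four candidates are executed, and the best true performer is still found because the predictor placed it in the shortlist; the savings depend entirely on how well the predicted ranking tracks the true one.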

📚 Prerequisite Knowledge

Prerequisites

Graph Neural Networks (GNNs)
Neural Architecture Search (NAS) concepts
Basics of Large Language Model (LLM) agents
Contrastive Learning

Key Terms

Agentic Workflow: A system of multiple LLM-based agents connected in a specific topology (DAG) to solve complex tasks

NAS: Neural Architecture Search, the automated design of neural network architectures; adapted here to automating the design of agent systems

DAG: Directed Acyclic Graph, a directed graph with no cycles, used here to represent the flow of information between agents

Contrastive Learning: A self-supervised learning technique that trains a model to pull representations of similar items closer and push dissimilar ones apart

Multi-view Encoder: A neural network architecture that processes different data modalities (graph, code, text) separately before fusing them

Kendall's Tau: A rank correlation coefficient used to measure how well the predictor ranks workflows compared to their true performance order