Evaluation Setup
Performance prediction on a benchmark of agentic workflows
Benchmarks:
- Benchmark spanning three domains (Agentic Workflow Execution) [New]
Metrics:
- Kendall's Tau (Rank Correlation)
- Pearson Correlation
- RMSE (Root Mean Square Error)
- Statistical methodology: Not explicitly reported in the paper
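As a rough illustration of how the three metrics above relate predicted workflow scores to measured ones, here is a minimal standard-library sketch. The function names and the sample scores are hypothetical and are not taken from the paper. The Kendall variant shown is tau-a, which assumes no tied scores; the paper does not state which variant it uses.

```python
import math
from itertools import combinations


def kendall_tau(pred, true):
    """Kendall's tau-a: (concordant - discordant) / total pairs. Assumes no ties."""
    concordant = discordant = 0
    for i, j in combinations(range(len(pred)), 2):
        sign = (pred[i] - pred[j]) * (true[i] - true[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(pred) * (len(pred) - 1) // 2
    return (concordant - discordant) / n_pairs


def pearson(pred, true):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(pred)
    mean_p, mean_t = sum(pred) / n, sum(true) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, true))
    spread_p = math.sqrt(sum((p - mean_p) ** 2 for p in pred))
    spread_t = math.sqrt(sum((t - mean_t) ** 2 for t in true))
    return cov / (spread_p * spread_t)


def rmse(pred, true):
    """Root mean square error between predicted and measured scores."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)) / len(pred))


# Hypothetical scores for four workflows (illustrative values only).
predicted = [0.9, 0.4, 0.7, 0.2]
measured = [0.8, 0.5, 0.6, 0.1]
print(kendall_tau(predicted, measured), pearson(predicted, measured), rmse(predicted, measured))
```

In this toy example the predictor orders all four workflows exactly as the measurements do, so Kendall's Tau is 1.0 even though RMSE is nonzero. This is why a rank metric and an error metric are reported side by side: a predictor can rank well while still being miscalibrated in absolute terms.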
Key Results
Averaged prediction accuracy results across three domains demonstrate the superiority of the Multi-View approach.

| Benchmark | Metric | Baseline | This Paper | Δ |
| --- | --- | --- | --- | --- |
| Average across 3 domains | Prediction Accuracy (Correlation) | Not reported as single aggregate | Not reported as single aggregate | - |
| Average across 3 domains | Workflow Utility | Not reported as single aggregate | Not reported as single aggregate | - |
Main Takeaways
- Multi-view encoding (Graph + Code + Prompt) consistently outperforms single-view graph baselines.
- Cross-domain unsupervised pretraining significantly improves performance when labeled data is scarce.
- The predictor effectively ranks workflows, enabling efficient search without exhaustive evaluation.
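The last takeaway can be pictured with a small sketch of predictor-guided search: rank all candidates with the cheap predictor, then run the expensive evaluation only on a shortlist. This section does not describe the paper's actual search procedure, so the function `predictor_guided_search`, its parameters, and the top-k strategy are all assumptions made for illustration.

```python
from typing import Callable, Sequence, Tuple, TypeVar

W = TypeVar("W")  # a workflow, in whatever representation the caller uses


def predictor_guided_search(
    candidates: Sequence[W],
    predict: Callable[[W], float],   # cheap: learned performance predictor (hypothetical)
    evaluate: Callable[[W], float],  # expensive: actually executes the workflow (hypothetical)
    k: int,
) -> Tuple[W, float]:
    """Evaluate only the k candidates the predictor ranks highest; return the best one."""
    if k < 1:
        raise ValueError("k must be at least 1")
    # Rank every candidate by predicted score, highest first, and keep the top k.
    shortlist = sorted(candidates, key=predict, reverse=True)[:k]
    # Spend the evaluation budget only on the shortlist.
    results = [(workflow, evaluate(workflow)) for workflow in shortlist]
    return max(results, key=lambda pair: pair[1])
```

With n candidates this costs n predictor calls but only k real evaluations. The quality of the result depends on how well the predictor ranks workflows, which is what a rank correlation such as Kendall's Tau measures.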