Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows?

📝 Paper Summary

Multimodal Agents, Data Science Benchmarks, GUI Automation

Spider2-V is a benchmark evaluating multimodal agents on professional data science workflows, revealing that current SOTA models achieve only 14% success due to complex GUI and coding requirements.

Core Problem

Existing benchmarks evaluate agents on simple API-based or everyday tasks and fail to capture the complexity of real-world data science workflows, which require orchestrating enterprise software through both code and GUI.

Why it matters:

Data workflows rely on complex enterprise tools (BigQuery, Airbyte) where API access alone is often insufficient and GUI interaction is required
Current benchmarks ignore the 'data engineering' and 'data orchestration' stages, focusing only on analysis or coding in isolation
Automating these workflows could democratize large-scale data analysis, but current models struggle with the necessary fine-grained visual grounding

Concrete Example: A task requires saving the top 20 dramatic movies from a Snowflake database to a CSV file. The agent must navigate the Snowflake web GUI to create a worksheet, write and execute SQL, and then use the OS GUI to rename the downloaded file. Current agents fail on this kind of hybrid workflow.

Key Novelty

Full-Stack Data Science Agent Benchmark in Executable OS

Constructs 494 real-world tasks covering the entire data pipeline (warehousing, ingestion, transformation, orchestration) within a live Virtual Machine
Integrates 20 professional enterprise applications (e.g., dbt, Airflow, Dagster) requiring authentic interaction rather than mock APIs
Combines visual GUI control (clicking, dragging) with text-based coding (SQL, Python) in a single evaluation framework

Architecture

The annotation and task construction pipeline, illustrating how tasks are derived from official tutorials and implemented in the VM.

Evaluation Highlights

State-of-the-art VLM (GPT-4V) achieves only 14.0% success rate across the full benchmark
Performance on 'Hard' tasks (requiring >15 steps) drops to 1.2% for GPT-4o, indicating an inability to handle long-horizon workflows
Providing verbose, step-by-step instructions only improves GPT-4o performance from 11.3% to 16.2%, showing that execution grounding remains a bottleneck

Breakthrough Assessment

9/10

A highly rigorous, realistic benchmark that exposes a massive gap in current agent capabilities. The integration of enterprise tools and executable environments sets a new standard for data agent evaluation.

⚙️ Technical Details

Problem Definition

Setting: Partially Observable Markov Decision Process (POMDP) in a desktop OS environment

Inputs: Natural language instruction, current screenshot, accessibility tree (a11y tree)

Outputs: Executable actions (CLICK(x,y), TYPE(text), etc.)
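
For readers who want the setting spelled out, a POMDP is conventionally written as a tuple. The formulation below is the standard textbook one mapped onto the inputs and outputs listed above. The symbol names, the deterministic transition function, and the history length k are assumptions for illustration, and the paper's own notation may differ.

```latex
\[
\mathcal{M} = \langle \mathcal{S}, \mathcal{O}, \mathcal{A}, \mathcal{T}, \mathcal{R} \rangle,
\qquad
\mathcal{T} : \mathcal{S} \times \mathcal{A} \to \mathcal{S},
\qquad
\mathcal{R} : \mathcal{S} \times \mathcal{A} \to \mathbb{R}
\]
\[
o_t = (\text{screenshot}_t,\ \text{a11y tree}_t) \in \mathcal{O},
\qquad
a_t = \pi(\text{instruction},\ o_t,\ h_t) \in \mathcal{A},
\qquad
h_t = \big( (o_{t-k}, a_{t-k}), \dots, (o_{t-1}, a_{t-1}) \big)
\]
```

Here the hidden state in S is the full state of the desktop and its applications. The agent never sees it directly and acts only on the observation o_t and a short history h_t.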

Pipeline Flow

Observation (Screenshot + A11y Tree)
Reasoning (VLM predicts next action)
Execution (PyAutoGUI performs action on VM)
Evaluation (Check final state)
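
The four stages above can be sketched as a simple loop. This is a minimal illustration, not the benchmark's actual harness: `ToyEnv`, `run_episode`, the `"DONE"` sentinel, and the policy signature are hypothetical names. The defaults of 15 turns and a 3-step history mirror the hyperparameters listed in this summary, and treating the history window as a count of past (observation, action) pairs is an assumption.

```python
from collections import deque
from typing import Callable, Deque, Dict, List, Tuple

Observation = Dict[str, object]  # assumed shape: screenshot bytes + a11y tree text
Step = Tuple[Observation, str]
Policy = Callable[[str, Observation, List[Step]], str]


class ToyEnv:
    """Hypothetical stand-in for the VM: it only counts executed actions."""

    def __init__(self, actions_needed: int) -> None:
        self.actions_needed = actions_needed
        self.actions_seen = 0

    def observe(self) -> Observation:
        return {"screenshot": b"", "a11y_tree": f"step={self.actions_seen}"}

    def execute(self, action: str) -> None:
        self.actions_seen += 1

    def evaluate(self) -> bool:
        # Final-state check: did the task receive enough actions?
        return self.actions_seen >= self.actions_needed


def run_episode(
    env: ToyEnv,
    policy: Policy,
    instruction: str,
    max_turns: int = 15,
    history_window: int = 3,
) -> bool:
    """Observe, reason, execute; then evaluate the final state once."""
    history: Deque[Step] = deque(maxlen=history_window)  # keeps only recent steps
    for _ in range(max_turns):
        obs = env.observe()                               # Observation
        action = policy(instruction, obs, list(history))  # Reasoning
        if action == "DONE":                              # assumed stop signal
            break
        env.execute(action)                               # Execution
        history.append((obs, action))
    return env.evaluate()                                 # Evaluation
```

With a policy that always clicks, a task needing 5 actions succeeds within the turn cap, while one needing 20 cannot, which is the kind of failure a fixed turn budget produces on long-horizon tasks.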

System Modules

Environment

Host the virtual machine with 20 installed enterprise applications and internet access

Model or implementation: Ubuntu VM (based on OSWorld)

Agent (Baseline)

Process observations and generate actions

Model or implementation: GPT-4V / GPT-4o / Claude-3 (via API)
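
Before an action can be executed, the model's text reply has to be turned into a structured command. The sketch below parses the two action forms mentioned in this summary. The grammar (integer coordinates, a double-quoted text argument) and the names `Action` and `parse_action` are assumptions for illustration; the benchmark's real action space is larger and may be formatted differently.

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Action:
    kind: str
    point: Optional[Tuple[int, int]] = None
    text: Optional[str] = None


# Assumed surface syntax: CLICK(x, y) and TYPE("some text")
_CLICK = re.compile(r"^CLICK\(\s*(\d+)\s*,\s*(\d+)\s*\)$")
_TYPE = re.compile(r'^TYPE\(\s*"(.*)"\s*\)$', re.DOTALL)


def parse_action(raw: str) -> Action:
    """Parse one action string; raise ValueError on anything unrecognized."""
    raw = raw.strip()
    match = _CLICK.match(raw)
    if match:
        return Action("CLICK", point=(int(match.group(1)), int(match.group(2))))
    match = _TYPE.match(raw)
    if match:
        return Action("TYPE", text=match.group(1))
    raise ValueError(f"unrecognized action: {raw!r}")
```

Rejecting malformed output explicitly, rather than guessing, lets a harness feed the error back to the model as execution feedback.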

Novel Architectural Elements

Integration of 20 professional enterprise data tools (e.g., Snowflake, Tableau) into a unified executable evaluation environment
Hybrid evaluation metrics combining file-based comparison, information-based validation (via API), and execution-based verification (running scripts)
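
As an illustration of the file-based part of such a check, the sketch below compares a CSV produced by the agent against an expected CSV. It is not the benchmark's checker: the function names, the whitespace stripping, and the `ordered` switch are assumptions chosen to show the idea.

```python
import csv
from pathlib import Path
from typing import List, Tuple

Row = Tuple[str, ...]


def read_rows(path: Path) -> List[Row]:
    """Read a CSV file into a list of tuples, stripping cell whitespace."""
    with path.open(newline="") as handle:
        return [tuple(cell.strip() for cell in row) for row in csv.reader(handle)]


def csv_files_match(result: Path, expected: Path, ordered: bool = True) -> bool:
    """Return True if both files have the same header and the same data rows.

    A missing or empty result file counts as a failure.
    """
    if not result.exists():
        return False
    got, want = read_rows(result), read_rows(expected)
    if not got or not want or got[0] != want[0]:
        return False
    if ordered:
        return got[1:] == want[1:]
    return sorted(got[1:]) == sorted(want[1:])
```

Whether row order matters depends on the task: a "top 20" export is order-sensitive, while a plain table dump usually is not.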

Modeling

Base Model: GPT-4V (preview), GPT-4o, Gemini-1.5-Pro, Claude-3-Opus

Training Method: Zero-shot prompting with RAG and Execution Feedback

Adaptation: None (Prompt engineering only)

Trainable Parameters: 0 (Inference only)

Key Hyperparameters:

temperature: 0.5
top_p: 0.9
history_window_size: 3
max_interaction_turns: 15

Compute: Not reported in the paper

Comparison to Prior Work

vs. OSWorld: Spider2-V focuses on domain-specific enterprise data software (Airflow, dbt) requiring specialized knowledge vs. general OS tasks
vs. Spider: Spider2-V involves full workflows (ingestion, visualization) and GUI interactions, not just SQL query generation
vs. DS-1000: Spider2-V includes cloud-hosted services and multi-app orchestration, whereas DS-1000 is limited to local Python libraries
vs. WorkArena [not cited in paper]: WorkArena focuses on ServiceNow specifically, while Spider2-V spans the entire data engineering stack (20+ apps)

Limitations

Heavy reliance on proprietary, closed-source models (GPT-4) for any meaningful performance
Interaction latency with cloud-hosted services can cause instability in agent evaluation
34% of tasks require authentic user accounts, which makes the benchmark harder to reproduce in full
Maximum trajectory length is heuristically capped at 15 turns, which may be too short for complex orchestration

Reproducibility

Code: https://spider2-v.github.io

Code and data are available at https://spider2-v.github.io. 34% of tasks require authentic user accounts (e.g., Snowflake, BigQuery), which may complicate reproduction due to cost or access requirements. A document warehouse of 11,231 documents is provided.

📊 Experiments & Results

Evaluation Setup

Agents interact with a live Ubuntu VM to complete data tasks. Success is binary (0/1).
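
Because each task is scored 0 or 1, the success rate is simply the mean outcome expressed as a percentage. The sketch below shows that computation, overall and per task group. The function names and group labels are hypothetical, and the sample values in the usage note are toy data, not results from the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def success_rate(outcomes: Iterable[int]) -> float:
    """Percentage of tasks with outcome 1; outcomes must be 0 or 1."""
    values = list(outcomes)
    if not values:
        raise ValueError("no outcomes to aggregate")
    if any(v not in (0, 1) for v in values):
        raise ValueError("outcomes must be binary")
    return 100.0 * sum(values) / len(values)


def success_rate_by_group(results: Iterable[Tuple[str, int]]) -> Dict[str, float]:
    """Aggregate (group, outcome) pairs into a per-group success rate."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for group, outcome in results:
        groups[group].append(outcome)
    return {group: success_rate(values) for group, values in groups.items()}
```

For example, `success_rate([1, 0, 0, 0])` gives 25.0, and grouping lets the same outcomes be split by difficulty or by whether a task needs an account.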

Benchmarks:

Spider2-V (Multimodal Data Science Agent Tasks) [New]

Metrics:

Success Rate (SR)
Statistical methodology: Not explicitly reported in the paper

Key Results

State-of-the-art closed-source models significantly outperform open-source models, but overall performance remains low.

Benchmark	Metric	Baseline	This Paper	Δ
Spider2-V	Success Rate (%)	2.0	14.0	+12.0
Spider2-V	Success Rate (%)	9.1	14.0	+4.9

Ablation studies reveal that detailed instructions and easier task definitions improve performance, but hard tasks remain unsolved.

Benchmark	Metric	Baseline	This Paper	Δ
Spider2-V (GPT-4o)	Success Rate (%)	11.3	16.2	+4.9
Spider2-V (GPT-4o)	Success Rate (%)	1.2	38.8	+37.6
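
The Δ column is the difference between the two success-rate columns in percentage points. The short check below recomputes it from the table values; the row tuples are copied from the table, and the helper names are hypothetical.

```python
from typing import List, Tuple

# (benchmark label, baseline %, compared %) as listed in the table above
ROWS: List[Tuple[str, float, float]] = [
    ("Spider2-V", 2.0, 14.0),
    ("Spider2-V", 9.1, 14.0),
    ("Spider2-V (GPT-4o)", 11.3, 16.2),
    ("Spider2-V (GPT-4o)", 1.2, 38.8),
]


def delta(baseline: float, compared: float) -> float:
    """Difference in percentage points, rounded to one decimal place."""
    return round(compared - baseline, 1)


def format_delta(value: float) -> str:
    """Render with an explicit sign, matching the table's style."""
    return f"{value:+.1f}"


DELTAS = [format_delta(delta(base, comp)) for _, base, comp in ROWS]
```

Rounding to one decimal avoids floating-point noise such as 4.899999 when subtracting values like 16.2 and 11.3.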

Experiment Figures

Examples of multimodal agent tasks in Spider2-V, showing the mix of GUI and Code.

Main Takeaways

Current multimodal agents are far from automating professional data workflows, with the best model achieving only 14% success.
Tasks involving authentic user accounts (cloud services) are significantly harder (10.6% success) than local tasks (15.6%), likely due to network latency and complex UIs.
Visual modality is critical; agents using only text-based accessibility trees (a11y tree) perform poorly compared to those using Set-of-Mark on screenshots.
Fine-grained GUI control (e.g., drag-and-drop in Excel) is a major failure mode compared to pure code generation.
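
To make the Set-of-Mark idea concrete, the sketch below numbers UI element boxes and resolves a chosen number back to a click position. It only covers the bookkeeping, not drawing the overlay. The `Box` type, the reading-order numbering, and the function names are assumptions for illustration rather than the paper's implementation.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Box:
    left: int
    top: int
    width: int
    height: int

    def center(self) -> Tuple[int, int]:
        return (self.left + self.width // 2, self.top + self.height // 2)


def assign_marks(boxes: List[Box]) -> Dict[int, Box]:
    """Number boxes from 1 in reading order (top to bottom, then left to right)."""
    ordered = sorted(boxes, key=lambda box: (box.top, box.left))
    return {index: box for index, box in enumerate(ordered, start=1)}


def mark_to_click(marks: Dict[int, Box], index: int) -> Tuple[int, int]:
    """Translate a mark number chosen by the model into screen coordinates."""
    if index not in marks:
        raise KeyError(f"no element with mark {index}")
    return marks[index].center()
```

The model then only has to name a mark number, and the harness converts it to coordinates, which sidesteps the need to predict exact pixel positions.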

📚 Prerequisite Knowledge

Prerequisites

Understanding of multimodal agents (VLMs)
Familiarity with data engineering pipelines (ETL/ELT)
Basic knowledge of GUI automation techniques

Key Terms

GUI: Graphical User Interface—visual components like buttons and windows that users interact with

CLI: Command Line Interface—text-based interface for executing commands

POMDP: Partially Observable Markov Decision Process—a mathematical framework for modeling decision-making where the agent cannot see the full state of the world

a11y tree: Accessibility Tree—a structured text representation of UI elements (buttons, inputs) and their properties, used by screen readers and agents

VLM: Vision Language Model—AI models that can process both image and text inputs

DAG: Directed Acyclic Graph—a conceptual representation of a workflow where data flows in one direction without loops, common in tools like Airflow

Set-of-Mark: A prompting technique in which UI elements on a screenshot are overlaid with numbered bounding boxes, so the model can refer to an element by its number instead of predicting raw pixel coordinates

dbt: data build tool—a framework for transforming data in warehouses using SQL

Airbyte: An open-source data integration platform for moving data from sources to destinations

Airflow: A platform to programmatically author, schedule, and monitor workflows
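
Orchestrators such as Airflow model a workflow as a DAG of tasks, as defined in the DAG entry above. The sketch below shows the core idea using only Python's standard-library `graphlib` (Python 3.9+), not the Airflow API. The task names and the `execution_order` helper are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Dict, List, Set


def execution_order(dependencies: Dict[str, Set[str]]) -> List[str]:
    """Return a valid run order; each key maps a task to the tasks it depends on.

    Raises graphlib.CycleError if the dependencies contain a loop,
    i.e. the graph is not a DAG.
    """
    return list(TopologicalSorter(dependencies).static_order())


# A linear extract -> transform -> load pipeline (illustrative task names)
PIPELINE: Dict[str, Set[str]] = {
    "transform": {"extract"},
    "load": {"transform"},
}

ORDER = execution_order(PIPELINE)
```

A dependency loop makes scheduling impossible, which is why the "acyclic" part of the definition matters: `execution_order({"a": {"b"}, "b": {"a"}})` raises `CycleError`.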