SPIDER 2.0: EVALUATING LANGUAGE MODELS ON REAL-WORLD ENTERPRISE TEXT-TO-SQL WORKFLOWS

📝 Paper Summary

Text-to-SQL Benchmarking · Enterprise Data Workflows

Spider 2.0 benchmarks language model agents on real-world enterprise SQL workflows involving massive schemas, diverse dialects, and project-level codebases, revealing severe limitations in current SOTA models.

Core Problem

Existing text-to-SQL benchmarks rely on small, simplified databases with uniform SQL dialects, failing to capture the complexity of enterprise environments that involve massive schemas, diverse systems (BigQuery, Snowflake), and project-level dependencies.

Why it matters:

Current LLMs achieve >90% on academic benchmarks like Spider 1.0 but fail in real industrial settings
Enterprise data is stored across diverse systems (cloud/local) requiring dialect-specific knowledge
Real-world queries often span >100 lines and require reasoning over thousands of columns and external documentation

Concrete Example: A user asks for a daily sales report. In Spider 1.0, this is a simple SELECT on one table. In Spider 2.0, the agent must navigate a Salesforce database with >1,000 columns, check `schema.yml` to understand column definitions, use dialect-specific functions like `DATE_TRUNC`, and join multiple tables while respecting project-defined macros.

Key Novelty

Enterprise-Grade Agentic SQL Benchmark

Shift from 'text-to-SQL' translation to 'SQL agents' that must explore file systems, read documentation, and interact with databases
Incorporates real-world scale: databases with 1000+ columns, nested JSON schemas, and 7 distinct SQL dialects (BigQuery, Snowflake, DuckDB, etc.)
Includes 'Spider 2.0-lite' for traditional parsing evaluation and full 'Spider 2.0' for agentic workflows involving file manipulation and iterative execution

Architecture

Conceptual diagram of the Spider 2.0 evaluation framework contrasting simplified text-to-SQL with real-world enterprise workflows.

Evaluation Highlights

o1-preview (SOTA) solves only 21.3% of agentic tasks, compared to 91.2% on Spider 1.0
Traditional text-to-SQL methods (DAIL-SQL + GPT-4o) achieve only 5.68% execution accuracy on the Lite subset
Performance drops significantly on nested schemas (10.3% success) compared to flat schemas (27.4%)

Breakthrough Assessment

9/10

A definitive reality check for the field. By moving from toy databases to massive enterprise schemas and diverse dialects, it exposes the vast gap between academic success and industrial utility.

⚙️ Technical Details

Problem Definition

Setting (agentic SQL workflow): given a question Q, a database interface I, and a codebase C, the agent iteratively modifies code based on execution observations until it obtains the final result A.

Inputs: Natural language question Q, Database Interface I (BigQuery/Snowflake/etc.), Project Codebase C (files, config, docs)

Outputs: Final result A (text/table/database) obtained via execution
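
The loop described in this problem definition can be sketched as follows. This is a minimal illustration, not the paper's Spider-Agent: `run_agent`, `scripted_policy`, and the `sales` table are hypothetical names, an in-memory SQLite connection stands in for the database interface I, a small dict stands in for the codebase C, and a scripted function stands in for the LLM.

```python
import sqlite3
from typing import Callable, Dict, List, Optional, Tuple

# A policy maps (question, codebase, observations so far) to the next SQL attempt.
Policy = Callable[[str, Dict[str, str], List[str]], str]


def run_agent(question: str,
              interface: sqlite3.Connection,
              codebase: Dict[str, str],
              policy: Policy,
              max_steps: int = 5) -> Tuple[Optional[list], List[str]]:
    """Propose SQL, execute it, and feed errors back until a result is obtained."""
    observations: List[str] = []
    for _ in range(max_steps):
        sql = policy(question, codebase, observations)
        try:
            answer = interface.execute(sql).fetchall()
        except sqlite3.Error as exc:
            # The execution error becomes the observation for the next attempt.
            observations.append(f"error: {exc}")
            continue
        return answer, observations
    return None, observations


def scripted_policy(question: str, codebase: Dict[str, str],
                    observations: List[str]) -> str:
    """Hypothetical stand-in for an LLM: guess first, then consult the docs."""
    if not observations:
        return "SELECT SUM(amount) FROM sales"  # wrong column name
    # After an error, read the documented column name from the codebase.
    column = codebase["schema.yml"].split(":")[0].split(".")[1]
    return f"SELECT SUM({column}) FROM sales"


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (day TEXT, total_amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("2024-01-01", 10.0), ("2024-01-02", 5.5)])
docs = {"schema.yml": "sales.total_amount: order value in USD"}

answer, trace = run_agent("What is the total sales amount?", conn, docs,
                          scripted_policy)
print(answer, trace)
```

The first attempt fails on the unknown column, the error is recorded as an observation, and the second attempt succeeds after consulting the documentation entry.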

Pipeline Flow

Group 1: Agentic Workflow (Spider 2.0)
Group 2: Text-to-SQL Workflow (Spider 2.0-lite/snow)

System Modules

Spider-Agent

Autonomous agent that navigates codebase and database to solve tasks

Model or implementation: Various LLMs (e.g., o1-preview, GPT-4o)

Text-to-SQL Parser

Generates a single SQL query from the inputs, without iterative execution feedback

Model or implementation: Baseline methods (DIN-SQL, DAIL-SQL, etc.)

Novel Architectural Elements

Integration of full project codebases (DBT projects, YAML configs) as context for SQL generation
Multi-dialect environment requiring agents to adapt syntax for 7 different database systems dynamically
Evaluation pipeline supporting outcome verification across string, table, and database file formats

Comparison to Prior Work

vs. Spider 1.0: Spider 2.0 introduces 7 dialects (vs 1), 800+ avg columns (vs 27), and codebase context
vs. BIRD: Spider 2.0 requires understanding project files (DBT) and external docs, not just database content
vs. InterCode: Spider 2.0 focuses specifically on enterprise SQL workflows and massive schemas

Limitations

Access to commercial cloud databases (BigQuery, Snowflake) requires setup/credits, potentially hindering universal reproducibility
Evaluation scripts for table results allow some flexibility (ignoring order/extra columns), which might mask specific ordering errors
Agent performance is extremely low, making it difficult to analyze fine-grained model differences beyond 'failure'
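
To make the second limitation concrete, here is a small sketch of a lenient table comparison that ignores row order and tolerates extra predicted columns. It is an assumption about what such a check could look like, not the benchmark's actual evaluation script; `tables_match` is a hypothetical helper.

```python
from collections import Counter
from typing import List, Sequence


def tables_match(pred: List[Sequence], gold: List[Sequence]) -> bool:
    """Lenient check: every gold column must appear among the predicted
    columns as the same multiset of values. Row order is ignored and extra
    predicted columns are allowed."""
    if len(pred) != len(gold):
        return False
    n_pred_cols = len(pred[0]) if pred else 0
    n_gold_cols = len(gold[0]) if gold else 0
    pred_columns = [Counter(row[i] for row in pred) for i in range(n_pred_cols)]
    for j in range(n_gold_cols):
        gold_column = Counter(row[j] for row in gold)
        if gold_column not in pred_columns:
            return False
    return True


gold = [(1, "a"), (2, "b")]
pred_reordered = [("b", 2, "extra"), ("a", 1, "extra")]  # reversed rows, extra column
pred_wrong = [(1, "a"), (2, "c")]                         # one wrong value

print(tables_match(pred_reordered, gold))  # True
print(tables_match(pred_wrong, gold))      # False
```

Because the rows of `pred_reordered` are reversed and it still passes, a query with a wrong `ORDER BY` would go undetected under a check of this kind, which is the masking effect the limitation describes.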

Reproducibility

Code: https://spider2-sql.github.io

The release is publicly available (https://spider2-sql.github.io) and includes the dataset, baseline models, and evaluation scripts. Some commercial database environments (BigQuery/Snowflake) require credentials and setup, but the Lite/Snow versions are self-contained.

📊 Experiments & Results

Evaluation Setup

Two distinct settings: Agentic (Spider 2.0) allowing code execution/exploration, and Static (Spider 2.0-lite/snow) strictly for text-to-SQL generation.

Benchmarks:

Spider 2.0 (Agentic SQL Workflow (Data Engineering)) [New]
Spider 2.0-lite (Text-to-SQL Generation (Multi-dialect)) [New]
Spider 2.0-snow (Text-to-SQL Generation (Snowflake only)) [New]

Metrics:

Success Rate (SR)
Execution Accuracy (EX)
Statistical methodology: Not explicitly reported in the paper

Key Results

Agentic evaluation shows that even the most capable reasoning models struggle significantly with real-world enterprise workflows.
Benchmark	Metric	Baseline (%)	This Paper (%)	Δ (pts)
Spider 2.0	Success Rate (SR)	2.53	21.36	+18.83
Spider 2.0	Success Rate (SR)	12.34	21.36	+9.02

Text-to-SQL specific evaluation (Lite) reveals that current specialized prompting methods fail on enterprise schemas.
Benchmark	Metric	Baseline (%)	This Paper (%)	Δ (pts)
Spider 2.0-lite	Execution Accuracy (EX)	0.73	5.68	+4.95
Spider 2.0-lite	Execution Accuracy (EX)	1.46	5.68	+4.22

Ablation on task difficulty shows nested schemas and external documentation are major failure points.
Benchmark	Metric	Baseline (%)	This Paper (%)	Δ (pts)
Spider 2.0 (Non-DBT subset)	Success Rate (SR)	10.34	27.38	+17.04
Spider 2.0 (Non-DBT subset)	Success Rate (SR)	11.54	26.64	+15.10

Experiment Figures

Pie chart showing the distribution of database systems in the benchmark.

Bar chart of error categories based on analysis of 300 examples.

Main Takeaways

Massive performance drop from Spider 1.0 (>90%) to Spider 2.0 (~21%), confirming that academic benchmarks do not reflect enterprise complexity
Nested schemas (JSON/Arrays) and requirement for external documentation are primary bottlenecks for current agents
Specialized Text-to-SQL methods (DIN-SQL, DAIL-SQL) reach less than 6% execution accuracy on the Lite setting due to dialect diversity and schema size
Few-shot prompting provides negligible improvement (e.g., <1% gain) on Spider 2.0-lite, suggesting current in-context learning cannot handle the massive context requirements

📚 Prerequisite Knowledge

Prerequisites

Knowledge of SQL dialects (Standard SQL vs. T-SQL vs. Snowflake)
Understanding of database schemas (tables, columns, foreign keys)
Familiarity with data warehousing concepts (ETL, DBT)

Key Terms

DBT: Data Build Tool—a framework for managing data transformations and analytics engineering code

Schema Linking: The process of mapping natural language terms in a question to specific database tables and columns
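
As a rough illustration of schema linking, the sketch below matches question words against column names by token overlap. Real systems use far stronger signals (embeddings, value matching, LLM prompting); `link_schema` and the toy schema are hypothetical and only show the shape of the task.

```python
import re
from typing import Dict, List, Set, Tuple


def tokens(text: str) -> Set[str]:
    """Lowercase alphanumeric tokens; underscores act as separators."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def link_schema(question: str,
                schema: Dict[str, List[str]]) -> List[Tuple[str, str]]:
    """Return (table, column) pairs whose column name shares a token
    with the question."""
    question_tokens = tokens(question)
    links = []
    for table, columns in schema.items():
        for column in columns:
            if tokens(column) & question_tokens:
                links.append((table, column))
    return links


schema = {
    "orders": ["order_id", "order_date", "customer_id"],
    "payments": ["payment_id", "revenue"],
}
links = link_schema("Show total revenue by date", schema)
print(links)  # [('orders', 'order_date'), ('payments', 'revenue')]
```

With thousands of columns, a naive matcher like this returns many spurious candidates, which is one reason large enterprise schemas are harder than the small databases of earlier benchmarks.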

BigQuery: A fully managed, serverless enterprise data warehouse offered by Google Cloud

Snowflake: A cloud-based data warehousing platform with its own specific SQL dialect

CTE: Common Table Expression—a temporary named result set in SQL used to simplify complex queries
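
A short example of a CTE, run against an in-memory SQLite database through Python's standard `sqlite3` module. The `sales` table and its rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (day TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("2024-01-01", 10.0), ("2024-01-01", 5.0), ("2024-01-02", 7.0)],
)

# The CTE `daily` names an intermediate result (per-day totals) so the
# outer query can filter on it without repeating the aggregation.
query = """
WITH daily AS (
    SELECT day, SUM(amount) AS total
    FROM sales
    GROUP BY day
)
SELECT day, total
FROM daily
WHERE total > 10
ORDER BY day
"""
rows = conn.execute(query).fetchall()
print(rows)  # [('2024-01-01', 15.0)]
```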

Dialect: A specific implementation of SQL (e.g., PostgreSQL, BigQuery Standard SQL) with unique functions and syntax
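
The same operation often needs different syntax per dialect. The sketch below shows month truncation in four systems; the templates reflect the documented syntax of each system to the best of my knowledge and should be checked against the vendor documentation, and `month_trunc` is a hypothetical helper. Only the SQLite form is executed here.

```python
import sqlite3

# Template per dialect for "truncate a date column to the first of its month".
MONTH_TRUNC = {
    "bigquery": "DATE_TRUNC({col}, MONTH)",     # column first, unquoted part
    "snowflake": "DATE_TRUNC('MONTH', {col})",  # quoted part first
    "postgres": "DATE_TRUNC('month', {col})",
    "sqlite": "DATE({col}, 'start of month')",  # no DATE_TRUNC function
}


def month_trunc(dialect: str, col: str) -> str:
    """Render the month-truncation expression for one dialect."""
    return MONTH_TRUNC[dialect].format(col=col)


print(month_trunc("bigquery", "order_date"))   # DATE_TRUNC(order_date, MONTH)
print(month_trunc("snowflake", "order_date"))  # DATE_TRUNC('MONTH', order_date)

# Sanity-check the SQLite variant by running it on a literal date.
conn = sqlite3.connect(":memory:")
expr = month_trunc("sqlite", "'2024-03-15'")
first_of_month = conn.execute("SELECT " + expr).fetchone()[0]
print(first_of_month)  # 2024-03-01
```

A query that is correct in one dialect can therefore be a syntax error in another, even though the intent is identical.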

Nested Schema: Database columns that contain structured data like arrays or JSON objects within a single field
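
To show what a nested field implies for querying, here is a plain-Python sketch that flattens an array-valued field into one row per element, roughly analogous to cross-joining a table with an unnested array in a warehouse dialect. The `unnest` helper and the sample records are hypothetical.

```python
from typing import Any, Dict, List


def unnest(records: List[Dict[str, Any]], array_field: str) -> List[Dict[str, Any]]:
    """Emit one flat row per element of `array_field`, copying the parent
    fields onto each row. Records with an empty array produce no rows."""
    flat = []
    for record in records:
        parent = {k: v for k, v in record.items() if k != array_field}
        for item in record.get(array_field, []):
            flat.append({**parent, **item})
    return flat


orders = [
    {"order_id": 1, "items": [{"sku": "A", "qty": 2}, {"sku": "B", "qty": 1}]},
    {"order_id": 2, "items": []},
]
flat_rows = unnest(orders, "items")
print(flat_rows)
# [{'order_id': 1, 'sku': 'A', 'qty': 2}, {'order_id': 1, 'sku': 'B', 'qty': 1}]
```

An agent that treats `items` as an ordinary scalar column cannot aggregate over `qty`; it first has to recognize the nesting and expand it, which is an extra reasoning step flat schemas do not require.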

SFT: Supervised Fine-Tuning—training a model on labeled examples

Gold SQL: The ground-truth SQL query written by human experts to solve a benchmark task

Execution Accuracy (EX): A metric measuring whether the result of the predicted SQL query matches the result of the ground-truth SQL query

Success Rate (SR): The proportion of tasks where the agent's final answer matches the ground truth (used for the agentic setting)
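
The two metrics above can be sketched as follows, assuming a simple order-insensitive comparison of result rows on SQLite. The benchmark's real scripts handle more output formats and are not reproduced here; `execution_match` and `rate` are hypothetical helpers.

```python
import sqlite3
from collections import Counter
from typing import List


def execution_match(conn: sqlite3.Connection, pred_sql: str, gold_sql: str) -> bool:
    """True if the predicted query returns the same multiset of rows as the
    gold query. A predicted query that fails to execute counts as a miss."""
    try:
        pred_rows = conn.execute(pred_sql).fetchall()
    except sqlite3.Error:
        return False
    gold_rows = conn.execute(gold_sql).fetchall()
    return Counter(pred_rows) == Counter(gold_rows)


def rate(outcomes: List[bool]) -> float:
    """Percentage of tasks judged correct (used for both EX and SR)."""
    return 100.0 * sum(outcomes) / len(outcomes)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (3,)])

gold_sql = "SELECT x FROM t"
outcomes = [
    execution_match(conn, "SELECT x FROM t ORDER BY x DESC", gold_sql),  # same rows
    execution_match(conn, "SELECT x FROM t WHERE x > 1", gold_sql),      # missing a row
    execution_match(conn, "SELECT y FROM t", gold_sql),                  # invalid column
    execution_match(conn, "SELECT x FROM t", gold_sql),                  # identical
]
print(outcomes, rate(outcomes))  # [True, False, False, True] 50.0
```

The difference between EX and SR lies in what is compared: EX compares the output of a single predicted query with the gold query's output, while SR compares the agent's final answer, however it was produced, with the ground truth.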