← Back to Paper List

SPIDER 2.0: EVALUATING LANGUAGE MODELS ON REAL-WORLD ENTERPRISE TEXT-TO-SQL WORK

Unknown authors
University of Hong Kong, Salesforce Research, Sea AI Lab, Google Deepmind, Google Cloud AI Research, University of Waterloo
Agent Benchmark

📝 Paper Summary

Text-to-SQL Benchmarking Enterprise Data Workflows
Spider 2.0 benchmarks language model agents on real-world enterprise SQL workflows involving massive schemas, diverse dialects, and project-level codebases, revealing severe limitations in current SOTA models.
Core Problem
Existing text-to-SQL benchmarks rely on small, simplified databases with uniform SQL dialects, failing to capture the complexity of enterprise environments that involve massive schemas, diverse systems (BigQuery, Snowflake), and project-level dependencies.
Why it matters:
  • Current LLMs achieve >90% on academic benchmarks like Spider 1.0 but fail in real industrial settings
  • Enterprise data is stored across diverse systems (cloud/local) requiring dialect-specific knowledge
  • Real-world queries often span >100 lines and require reasoning over thousands of columns and external documentation
Concrete Example: A user asks for a daily sales report. In Spider 1.0, this is a simple SELECT on one table. In Spider 2.0, the agent must navigate a Salesforce database with >1,000 columns, check `schema.yml` to understand column definitions, use dialect-specific functions like `DATE_TRUNC`, and join multiple tables while respecting project-defined macros.
Key Novelty
Enterprise-Grade Agentic SQL Benchmark
  • Shift from 'text-to-SQL' translation to 'SQL agents' that must explore file systems, read documentation, and interact with databases
  • Incorporates real-world scale: databases with 1000+ columns, nested JSON schemas, and 7 distinct SQL dialects (BigQuery, Snowflake, DuckDB, etc.)
  • Includes 'Spider 2.0-lite' for traditional parsing evaluation and full 'Spider 2.0' for agentic workflows involving file manipulation and iterative execution
Architecture
Architecture Figure Figure 1
Conceptual diagram of the Spider 2.0 evaluation framework contrasting simplified text-to-SQL with real-world enterprise workflows.
Evaluation Highlights
  • o1-preview (SOTA) solves only 21.3% of agentic tasks, compared to 91.2% on Spider 1.0
  • Traditional text-to-SQL methods (DAIL-SQL + GPT-4o) achieve only 5.68% execution accuracy on the Lite subset
  • Performance drops significantly on nested schemas (10.3% success) compared to flat schemas (27.4%)
Breakthrough Assessment
9/10
A definitive reality check for the field. By moving from toy databases to massive enterprise schemas and diverse dialects, it exposes the vast gap between academic success and industrial utility.
×