
Trustworthy AI in the Agentic Lakehouse: from Concurrency to Governance

Jacopo Tagliabue, Federico Bianchi, Ciro Greco
Bauplan
arXiv (2025)
Agent Reasoning

📝 Paper Summary

AI Infrastructure · Agent Safety & Governance
Reliable AI agents require a new lakehouse architecture that enforces data consistency through database-style isolation primitives—branching, atomic merges, and sandboxed compute—rather than relying on model intelligence.
Core Problem
Traditional data lakehouses, designed for human teams, lack the isolation mechanisms needed for swarms of concurrent, untrusted AI agents, leading to race conditions and data corruption.
Why it matters:
  • Agents operating on today's lakehouses can irreversibly corrupt production data (e.g., by dropping tables or polluting data with hallucinated values) because writes have no transactional boundaries
  • Standard orchestrators (e.g., Airflow) manage compute but treat data writes as side effects, making it impossible to roll back partial failures in multi-step agent workflows
  • Governance is currently manual and scattered across tools; scaling to autonomous agents requires programmatic, API-enforced safety guarantees
Concrete Example: Consider a 3-step pipeline that updates tables A, B, and C in order. If an agent's code fails at step 3, a standard lakehouse leaves tables A and B updated but C stale, which is an inconsistent state for downstream users. The proposed system instead isolates the entire run in a temporary branch, so the failure leaves production data untouched.
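The three-table scenario can be sketched with a toy in-memory catalog. This is only an illustration under stated assumptions: `ToyCatalog`, `run_on_branch`, and the step functions are hypothetical names invented here, not Bauplan's API, and a deep copy plus a single reference swap stands in for the branch and the atomic merge.

```python
import copy


class ToyCatalog:
    """In-memory stand-in for a lakehouse catalog (hypothetical, not Bauplan's API)."""

    def __init__(self, tables):
        self.main = dict(tables)  # "production" state: table name -> contents

    def run_on_branch(self, steps):
        """Run all steps on a private copy; publish only if every step succeeds."""
        branch = copy.deepcopy(self.main)  # temporary branch
        try:
            for step in steps:
                step(branch)
        except Exception:
            return False  # branch is discarded, main is untouched
        self.main = branch  # one reference swap stands in for the atomic merge
        return True


def update_a(tables):
    tables["A"] = "new"


def update_b(tables):
    tables["B"] = "new"


def failing_update_c(tables):
    raise RuntimeError("step 3 failed")


def update_c(tables):
    tables["C"] = "new"


catalog = ToyCatalog({"A": "old", "B": "old", "C": "old"})

# Step 3 fails: A and B were changed only on the branch, never on main.
failed_ok = catalog.run_on_branch([update_a, update_b, failing_update_c])
state_after_failure = dict(catalog.main)

# A later run in which all three steps succeed is merged as a whole.
retry_ok = catalog.run_on_branch([update_a, update_b, update_c])
state_after_retry = dict(catalog.main)
```

In this sketch, downstream readers of `catalog.main` see either all three tables old or all three new, never the mixed state described in the example.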
Key Novelty
Bauplan (The Agentic Lakehouse)
  • Re-implements Multi-Version Concurrency Control (MVCC) for distributed lakehouses by treating data tables like Git repositories: every agent workflow runs on a temporary branch that is atomically merged only upon success
  • Unifies compute and storage isolation via a serverless Function-as-a-Service (FaaS) model where input/output tables are declared explicitly, allowing the platform to enforce strict access control
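The idea of declaring input and output tables so the platform can enforce access can be sketched as follows. This is a minimal illustration, not the paper's implementation: `ScopedTables`, `run_function`, `AccessError`, and the table names are all hypothetical, and a plain dictionary stands in for storage.

```python
class AccessError(Exception):
    """Raised when a function touches a table it did not declare."""


class ScopedTables:
    """Handle that exposes only the tables a function declared up front."""

    def __init__(self, store, inputs, outputs):
        self._store = store
        self._inputs = frozenset(inputs)
        self._outputs = frozenset(outputs)

    def read(self, name):
        if name not in self._inputs:
            raise AccessError(f"read of undeclared input: {name}")
        return self._store[name]

    def write(self, name, rows):
        if name not in self._outputs:
            raise AccessError(f"write to undeclared output: {name}")
        self._store[name] = rows


def run_function(store, fn, inputs, outputs):
    # The platform, not the function, decides which tables are reachable.
    fn(ScopedTables(store, inputs, outputs))


def summarize(tables):
    orders = tables.read("orders")
    tables.write("order_count", len(orders))


def rogue(tables):
    tables.write("users", [])  # "users" was never declared as an output


store = {"orders": [10, 20, 30], "users": ["alice"]}
run_function(store, summarize, inputs=["orders"], outputs=["order_count"])

try:
    run_function(store, rogue, inputs=["orders"], outputs=["order_count"])
    blocked = False
except AccessError:
    blocked = True
```

Because the function only ever receives the scoped handle, an undeclared write fails before it reaches storage, which is the property the explicit declarations are meant to provide.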
Architecture
Figure 3: A self-healing pipeline workflow using the Agentic Lakehouse architecture
Breakthrough Assessment
8/10
A strong position paper proposing a fundamental architectural shift. While it lacks empirical ML benchmarks, it offers a necessary infrastructure primitive (Git-like semantics for data) to make agentic engineering viable.