Safe, Untrusted, "Proof-Carrying" AI Agents: toward the agentic lakehouse

📝 Paper Summary

Infrastructure for Agents, Agentic Data Engineering

The paper demonstrates that programmable lakehouses with Git-like branching and declarative environments enable untrusted AI agents to safely repair production data pipelines without human-in-the-loop bottlenecks.

Core Problem

Lakehouses run sensitive workloads that resist automation because they lack safe abstractions for untrusted agents to modify production data without risking corruption or security breaches.

Why it matters:

Data engineers spend significant time fixing broken pipelines, a high-stakes task that is currently hard to automate safely
Current systems lack unified interfaces, requiring agents to navigate heterogeneous tools (SQL editors, Terraform, Docker) rather than a single code-based API
Allowing autonomous agents to write to production storage poses severe trust and correctness risks unless writes are sandboxed and verified

Concrete Example: A data pipeline fails due to a NumPy 2.0 / pandas 2.0 dependency mismatch. An agent attempting to fix this in a traditional setup might accidentally corrupt production tables or introduce malicious code while trying to patch the environment.

Key Novelty

The Programmable, Branch-Based Agentic Lakehouse

Treats the entire data lifecycle (pipelines, environments, infrastructure) as code accessible via APIs, creating a unified interface for agents
Uses 'Git-for-Data' semantics (branch-then-merge) to let agents repair pipelines on isolated data copies (branches), preventing dirty reads in production
Implements a 'proof-carrying' protocol where agents must satisfy a semantic correctness check (verifier function) before their branch is merged
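
To make the branch-then-merge idea and the verifier gate concrete, here is a minimal in-memory sketch. It is not the paper's implementation and not Bauplan's API: `ToyLakehouse`, `create_branch`, `merge`, and `no_negative_amounts` are hypothetical names, and branches are plain deep copies for simplicity.

```python
import copy
from typing import Callable, Dict, List

Table = List[dict]          # a table is a list of row dicts
Branch = Dict[str, Table]   # a branch maps table names to tables


class ToyLakehouse:
    """In-memory stand-in for a branchable lakehouse (hypothetical, not Bauplan's API)."""

    def __init__(self, main: Branch) -> None:
        self.branches: Dict[str, Branch] = {"main": main}

    def create_branch(self, name: str, source: str = "main") -> Branch:
        # Deep copy keeps the toy simple; a real system would avoid copying data.
        self.branches[name] = copy.deepcopy(self.branches[source])
        return self.branches[name]

    def merge(self, name: str, verifier: Callable[[Branch], bool]) -> bool:
        # Gated merge: the branch replaces main only if the verifier accepts it.
        candidate = self.branches[name]
        if not verifier(candidate):
            return False
        self.branches["main"] = copy.deepcopy(candidate)
        return True


def no_negative_amounts(branch: Branch) -> bool:
    # Example semantic check chosen for illustration only.
    return all(row["amount"] >= 0 for row in branch["orders"])


lake = ToyLakehouse({"orders": [{"id": 1, "amount": 10.0}]})

# A faulty agent attempt stays confined to its own branch.
bad = lake.create_branch("agent-attempt-1")
bad["orders"].append({"id": 2, "amount": -5.0})
rejected = lake.merge("agent-attempt-1", no_negative_amounts)

# A correct attempt passes the verifier and is promoted.
good = lake.create_branch("agent-attempt-2")
good["orders"].append({"id": 2, "amount": 5.0})
accepted = lake.merge("agent-attempt-2", no_negative_amounts)

print(rejected, accepted, len(lake.branches["main"]["orders"]))  # False True 2
```

The point of the sketch is that the rejected attempt never touches `main`: the second branch starts from the original single-row table, not from the corrupted one.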

Architecture

The agentic loop workflow interacting with the programmable lakehouse

Evaluation Highlights

Demonstrates fully autonomous repair of a broken pipeline (caused by NumPy/pandas version mismatch) using Sonnet 4.5 via a ReAct loop
Validates safety: because of branch isolation, failed agent attempts (e.g., by GPT-5-mini) left production data uncorrupted
Shows feasibility of 'proof-carrying' workflow where a verifier function automatically gates the merge of agent-generated data into production

Breakthrough Assessment

7/10

Strong conceptual contribution defining safety abstractions for data agents. The prototype is a feasibility demonstration rather than a large-scale benchmark, but the architectural insights on 'Git-for-Data' for agents are significant.

⚙️ Technical Details

Problem Definition

Setting: Autonomous repair of faulty data pipelines in a cloud lakehouse environment

Inputs: A failed pipeline execution (logs, code, error trace)

Outputs: A corrected pipeline branch that passes verification tests and can be merged to production

Pipeline Flow

Observe (fetch logs/errors) → Reason (diagnose issue) → Act (modify code/environment) → Verify (run on branch) → Merge (promote to production)
Group: Agentic Loop → Lakehouse Execution → Verification
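
The flow above can be sketched as a small retry loop. This is a hedged illustration, not the paper's code: `repair_loop`, `toy_run`, `scripted_fix`, and `toy_verify` are hypothetical, the LLM is replaced by a scripted function, and the rule that the toy pipeline only works with NumPy 1.x is invented to mimic the scenario.

```python
from typing import Callable, Optional


def repair_loop(
    run_on_branch: Callable[[dict], dict],
    propose_fix: Callable[[dict, str], dict],
    verify: Callable[[dict], bool],
    config: dict,
    max_attempts: int = 3,
) -> Optional[dict]:
    """Return a config whose run passed verification, or None if none did."""
    for _ in range(max_attempts):
        result = run_on_branch(config)                        # execute in isolation
        if result["ok"] and verify(result):                   # Verify
            return config                                     # caller may now Merge
        error = result.get("error", "verification failed")    # Observe
        config = propose_fix(config, error)                   # Reason + Act
    return None


def toy_run(config: dict) -> dict:
    # Made-up rule for illustration: this toy pipeline only works with NumPy 1.x.
    if not config["numpy"].startswith("1."):
        return {"ok": False, "error": "ImportError: numpy version mismatch"}
    return {"ok": True, "rows": 3}


def scripted_fix(config: dict, error: str) -> dict:
    # Stand-in for the LLM: pins NumPy when the error mentions it.
    if "numpy" in error:
        return {**config, "numpy": "1.26.4"}
    return config


def toy_verify(result: dict) -> bool:
    return result["rows"] > 0


fixed = repair_loop(toy_run, scripted_fix, toy_verify,
                    {"numpy": "2.0.0", "pandas": "2.0.0"})
print(fixed)  # {'numpy': '1.26.4', 'pandas': '2.0.0'}
```

Bounding the loop with `max_attempts` is a design choice of this sketch; an agent that never converges simply returns `None` and nothing is promoted.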

System Modules

Agent Framework

Orchestrates reasoning and tool use via the ReAct paradigm

Model or implementation: smolagents (library) with LLMs such as Sonnet 4.5 or GPT-5-mini, the models named in the evaluation

Bauplan MCP Server

Exposes lakehouse capabilities as tools to the agent

Model or implementation: Bauplan API wrapper

Programmable Lakehouse (Bauplan)

Executes data transformations in isolated branches

Model or implementation: Serverless FaaS runtime with copy-on-write storage

Verifier

Checks correctness of the output tables before merge

Model or implementation: Deterministic Python function (Branch → bool)
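
A verifier with the `Branch → bool` shape described above might look like the following. This is a sketch under stated assumptions: the `Branch` type (table name mapped to a list of row dicts), the `trips` table, and the specific checks are hypothetical and do not come from the paper.

```python
from typing import Dict, List

Branch = Dict[str, List[dict]]   # hypothetical: table name -> rows

REQUIRED_COLUMNS = {"trip_id", "fare"}


def verify_branch(branch: Branch) -> bool:
    """Deterministic check: same branch contents always give the same answer."""
    rows = branch.get("trips")
    if not rows:                                   # table missing or empty
        return False
    for row in rows:
        if not REQUIRED_COLUMNS.issubset(row):     # schema check on column names
            return False
        if row["fare"] is None or row["fare"] < 0:
            return False
    ids = [row["trip_id"] for row in rows]
    return len(ids) == len(set(ids))               # key column must be unique


ok_branch = {"trips": [{"trip_id": 1, "fare": 12.5}, {"trip_id": 2, "fare": 8.0}]}
dup_branch = {"trips": [{"trip_id": 1, "fare": 12.5}, {"trip_id": 1, "fare": 8.0}]}

print(verify_branch(ok_branch), verify_branch(dup_branch), verify_branch({}))
# True False False
```

Because the function has no side effects and depends only on the branch contents, it can be rerun on every agent attempt without affecting production.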

Novel Architectural Elements

Integration of Git-for-Data branching directly into the agentic loop as a safety sandbox
Use of 'proof-carrying' verifiers as an automated gatekeeper for agent-generated data changes

Modeling

Base Model: Sonnet 4.5 (the model that completed the repair in the demo); GPT-5-mini was also tested

Comparison to Prior Work

vs. Snowflake/dbt: Adds programmable isolation (branches) and unified code interface, enabling agents to safely retry on production data without side effects
vs. General Code Agents (e.g. Devin): Specifically models the *data* lifecycle (state), not just the code logic, ensuring data correctness via verifiers

Limitations

Prototype relies on a specific commercial lakehouse (Bauplan) for the branching implementation
Limited experimental scope (single repair scenario: NumPy/pandas mismatch)
Does not address massive parallelism or complex multi-table dependency graphs in depth
Cost and latency of the agentic loop for complex debugging are not analyzed

Reproducibility

Code: https://github.com/BauplanLabs/the-agentic-lakehouse

📊 Experiments & Results

Evaluation Setup

Proof-of-concept repair of a failing pipeline in a live lakehouse environment

Benchmarks:

Pipeline Repair Scenario (Automated Debugging) [New]

Metrics:

Success/Failure of repair
Safety (no corruption of production)
Statistical methodology: Not explicitly reported in the paper

Key Results

Feasibility demonstrations showing agents can safely interact with production data infrastructure.
Benchmark	Metric	Baseline	This Paper	Δ
Pipeline Repair Scenario	Success Rate	Successful	Successful	0
Pipeline Repair Scenario	Production Corruption Incidents	0	0	0

Experiment Figures

Visual representation of Git-for-Data branching logic (Run 1 vs Run 2)

Main Takeaways

Programmable lakehouses act as natural agentic environments because code + APIs provide a universal interface
Branch-then-merge semantics are critical for safety, effectively sandboxing agent actions from production state
Frontier models (Sonnet 4.5) can autonomously navigate complex debugging loops involving logs, code edits, and data checks
MCP (Model Context Protocol) is necessary but not sufficient; the underlying infrastructure must support isolation (branching) to be truly agent-safe

📚 Prerequisite Knowledge

Prerequisites

Understanding of data lakehouses and ETL pipelines
Familiarity with Git concepts (branch, commit, merge)
Basic knowledge of AI agents (ReAct loops, tool usage)

Key Terms

programmable lakehouse: A data architecture where all aspects (data, infra, pipelines) are exposed and manageable via code/APIs rather than GUI or disparate tools

Git-for-Data: Applying version control concepts (commits, branches, merges) to large-scale data tables, allowing isolated experimentation and atomic updates

proof-carrying code: A safety concept where untrusted code is accompanied by a formal proof or evidence (here, passing a verifier function) that it satisfies safety properties

MCP: Model Context Protocol—a standard for exposing server-side tools and data context to LLM agents

ReAct: Reason+Act—a paradigm where agents interleave reasoning steps with tool execution steps to solve complex tasks

copy-on-write: A storage optimization where data is copied only when modified, allowing efficient branching without duplicating the entire dataset initially

DAG: Directed Acyclic Graph—a representation of data pipelines where nodes are transformations and edges are dependencies
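
To illustrate the copy-on-write entry among the key terms above, here is a toy model in which a branch shares its parent's tables until the first write. `CowBranch` and its methods are hypothetical names. A real lakehouse would most likely track data files or snapshots rather than Python lists, so treat this only as a picture of the idea.

```python
from typing import Dict, List


class CowBranch:
    """Branch that shares parent tables until a table is written (toy model)."""

    def __init__(self, parent: Dict[str, List[dict]]) -> None:
        self._parent = parent
        self._local: Dict[str, List[dict]] = {}

    def read(self, table: str) -> List[dict]:
        # Prefer the branch's private copy; otherwise fall through to the parent.
        if table in self._local:
            return self._local[table]
        return self._parent[table]

    def append(self, table: str, row: dict) -> None:
        if table not in self._local:
            # First write: copy only this table's row list, leave the rest shared.
            self._local[table] = list(self._parent[table])
        self._local[table].append(row)

    def copied_tables(self) -> List[str]:
        return sorted(self._local)


main = {"orders": [{"id": 1}], "users": [{"id": 7}]}
branch = CowBranch(main)
branch.append("orders", {"id": 2})

print(len(main["orders"]))                    # 1: parent is untouched
print(len(branch.read("orders")))             # 2: branch sees its own write
print(branch.read("users") is main["users"])  # True: unwritten table is shared
print(branch.copied_tables())                 # ['orders']
```

Creating the branch costs nothing up front, which is what makes it cheap for an agent to open a fresh branch for every attempt.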