Towards a Neural Debugger for Python

📝 Paper Summary

Self-evolving Agentic reasoning Tool-use post-training

Neural debuggers are language models trained on execution traces to simulate interactive debugging actions like stepping and breakpoints, enabling both forward and inverse program execution prediction.

Core Problem

Existing neural interpreters execute code strictly line-by-line, failing to model the interactive, non-sequential usage of debuggers (breakpoints, stepping over) that developers rely on.

Why it matters:

Developers rarely execute programs sequentially; they jump between relevant states to isolate bugs, a behavior current models cannot simulate
Traditional debugging requires a live execution environment, which is often unavailable or restricted in synthesis and testing scenarios
Agentic coding systems lack a world model for debugging that allows them to plan, verify, and repair code without expensive re-execution

Concrete Example: A developer sets a breakpoint at a specific line inside a loop to inspect a variable. A standard neural interpreter must predict every intermediate line before reaching that point, whereas a developer (and a neural debugger) jumps directly to the state of interest.

Key Novelty

Neural Debugger as an MDP

Models the debugger as a Markov Decision Process where states are program locations/variables and actions are debugger commands (step_into, step_over, breakpoint)
Constructs a state tree from execution traces to define valid transitions, allowing the model to learn non-sequential jumps and call-stack navigation
Supports inverse execution (predicting past states or inputs from a current state) by reversing the transition tree and repurposing actions

Architecture

The data pipeline for creating neural debuggers: from execution traces to state trees, trajectory sampling, and final tokenization.

Evaluation Highlights

The 32B-parameter neural debugger achieves >90% accuracy in predicting the next state across key actions (step into, step over, step return, breakpoint)
Fine-tuned 32B model achieves 83.2 pass@1 on CruxEval output prediction, reported as outperforming base models, though no baseline figure is given here
1.8B model trained from scratch achieves 53.6 pass@1 on CruxEval input prediction, demonstrating strong inverse execution capabilities

Breakthrough Assessment

8/10

Novel formulation of execution prediction as an interactive debugging MDP. Enables capabilities like inverse execution and constant-time jumps that standard neural interpreters lack.

⚙️ Technical Details

Problem Definition

Setting: Modeling a debugger as a Markov Decision Process (MDP) defined by the tuple (S, A, P, R, s0): states, actions, transition function, reward function, and initial state

Inputs: Source code context and a sequence of debugger history (previous states and actions)

Outputs: Next program state (variables, source line, event type) conditioned on the chosen debugger action
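
As a rough illustration of this input/output contract, the sketch below models a debugger state and serializes one prediction request. The names `DebuggerState`, `Action`, and `build_prompt`, as well as the bracketed text layout, are hypothetical; the paper's actual tokenization format is not described in this summary.

```python
from dataclasses import dataclass, field
from enum import Enum


class Action(Enum):
    # Action names follow the summary; the serialized strings are assumptions.
    STEP_INTO = "step_into"
    STEP_OVER = "step_over"
    STEP_RETURN = "step_return"
    BREAKPOINT = "breakpoint"


@dataclass
class DebuggerState:
    line: int                                       # current source line
    event: str                                      # e.g. "call", "line", "return"
    local_vars: dict = field(default_factory=dict)  # variable name -> repr string


def build_prompt(source, history):
    """Serialize code context plus (state, action) pairs.

    The last action in `history` is the chosen debugger command; the model
    would be asked to continue the text with the resulting state.
    """
    parts = ["[CODE]", source]
    for state, action in history:
        parts.append(
            f"[STATE] line={state.line} event={state.event} locals={state.local_vars}"
        )
        parts.append(f"[ACTION] {action.value}")
    parts.append("[STATE]")  # cue for the state to be predicted
    return "\n".join(parts)


source = "def f(x):\n    y = x + 1\n    return y\n"
history = [
    (DebuggerState(1, "call", {"x": "2"}), Action.STEP_INTO),
    (DebuggerState(2, "line", {"x": "2"}), Action.STEP_OVER),
]
prompt = build_prompt(source, history)

# For f(2), the state a correct model should describe after the step_over:
expected_next = DebuggerState(3, "line", {"x": "2", "y": "3"})
```

Keeping the chosen action as the final element before the prediction cue mirrors the conditioning described above: the same history with a different last action should yield a different target state.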

Pipeline Flow

Trace Collection (via sys.settrace)
State Tree Construction (Forward & Inverse)
Trajectory Sampling (Stochastic Policy)
Tokenization & Model Training

System Modules

Trace Collector

Execute Python code and record stack frames, events, and variables using sys.settrace

Model or implementation: Standard Python Interpreter instrumented with custom trace function
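
A minimal sketch of such a collector, assuming a record layout of event, function name, line number, and `repr` of each local. Only `sys.settrace` and `sys.gettrace` are real APIs here; `collect_trace` and the record fields are illustrative choices, not the paper's implementation.

```python
import sys


def collect_trace(func, *args, **kwargs):
    """Run func under a trace function and record every call/line/return/exception event."""
    records = []

    def tracer(frame, event, arg):
        if event in ("call", "line", "return", "exception"):
            records.append({
                "event": event,
                "func": frame.f_code.co_name,
                "line": frame.f_lineno,
                # repr() keeps the snapshot immutable even if the object changes later
                "locals": {name: repr(value) for name, value in frame.f_locals.items()},
            })
        return tracer  # keep tracing inside this frame

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = func(*args, **kwargs)
    finally:
        sys.settrace(previous)  # always restore the earlier trace function
    return result, records


def square(x):
    return x * x


def sum_of_squares(n):
    total = 0
    for i in range(n):
        total += square(i)
    return total


result, records = collect_trace(sum_of_squares, 3)
calls_to_square = sum(
    1 for r in records if r["event"] == "call" and r["func"] == "square"
)
```

Here `records` starts with the `call` event of `sum_of_squares` and ends with its `return` event, with three nested `square` frames in between. That nesting is what a later stage can turn into a call-stack hierarchy.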

State Tree Builder (Data Processing)

Reconstruct call stack hierarchy from sequential traces to define valid debugger transitions (step over vs step into)

Model or implementation: Algorithmic Tree Construction
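
One way to picture this step, as a sketch under simplifying assumptions: each trace event is annotated with its call depth and a pointer to the `call` event that opened its frame (a flattened view of the tree), and debugger actions become searches over those annotations. The function names and the tuple layout `(event, func, line)` are hypothetical, and edge cases such as exceptions are ignored.

```python
def build_state_tree(events):
    """Annotate flat (event, func, line) tuples with depth and owning-frame index."""
    nodes, stack = [], []  # stack holds indices of the "call" events of open frames
    for i, (event, func, line) in enumerate(events):
        if event == "call":
            stack.append(i)
        nodes.append({"event": event, "func": func, "line": line,
                      "frame": stack[-1], "depth": len(stack)})
        if event == "return":
            stack.pop()
    return nodes


def step_into(nodes, i):
    """Next event in execution order, entering calls."""
    return i + 1 if i + 1 < len(nodes) else None


def step_over(nodes, i):
    """Next event that is not inside a deeper frame than state i."""
    for j in range(i + 1, len(nodes)):
        if nodes[j]["depth"] <= nodes[i]["depth"]:
            return j
    return None


def step_return(nodes, i):
    """The return event of the frame that state i belongs to."""
    for j in range(i + 1, len(nodes)):
        if nodes[j]["frame"] == nodes[i]["frame"] and nodes[j]["event"] == "return":
            return j
    return None


def run_to_breakpoint(nodes, i, line):
    """Next line event at the given source line, however many steps away."""
    for j in range(i + 1, len(nodes)):
        if nodes[j]["event"] == "line" and nodes[j]["line"] == line:
            return j
    return None


events = [
    ("call", "main", 1),      # 0
    ("line", "main", 2),      # 1: calls helper
    ("call", "helper", 5),    # 2
    ("line", "helper", 6),    # 3
    ("return", "helper", 6),  # 4
    ("line", "main", 3),      # 5
    ("return", "main", 3),    # 6
]
nodes = build_state_tree(events)
```

From state 1, `step_into` lands on state 2 (inside `helper`) while `step_over` lands on state 5 (the next line of `main`), which is the distinction the state tree exists to encode.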

Trajectory Sampler (Data Processing)

Sample sequences of (state, action) pairs from the state tree using a stochastic policy to mimic diverse debugging sessions

Model or implementation: Stochastic Policy (Mixed Categorical)
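
The sampling idea can be sketched as a weighted random walk over valid transitions. The transition table and action weights below are made up for illustration; the actual policy is the one given in Table A.1 of the paper, which this summary does not reproduce.

```python
import random

# Hypothetical transition table: state id -> {action: next state id}.
TRANSITIONS = {
    0: {"step_into": 1, "step_over": 1},
    1: {"step_into": 2, "step_over": 5, "step_return": 6},
    2: {"step_into": 3, "step_over": 3, "step_return": 4},
    3: {"step_into": 4, "step_over": 4, "step_return": 4},
    4: {"step_into": 5, "step_over": 5},
    5: {"step_into": 6, "step_over": 6, "step_return": 6},
    6: {},  # terminal state: program has finished
}

# Assumed categorical weights over actions (not the paper's values).
ACTION_WEIGHTS = {"step_into": 0.4, "step_over": 0.4, "step_return": 0.2}


def sample_trajectory(transitions, weights, start=0, seed=None, max_steps=50):
    """Sample (state, action, next_state) triples until a terminal state is reached."""
    rng = random.Random(seed)
    state, trajectory = start, []
    for _ in range(max_steps):
        valid = list(transitions[state])
        if not valid:
            break
        # Restrict the categorical distribution to actions valid in this state.
        action = rng.choices(valid, weights=[weights[a] for a in valid], k=1)[0]
        next_state = transitions[state][action]
        trajectory.append((state, action, next_state))
        state = next_state
    return trajectory


trajectory = sample_trajectory(TRANSITIONS, ACTION_WEIGHTS, seed=0)
```

Different seeds give different paths through the same trace, which is how a single execution can yield many distinct training sequences.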

Neural Debugger Model

Predict the next program state given code context and history of states/actions

Model or implementation: Transformer LLM (1.8B scratch, 32B finetuned)

Novel Architectural Elements

State Tree representation of execution traces to mathematically define debugger transitions (step over/return) as tree traversals
Inverse State Tree formulation enabling unified training for forward and inverse execution within the same framework
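
One plausible reading of the inverse formulation, offered only as an assumption: reverse the event order and swap the roles of `call` and `return`, so that frames still nest correctly when walked backwards and the same depth-based `step_over` logic can be reused. The summary does not spell out the paper's actual construction, and all names below are hypothetical.

```python
SWAP = {"call": "return", "return": "call"}


def reverse_trace(events):
    """Reverse execution order; a frame is now 'entered' at its original return."""
    return [(SWAP.get(event, event), func, line)
            for event, func, line in reversed(events)]


def depths(events):
    """Call depth of every event, computed the same way for either direction."""
    out, depth = [], 0
    for event, _, _ in events:
        if event == "call":
            depth += 1
        out.append(depth)
        if event == "return":
            depth -= 1
    return out


def step_over(events, i):
    """Next event (in list order) that is not inside a deeper frame."""
    d = depths(events)
    for j in range(i + 1, len(events)):
        if d[j] <= d[i]:
            return j
    return None


forward = [
    ("call", "main", 1),
    ("line", "main", 2),      # calls helper
    ("call", "helper", 5),
    ("line", "helper", 6),
    ("return", "helper", 6),
    ("line", "main", 3),
    ("return", "main", 3),
]
backward = reverse_trace(forward)

# Starting at "line main 3" in the reversed trace, step over helper's frame.
start = backward.index(("line", "main", 3))
landed = backward[step_over(backward, start)]
```

In this toy trace the backward `step_over` from line 3 of `main` lands on line 2 of `main`, skipping the whole `helper` frame, which mirrors the forward behaviour with the direction flipped.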

Modeling

Base Model: CWM (32B parameters) for finetuning; Unnamed 1.8B Transformer for pre-training

Training Method: Supervised Fine-Tuning (SFT) and Pre-training

Objective Functions:

Purpose: Predict the next token in the debugger trajectory sequence.

Formally: Standard cross-entropy loss over the tokenized state-action sequences.
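
For reference, a dependency-free sketch of next-token cross-entropy: the mean negative log-probability assigned to each target token. Whether the paper masks the loss to particular tokens (for example, state tokens only) is not stated in this summary, so the version below averages over every position.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized logit vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def next_token_loss(logits_per_position, target_ids):
    """Mean cross-entropy between predicted distributions and target token ids."""
    losses = [
        -log_softmax(logits)[target]
        for logits, target in zip(logits_per_position, target_ids)
    ]
    return sum(losses) / len(losses)


# Toy example: vocabulary of 4 tokens, 2 positions, uniform predictions.
uniform_loss = next_token_loss([[0.0] * 4, [0.0] * 4], [1, 3])

# A confident, correct prediction drives the loss towards zero.
confident_loss = next_token_loss([[10.0, 0.0, 0.0, 0.0]], [0])
```

With uniform logits the loss equals log(4), the entropy of a uniform guess over four tokens.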

Adaptation: Full fine-tuning (for 32B model)

Training Data:

15B tokens of repository-level trajectories
100B tokens of function-level trajectories
Derived from CWM dataset (120M functions, 21k repos)
Trajectories sampled using stochastic policy (Table A.1 in paper)

Key Hyperparameters:

total_tokens_scratch_1.8B: 150B
context_length: Not explicitly reported in the paper

Compute: Not reported in the paper

Comparison to Prior Work

vs. CWM: CWM predicts strictly sequential execution; Neural Debugger predicts conditional execution based on actions (step over, breakpoint) and supports inverse execution
vs. Nye et al.: Neural Debugger models non-sequential access (breakpoints) and full variable states rather than just line visits
vs. Reverse Debuggers (traditional): Neural Debugger can infer predecessor states from arbitrary points without a prior forward run (generative inverse execution), whereas traditional reverse debuggers rely on deterministic replay

Limitations

Inverse execution is inherently ambiguous (many-to-one mapping), requiring sampling rather than deterministic prediction
Requires running the data generation pipeline and training on its output; not a zero-shot capability of standard LLMs
Does not currently optimize a reward function (RL) to act as an autonomous debugging agent; it focuses solely on state prediction

Reproducibility

No code or model weights provided in the paper. The dataset construction relies on the CWM (Code World Model) dataset. Implementation details of the trace collector and state tree builder are described textually.

📊 Experiments & Results

Evaluation Setup

Next-state prediction accuracy on held-out debugger trajectories and transfer evaluation on CruxEval benchmarks

Benchmarks:

Internal Debugger Trace Test Set (Next-state prediction) [New]
CruxEval (Input Prediction and Output Prediction)

Metrics:

Next State Accuracy (Match of Line, Event, Locals)
pass@1 (CruxEval)
Statistical methodology: Not explicitly reported in the paper
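
The two metrics can be sketched as follows. `next_state_accuracy` assumes exact match on the line, event, and locals fields named above. `pass_at_k` is the widely used unbiased estimator for n samples with c correct; the summary does not say whether the paper uses this estimator or a single greedy sample, so treat it as an assumption.

```python
from math import comb


def next_state_accuracy(predictions, references, fields=("line", "event", "locals")):
    """Fraction of predicted states that match the reference on every listed field."""
    matches = sum(
        all(pred[f] == ref[f] for f in fields)
        for pred, ref in zip(predictions, references)
    )
    return matches / len(references)


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples of which c are correct."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


references = [
    {"line": 3, "event": "line", "locals": {"x": "2", "y": "3"}},
    {"line": 3, "event": "return", "locals": {"x": "2", "y": "3"}},
]
predictions = [
    {"line": 3, "event": "line", "locals": {"x": "2", "y": "3"}},
    {"line": 3, "event": "return", "locals": {"x": "2", "y": "4"}},  # wrong local
]
accuracy = next_state_accuracy(predictions, references)
line_only = next_state_accuracy(predictions, references, fields=("line",))
estimate = pass_at_k(n=10, c=3, k=1)
```

Passing a narrower `fields` tuple gives per-field accuracy, such as the line-level figure reported in the results table.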

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Internal Debugger Trace Test Set	Next State Accuracy (Line)	Not applicable	>90%	Not applicable
CruxEval Output Prediction	pass@1	Not reported in the paper	83.2	Not reported in the paper
CruxEval Output Prediction	pass@1	Not applicable	57.7	Not applicable
CruxEval Input Prediction	pass@1	Not applicable	53.6	Not applicable

CruxEval performance demonstrates the model's ability to transfer debugger training to general code reasoning tasks.

Experiment Figures

Visualization of the State Tree and transition dynamics for both forward and inverse execution.

Statistics of the generated training data, comparing function-level and repository-level traces.

Main Takeaways

Neural debuggers can simulate interactive debugger actions (stepping, breakpoints) with >90% next-state accuracy
Training on debugger actions enables strong inverse execution capabilities (inferring inputs from outputs), as shown by CruxEval Input Prediction results
Stochastic trajectory sampling effectively serves as data augmentation, allowing robust training on finite execution traces
The approach generalizes well to standard code reasoning benchmarks (CruxEval) without explicit training on those specific tasks

📚 Prerequisite Knowledge

Prerequisites

Understanding of the Python execution model (stack frames, bytecode)
Familiarity with standard debugger operations (step into, step over, breakpoints)
Basic knowledge of Transformer language models and next-token prediction

Key Terms

MDP: Markov Decision Process, a mathematical framework for modeling decision-making where outcomes are partly random and partly under the control of a decision maker

execution trace: A record of the sequence of states (variable values, current line) a program goes through during execution

step_over: A debugger action that executes the current line; if the line is a function call, the function is executed in its entirety and the debugger pauses at the next line in the current scope

step_into: A debugger action that enters the function called at the current line, pausing at the first line inside that function

inverse execution: Inferring plausible previous program states or input arguments given a current program state (essentially running the code backwards)

CruxEval: A benchmark for evaluating code reasoning, specifically focusing on input prediction (generating inputs that yield a given output) and output prediction

sys.settrace: A Python function that allows registering a callback to monitor execution events (call, line, return, exception), used here to collect data