daVinci-Dev: Agent-native Mid-training for Software Engineering

📝 Paper Summary

Code agents, Software engineering automation, LLM training methodologies

daVinci-Dev introduces agentic mid-training using large-scale GitHub Pull Requests reconstructed as agent-native trajectories, enabling models to internalize iterative software engineering workflows before fine-tuning.

Core Problem

Current code models suffer from a distribution mismatch: they are trained on static code snapshots but must operate as dynamic agents that navigate, edit, and test repositories iteratively.

Why it matters:

Post-training (SFT/RL) alone is insufficient because high-quality agent trajectories are scarce and expensive to collect at scale
Static training data obscures the decision process (how files were found, why edits were made), leaving models unprepared for the causal dependencies of real development
Existing mid-training approaches often factorize tasks (separating localization from editing), breaking the natural action-observation loop required for autonomous engineering

Concrete Example: A standard training sample shows a final committed file change. It misses the agent's struggle: searching for 'parse_date', reading 'utils/date.py', failing a test, reading the error log, and then revising the code. Models trained only on the final file don't learn this debugging loop.

Key Novelty

Agent-Native Mid-Training (daVinci-Dev)

Reconstructs 'contextually-native' trajectories from 68.6B tokens of GitHub Pull Requests by bundling issue descriptions, retrieved file context, and sequential edits into a single coherent workflow
Augments this with 'environmentally-native' trajectories (3.1B tokens) collected from real agent rollouts in Docker containers, capturing authentic execution feedback (tests, errors) that static data misses

Evaluation Highlights

Achieves 58.5% Pass@1 on SWE-Bench Verified with a 72B model, surpassing the previous best open recipe (Kimi-Dev) of 48.6%
The 32B model reaches 56.1% Pass@1, a new state of the art for open recipes at this scale, and outperforms some larger models
Zero-shot agentic capability (without SFT) jumps from ~43.7% to 54.8% when mixing PR-based data with trajectory data, showing strong synergy

Breakthrough Assessment

9/10

Significantly advances open-source code agent capabilities by formalizing agentic mid-training. Moving from static code pre-training to process-oriented mid-training proves both scalable and highly effective.

⚙️ Technical Details

Problem Definition

Setting: Agentic software engineering task (R, q, E), where R is repository state, q is issue description, E is evaluation oracle (tests).

Inputs: Repository codebase and a natural language issue description.

Outputs: A sequence of actions (navigation, editing, testing) leading to a patch that resolves the issue.
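One way to picture this setting is as plain data types. The sketch below is illustrative only: the names `Task`, `Step`, and `Trajectory` and all of their fields are assumptions introduced here, not structures from the paper, and the oracle E is modeled as a callable that returns True when the tests pass.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical types: names and fields are illustrative, not from the paper.
Repo = Dict[str, str]  # R: file path -> file contents


@dataclass
class Task:
    repo: Repo                      # R: repository state
    issue: str                      # q: natural-language issue description
    oracle: Callable[[Repo], bool]  # E: returns True if the tests pass


@dataclass
class Step:
    action: str       # e.g. "search", "read", "edit", "test"
    argument: str     # what the action was applied to
    observation: str  # what the environment returned


@dataclass
class Trajectory:
    task: Task
    steps: List[Step] = field(default_factory=list)

    def resolved(self, final_repo: Repo) -> bool:
        """A trajectory succeeds when the oracle accepts the final repository."""
        return self.task.oracle(final_repo)


# Toy usage: the "tests" only check that a format string was fixed.
task = Task(
    repo={"utils/date.py": "FORMAT = '%d-%m-%Y'\n"},
    issue="parse_date should accept ISO dates",
    oracle=lambda repo: "%Y-%m-%d" in repo["utils/date.py"],
)
traj = Trajectory(task)
traj.steps.append(Step("read", "utils/date.py", task.repo["utils/date.py"]))
patched = {"utils/date.py": "FORMAT = '%Y-%m-%d'\n"}
traj.steps.append(Step("edit", "utils/date.py", "edit applied"))
print(traj.resolved(task.repo), traj.resolved(patched))
```

The output of the task is the whole `steps` list plus the final patch, not just the patch, which is the distinction the summary draws between static snapshots and agent trajectories.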

Pipeline Flow

Data Construction: GitHub PRs → Contextually-native data (D_ctx)
Data Construction: Agent Rollouts → Environmentally-native data (D_env)
Mid-Training: Base Model → daVinci-Dev (MT)
Post-Training: daVinci-Dev → SFT on Agent Trajectories

System Modules

Data Constructor (Contextual) (Data Synthesis)

Reconstructs PRs into agent workflows: Issue → File Search (simulated via diffs) → Read → Edit sequences

Model or implementation: Qwen2.5-72B-Instruct (used for summarization)
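The Issue → File Search → Read → Edit ordering can be sketched as a small serialization function. This is a minimal illustration under stated assumptions: the record layout (`issue`, `files`, `diffs`) and the bracketed section tags are hypothetical, and the LLM summarization step mentioned above is left out.

```python
from typing import Dict, List


def reconstruct_trajectory(pr: Dict) -> str:
    """Bundle issue, touched-file context, and edits into one training document.

    `pr` is a hypothetical record with keys "issue", "files" (path -> contents
    before the change), and "diffs" (list of (path, diff_text) in commit order).
    """
    parts: List[str] = [f"[ISSUE]\n{pr['issue']}"]

    # File search is simulated: the files to "find" are those the diffs touch.
    touched: List[str] = []
    for path, _ in pr["diffs"]:
        if path not in touched:
            touched.append(path)
    parts.append("[SEARCH]\n" + "\n".join(touched))

    # Read each touched file before any edit is shown.
    for path in touched:
        parts.append(f"[READ] {path}\n{pr['files'].get(path, '')}")

    # Emit the edits in their original order.
    for path, diff in pr["diffs"]:
        parts.append(f"[EDIT] {path}\n{diff}")

    return "\n\n".join(parts)


toy_pr = {
    "issue": "parse_date rejects ISO dates",
    "files": {"utils/date.py": "FORMAT = '%d-%m-%Y'"},
    "diffs": [("utils/date.py", "-FORMAT = '%d-%m-%Y'\n+FORMAT = '%Y-%m-%d'")],
}
doc = reconstruct_trajectory(toy_pr)
print(doc)
```

Keeping everything in one document, rather than emitting separate localization and editing samples, is what the summary means by bundling.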

Data Constructor (Environmental) (Data Synthesis)

Runs agents in Docker to collect real execution traces (Pass & Fail)

Model or implementation: GLM-4 (agent policy)
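A rollout collector of this kind can be sketched as an action-observation loop. Everything below is a stand-in: `ToyEnv` replaces the Docker sandbox with an in-memory file, `scripted_policy` replaces the agent model, and the tool names are assumptions. The one detail taken from the description above is that the record is kept whether the run passes or fails.

```python
from typing import Callable, Dict, List, Tuple

Action = Tuple[str, str]  # (tool name, argument)


class ToyEnv:
    """In-memory stand-in for a sandbox: one file and one 'test'."""

    def __init__(self) -> None:
        self.files = {"utils/date.py": "FORMAT = '%d-%m-%Y'"}

    def step(self, action: Action) -> str:
        tool, arg = action
        if tool == "read":
            return self.files.get(arg, "No such file")
        if tool == "edit":
            self.files["utils/date.py"] = arg
            return "edit applied"
        if tool == "test":
            ok = "%Y-%m-%d" in self.files["utils/date.py"]
            return "PASSED" if ok else "FAILED: expected ISO format"
        return f"unknown tool: {tool}"


def rollout(
    policy: Callable[[List[Dict[str, str]]], Action],
    env: ToyEnv,
    max_steps: int = 8,
) -> Dict:
    """Run the policy, logging every action and the observation it produced."""
    history: List[Dict[str, str]] = []
    passed = False
    for _ in range(max_steps):
        action = policy(history)
        observation = env.step(action)
        history.append(
            {"tool": action[0], "arg": action[1], "observation": observation}
        )
        if action[0] == "test" and observation == "PASSED":
            passed = True
            break
    # The record is returned for both outcomes; `passed` only labels it.
    return {"steps": history, "passed": passed}


def scripted_policy(history: List[Dict[str, str]]) -> Action:
    """Fixed script standing in for a model: read, test, edit, test again."""
    script = [
        ("read", "utils/date.py"),
        ("test", ""),
        ("edit", "FORMAT = '%Y-%m-%d'"),
        ("test", ""),
    ]
    return script[min(len(history), len(script) - 1)]


record = rollout(scripted_policy, ToyEnv())
for step in record["steps"]:
    print(step["tool"], "->", step["observation"])
```

The failing test in the second step and the revised edit that follows it are exactly the kind of execution feedback that a diff-only sample cannot contain.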

Mid-Training

Continual pre-training on the synthesized agent-native corpora

Model or implementation: Qwen2.5-Base (32B/72B)

Novel Architectural Elements

Pipeline topology that integrates Environmentally-native trajectories (execution feedback) into the Mid-Training stage, rather than reserving them for SFT or RL

Modeling

Base Model: Qwen2.5-72B-Base and Qwen2.5-32B-Base

Training Method: Mid-training (next-token prediction) followed by Supervised Fine-Tuning (SFT)

Training Data:

D_ctx (Contextually-native): 68.6B tokens from GitHub PRs
D_env (Environmentally-native): 3.1B tokens from agent rollouts
SFT Data: D_env_pass (0.7B) or SWE-smith (0.11B)

Key Hyperparameters:

max_length: 32k (PR data), 128k (Trajectories)
upsample_rate: 3x for D_env_pass
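As a rough check on how the mixture adds up, the arithmetic below combines the token counts listed under Training Data with the 3x upsample rate. It rests on two assumptions that the summary does not state: that D_env_pass is a subset of D_env, and that upsampling simply repeats those samples.

```python
# Token counts in billions, taken from the figures listed above.
D_CTX = 68.6      # contextually-native PR data
D_ENV = 3.1       # all environmentally-native rollouts
D_ENV_PASS = 0.7  # passing rollouts
UPSAMPLE = 3      # upsample_rate for D_env_pass

# Assumption: D_env_pass is already counted inside D_env, so a 3x upsample
# adds (UPSAMPLE - 1) extra copies of the passing rollouts.
extra_pass = (UPSAMPLE - 1) * D_ENV_PASS
effective = D_CTX + D_ENV + extra_pass
env_share = (D_ENV + extra_pass) / effective

print(f"effective tokens: {effective:.1f}B")
print(f"environment share: {env_share:.1%}")
```

Under those assumptions the total comes to about 73B tokens, of which the rollout data is a small fraction even after upsampling.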

Compute: Not reported in the paper

Comparison to Prior Work

vs. Kimi-Dev: Uses bundled 'contextually-native' PR data instead of factorized tasks; integrates execution trajectories into mid-training; uses less than half the training tokens (73B vs 150B)
vs. SWE-Agent-LM: Applies mid-training to a general base model (Qwen2.5-Base) rather than just fine-tuning a coder model
vs. CodeAct [not cited in paper]: Focuses on mid-training data synthesis rather than just agent interaction format/prompting

Limitations

Privacy concerns: Developer identifiers in PR text were not explicitly removed.
Evaluation sensitivity: Results depend on a patched SWE-Bench harness, introducing potential variance.
Scope: Evaluated primarily on Qwen2.5 family and SWE-Bench; generalization to other architectures is untested.

Reproducibility

Code: https://github.com/SII-GAIR/daVinci-Dev

Artifacts promised: code, datasets, and checkpoints to be open-sourced.
Missing: specific compute hours/resources used for the mid-training run.
Closed-source dependencies: Data construction uses Qwen2.5-Instruct and GLM-4, but the final models are open weights.

📊 Experiments & Results

Evaluation Setup

Agentic software engineering tasks where an agent must resolve GitHub issues.

Benchmarks:

SWE-Bench Verified (Repository-level code generation and debugging)
HumanEval (Function-level code generation)
GPQA (Scientific reasoning QA)

Metrics:

Pass@1 (resolution rate)
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Main results on SWE-Bench Verified showing daVinci-Dev outperforms baselines across model sizes.
SWE-Bench Verified	Pass@1	48.6	58.5	+9.9
SWE-Bench Verified	Pass@1	53.0	56.1	+3.1
Generalization capabilities on standard coding and scientific reasoning benchmarks.
HumanEval	Pass@1	58.16	81.42	+23.26
GPQA-Main	Accuracy	43.30	44.87	+1.57

Experiment Figures

Scaling law of agent-native mid-training: Pass@1 on SWE-Bench Verified vs Training Steps.

Main Takeaways

Contextually-native PR data (bundling context+edits) is far more token-efficient than factorized approaches, beating Kimi-Dev with half the tokens.
Environmentally-native trajectories (real execution) add a further boost (~1-3%) even when used in Mid-Training, not just SFT.
Agentic mid-training also transfers beyond agentic tasks: HumanEval Pass@1 rises by 23.26 points (58.16 to 81.42), while the gain on scientific reasoning (GPQA-Main) is much smaller at 1.57 points (43.30 to 44.87).
Scaling laws apply: performance scales log-linearly with the number of training steps on agent-native data.
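Read literally, a log-linear trend means that Pass@1 rises by a roughly constant amount each time the number of training steps is multiplied by a fixed factor. A minimal way to write this, where s is the number of steps and a and b are fitted constants (symbols introduced here for illustration, with no values taken from the paper):

```latex
\mathrm{Pass@1}(s) \approx a + b \,\log s,
\qquad
\mathrm{Pass@1}(2s) - \mathrm{Pass@1}(s) \approx b \,\log 2 .
```

The second expression shows that, under this form, doubling the step count is worth the same increment early or late in training.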

📚 Prerequisite Knowledge

Prerequisites

Large Language Model (LLM) training stages (Pre-training, Mid-training, Post-training)
Software Engineering workflows (Pull Requests, Commits, Unit Tests)
Agentic frameworks (Action-Observation loops)

Key Terms

Mid-training (MT): An intermediate training stage between pre-training and fine-tuning, using domain-specific data at scale to shift model capabilities

SFT: Supervised Fine-Tuning—training on high-quality demonstrations to teach specific behaviors

Pull Request (PR): A proposal to merge code changes in version control systems like GitHub, containing commits, descriptions, and code diffs

SWE-Bench Verified: A benchmark for evaluating software engineering agents on real-world GitHub issues, consisting of a verified subset of tasks

Contextually-native trajectories: Training data reconstructed from PRs that preserves the logical flow: Issue → Context Retrieval → Edits, simulating an agent's information state

Environmentally-native trajectories: Training data recorded from live agent interactions with a compiler/interpreter, capturing real tool outputs and execution feedback (e.g., test failures)

Pass@1: The percentage of problems solved correctly on the first attempt
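As a small illustration of the definition above, Pass@1 over a set of tasks with one attempt each is the share of attempts the tests accept. The function name and the outcomes below are made up for the example.

```python
from typing import Sequence


def pass_at_1(resolved: Sequence[bool]) -> float:
    """Percentage of tasks whose single attempt was accepted by the tests."""
    if not resolved:
        raise ValueError("need at least one task")
    return 100.0 * sum(resolved) / len(resolved)


outcomes = [True, False, True, True]  # made-up outcomes for four tasks
score = pass_at_1(outcomes)
print(score)  # 75.0
```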

Scaffold: The software framework wrapping the LLM that handles tool execution, memory management, and environment interaction (e.g., SWE-agent)