
daVinci-Dev: Agent-native Mid-training for Software Engineering

Ji Zeng, Dayuan Fu, Tiantian Mi, Yumin Zhuang, Yaxing Huang, Xuefeng Li, Lyumanshan Ye, Muhang Xie, Qishuo Hua, Zhen Huang, Mohan Jiang, Hanning Wang, Jifan Lin, Yang Xiao, Jie Sun, Yunze Wu, Pengfei Liu
SII, Shanghai Jiao Tong University, Generative AI Research Lab (GAIR)
arXiv (2026)
Agent · Pretraining · Benchmark · Reasoning

📝 Paper Summary

Code agents · Software engineering automation · LLM training methodologies
daVinci-Dev introduces agentic mid-training using large-scale GitHub Pull Requests reconstructed as agent-native trajectories, enabling models to internalize iterative software engineering workflows before fine-tuning.
Core Problem
Current code models suffer from a distribution mismatch: they are trained on static code snapshots but must operate as dynamic agents that navigate, edit, and test repositories iteratively.
Why it matters:
  • Post-training (SFT/RL) alone is insufficient because high-quality agent trajectories are scarce and expensive to collect at scale
  • Static training data obscures the decision process (how files were found, why edits were made), leaving models unprepared for the causal dependencies of real development
  • Existing mid-training approaches often factorize tasks (separating localization from editing), breaking the natural action-observation loop required for autonomous engineering
Concrete Example: A standard training sample shows a final committed file change. It misses the agent's struggle: searching for 'parse_date', reading 'utils/date.py', failing a test, reading the error log, and then revising the code. Models trained only on the final file don't learn this debugging loop.
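To make the contrast in the example concrete, the sketch below represents the same fix two ways: as a static sample that holds only the final change, and as an ordered list of action-observation steps. The `Step` class, the `render` helper, and all placeholder strings are hypothetical names chosen for illustration; they are not the paper's data format.

```python
from dataclasses import dataclass


@dataclass
class Step:
    """One turn of the loop (hypothetical structure, for illustration only)."""
    action: str       # what the agent did
    observation: str  # what the environment returned


# Static sample: only the final committed change is visible.
static_sample = {"file": "utils/date.py", "diff": "<final diff>"}

# Trajectory for the same fix, following the steps named in the example.
trajectory = [
    Step("search 'parse_date'", "match in utils/date.py"),
    Step("read utils/date.py", "<file contents>"),
    Step("run tests", "1 failed"),
    Step("read error log", "<traceback>"),
    Step("edit utils/date.py", "<revised patch applied>"),
]


def render(steps):
    """Serialize steps into one text sequence, preserving their order."""
    lines = []
    for i, step in enumerate(steps, 1):
        lines.append(f"[{i}] ACTION: {step.action}")
        lines.append(f"[{i}] OBSERVATION: {step.observation}")
    return "\n".join(lines)


print(render(trajectory))
```

The static sample has no field that could hold the failed test or the error log, which is exactly the information the trajectory form keeps.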
Key Novelty
Agent-Native Mid-Training (daVinci-Dev)
  • Reconstructs 'contextually-native' trajectories from 68.6B tokens of GitHub Pull Requests by bundling issue descriptions, retrieved file context, and sequential edits into a single coherent workflow
  • Augments this with 'environmentally-native' trajectories (3.1B tokens) collected from real agent rollouts in Docker containers, capturing authentic execution feedback (tests, errors) that static data misses
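The bundling described in the first bullet can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the `PullRequest` fields, the `build_trajectory` function, and the tag names are all hypothetical, chosen only to show issue text, retrieved file context, and ordered edits being joined into one sequence.

```python
from dataclasses import dataclass


@dataclass
class PullRequest:
    """Hypothetical PR record; field names are assumptions."""
    issue: str            # issue description
    context_files: dict   # path -> file content retrieved as context
    edits: list           # ordered (path, diff) pairs


def build_trajectory(pr):
    """Join issue, context, and sequential edits into one training sequence."""
    parts = [f"<issue>\n{pr.issue}\n</issue>"]
    for path, content in pr.context_files.items():
        parts.append(f"<file path={path!r}>\n{content}\n</file>")
    # Edits keep their original order so later edits can depend on earlier ones.
    for i, (path, diff) in enumerate(pr.edits, 1):
        parts.append(f"<edit step={i} path={path!r}>\n{diff}\n</edit>")
    return "\n".join(parts)


pr = PullRequest(
    issue="parse_date rejects ISO week dates",
    context_files={"utils/date.py": "<file contents>"},
    edits=[
        ("utils/date.py", "<diff 1>"),
        ("tests/test_date.py", "<diff 2>"),
    ],
)
print(build_trajectory(pr))
```

Keeping everything in a single sequence, rather than separate localization and editing samples, is what lets the model see which context preceded each edit.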
Evaluation Highlights
  • Achieves 58.5% Pass@1 on SWE-Bench Verified with a 72B model, surpassing the 48.6% of the previous best open recipe (Kimi-Dev)
  • The 32B model reaches 56.1% Pass@1, a new state of the art for open recipes at this scale, and outperforms some larger models
  • Zero-shot agentic capability (without SFT) rises from ~43.7% to 54.8% when PR-based data is mixed with trajectory data, indicating that the two data sources are complementary
Breakthrough Assessment
9/10
Significantly advances open-source code agent capabilities by formalizing agentic mid-training. Moving from static code pre-training to process-oriented mid-training is a scalable and highly effective change of paradigm.