← Back to Paper List

Agentic Code Reasoning

Shubham Ugare, Satish Chandra
Meta
arXiv (2026)
Agent Reasoning Benchmark RL

📝 Paper Summary

Code Verification Static Analysis Agentic Reasoning
Semi-formal reasoning prompts agents to generate structured 'certificates' containing explicit execution traces and premises, enabling accurate semantic code analysis without requiring actual code execution.
Core Problem
LLM agents often guess about code behavior using unstructured reasoning, failing to verify semantic equivalence or identify subtle bugs when execution environments (sandboxes) are unavailable or too costly.
Why it matters:
  • Execution-based verification (sandboxing) is computationally expensive and slow for RL training pipelines
  • Formal verification methods are too brittle and difficult to scale to arbitrary multi-language repositories
  • Standard Chain-of-Thought allows agents to skip critical logic steps or hallucinate library behaviors
Concrete Example: In a Django issue (django-13670), standard reasoning incorrectly assumes `format()` is a built-in function and claims two patches are equivalent. Semi-formal reasoning traces the function call, discovering it is shadowed by a utility in `dateformat.py` that expects a `datetime` object, correctly identifying that one patch raises an `AttributeError`.
Key Novelty
Semi-formal Reasoning with Certificates
  • Replaces unstructured Chain-of-Thought with a structured 'certificate' template that the agent must fill out
  • Enforces explicit documentation of premises (facts) and step-by-step execution traces for specific test cases before the agent can state a conclusion
  • Requires the agent to perform interprocedural analysis (following function calls across files) to fill the template, preventing superficial guessing
Evaluation Highlights
  • Improves patch equivalence verification accuracy from 78% to 88% on curated challenging examples using Opus-4.5
  • Achieves 93% accuracy on real-world agent-generated patches (SWE-bench), approaching the reliability needed for execution-free RL reward signals
  • Boosts code question answering accuracy on RubberDuckBench to 87%, a 10.8 percentage point gain over single-shot baselines
Breakthrough Assessment
8/10
Strong practical contribution for RL code agents. By achieving >90% accuracy in execution-free verification, it unlocks much cheaper training loops. The 'certificate' method is a clever bridge between formal methods and LLM flexibility.
×