
Execution-Grounded Credit Assignment for GRPO in Code Generation

Abhijit Kumar, Natalya Kumar, Shikhar Gupta
arXiv (2026)
RL Reasoning

📝 Paper Summary

Code Generation · Reinforcement Learning with Verifiable Rewards (RLVR) · Credit Assignment
EGCA improves code generation by identifying the first executed token where a near-correct candidate diverges from a reference solution, assigning precise credit instead of uniform rewards.
Core Problem
Standard unit-test rewards are temporally coarse, applying a single pass/fail signal to an entire program rather than the specific decision causing failure.
Why it matters:
  • Modern models often produce syntactically valid and structurally plausible code that fails due to subtle localized semantic errors.
  • Group-based policy gradients (like GRPO) distribute outcome signals uniformly, providing gradients too diffuse to correct these localized reasoning errors.
  • Existing dense feedback methods (like step-level masking) cannot isolate the causal error in programs that execute to completion.
Concrete Example: A generated program might be structurally correct and execute fully but fail a test because of a single incorrect condition or off-by-one error. Standard GRPO penalizes the entire program sequence equally, failing to pinpoint the specific token responsible for the logic error.
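To make the coarseness concrete, here is a minimal sketch (not the paper's code) of how vanilla GRPO computes one group-normalized advantage per sampled program and broadcasts it uniformly to every token, including tokens that were correct:

```python
# Minimal illustration of uniform GRPO credit assignment.
# Each sampled program in a group gets a single pass/fail reward;
# the group-normalized advantage is applied to every one of its tokens.

def grpo_token_advantages(rewards, token_counts):
    """rewards: one scalar reward per sampled program in a group.
    token_counts: number of tokens in each sampled program.
    Returns per-token advantages (identical within each program)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5 or 1.0  # guard against a zero-variance group
    return [[(r - mean) / std] * n for r, n in zip(rewards, token_counts)]

# One failing program (reward 0) among three: all five of its tokens
# receive the same negative advantage, even if only a single condition
# token caused the failure.
advs = grpo_token_advantages([1.0, 1.0, 0.0], [5, 5, 5])
```

In this toy group, every token of the failing program gets advantage -√2, which is exactly the diffuse signal the summary describes.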
Key Novelty
Execution-Grounded Credit Assignment (EGCA)
  • Routes samples through deterministic gates (syntax/constraint/logic); for logic errors, it compares execution traces against a canonical reference to find the first divergence.
  • Assigns advantage only to the causal token span identified by the divergence and masks all downstream tokens, concentrating the gradient signal where it matters.
  • Operates entirely without a learned critic or auxiliary value function, modifying only the token-level weighting within the standard GRPO objective.
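The divergence-and-mask step for the logic-error gate can be sketched as follows; the trace format and function names are illustrative assumptions, not the paper's implementation:

```python
# Hedged sketch of the EGCA idea: compare a candidate's execution trace
# against a canonical reference trace, locate the first step whose
# program state diverges, and keep a nonzero weight only on the token
# span that produced that step (downstream tokens are masked out).

def first_divergence(candidate_trace, reference_trace):
    """Traces are assumed lists of (token_span, state) pairs recorded
    during execution; returns the index of the first differing state."""
    for i, ((_, c_state), (_, r_state)) in enumerate(
            zip(candidate_trace, reference_trace)):
        if c_state != r_state:
            return i
    return None  # no divergence within the compared prefix

def egca_token_mask(candidate_trace, reference_trace, num_tokens):
    """Return a 0/1 weight per token: 1 only on the causal span of the
    first divergent step, 0 everywhere else."""
    div = first_divergence(candidate_trace, reference_trace)
    mask = [0.0] * num_tokens
    if div is None:
        return mask
    start, end = candidate_trace[div][0]  # token span of the causal step
    for t in range(start, end):
        mask[t] = 1.0
    return mask

# Third executed step diverges (x = 9 instead of 3), so only its
# token span (6..9) keeps gradient weight.
cand = [((0, 3), {"x": 1}), ((3, 6), {"x": 2}), ((6, 9), {"x": 9})]
ref = [((0, 3), {"x": 1}), ((3, 6), {"x": 2}), ((6, 9), {"x": 3})]
mask = egca_token_mask(cand, ref, 10)
```

Multiplying this mask into the per-token GRPO advantages concentrates the gradient on the causal span without a learned critic, which matches the summary's description of EGCA as a token-level reweighting of the standard objective.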
Evaluation Highlights
  • +3.1 points pass@1 on HumanEval (82.1% vs. 79.0% for vanilla GRPO) using DeepSeek-Coder-Instruct-6.7B.
  • +1.5 points pass@1 on MBPP (68.9% vs. 67.4% for vanilla GRPO).
  • Outperforms the 1.5B-parameter debugger model itself by +8.2 points, indicating the method extracts localization signals rather than merely distilling the teacher's competence.
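For reference, the pass@1 figures above follow the standard unbiased pass@k estimator; here is a short Python sketch (not code from the paper):

```python
# Unbiased pass@k estimator: given n sampled programs per task, of
# which c pass the unit tests, pass@k = 1 - C(n-c, k) / C(n, k).
# For k = 1 this reduces to the plain pass fraction c / n.
from math import comb

def pass_at_k(n, c, k):
    """Probability that at least one of k samples passes."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 10 samples and 8 passing, pass@1 ≈ 0.8.
p1 = pass_at_k(10, 8, 1)
```

So a reported pass@1 of 82.1% means roughly 82 of every 100 single samples pass the hidden unit tests.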
Breakthrough Assessment
8/10
Elegantly solves the credit assignment problem in RLVR without training expensive critics. The consistent gains over strong baselines and the demonstration that it surpasses its own debugger make it a significant practical advance.