Execution-based Code Generation using Deep Reinforcement Learning

📝 Paper Summary

Code Generation, Reinforcement Learning for Code, Program Synthesis

PPOCoder optimizes code generation models using PPO with a multi-component reward incorporating compiler feedback, AST matching, and Data Flow Graph alignment to ensure syntactic and functional correctness.

Core Problem

Pre-trained language models for code rely on supervised token-matching objectives, which often fail to ensure generated code is compilable or functionally correct.

Why it matters:

Models like CodeBERT and CodeGPT frequently generate non-compilable code (high syntax error rates)
Standard cross-entropy loss does not capture non-differentiable properties like passing unit tests or compiler checks
Existing RL methods for code are often task-specific (e.g., only synthesis) or language-specific (e.g., only Python), lacking a generalizable framework

Concrete Example: A model might generate code that looks correct token-by-token but misses a variable definition or bracket, causing compilation failure. Standard supervised learning penalizes this only slightly (per token), whereas a compiler rejects it entirely.
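The gap between token-level similarity and compilability can be illustrated with a minimal sketch. The two snippets below are invented for illustration and do not come from the paper; Python's built-in `compile()` stands in for a compiler check, and `difflib` stands in for a token-matching score.

```python
import difflib

# Hypothetical reference and model output; they differ by a single character.
reference = "def add(a, b):\n    return (a + b)\n"
candidate = "def add(a, b):\n    return (a + b\n"  # closing parenthesis missing


def compiles(source: str) -> bool:
    """Return True if Python can parse and byte-compile the source."""
    try:
        compile(source, "<generated>", "exec")
        return True
    except SyntaxError:
        return False


# Character-level similarity is a stand-in for a token-matching objective.
similarity = difflib.SequenceMatcher(None, reference, candidate).ratio()

print(f"similarity to reference: {similarity:.2f}")
print("reference compiles:", compiles(reference))
print("candidate compiles:", compiles(candidate))
```

The candidate scores close to 1.0 on surface similarity yet is rejected outright by the compile check, which is the kind of all-or-nothing signal that a per-token loss does not capture.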

Key Novelty

PPOCoder (PPO for Code Generation)

Combines Proximal Policy Optimization (PPO) with a specialized reward function that integrates discrete compiler feedback (pass/fail) with dense structural rewards
Uses Abstract Syntax Tree (AST) matching and Data Flow Graph (DFG) alignment as reward signals to guide syntactic and semantic structure
Replaces standard cross-entropy regularization with a KL-divergence penalty during RL fine-tuning to prevent catastrophic forgetting while reducing memorization
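As a rough illustration of AST matching as a reward signal, the sketch below scores a candidate by the fraction of reference sub-trees it reproduces. It is a simplified stand-in built on Python's standard `ast` module, not the paper's implementation; the function names and the choice to return 0.0 for unparseable code are assumptions.

```python
import ast


def subtree_signatures(source: str) -> list[str]:
    """Dump every AST node, including its children, as a string signature."""
    tree = ast.parse(source)
    return [ast.dump(node) for node in ast.walk(tree)]


def ast_match_score(candidate: str, reference: str) -> float:
    """Fraction of reference sub-trees that also occur in the candidate."""
    try:
        remaining = subtree_signatures(candidate)
    except SyntaxError:
        return 0.0  # assumption: unparseable code earns no structural reward
    ref = subtree_signatures(reference)
    matched = 0
    for signature in ref:
        if signature in remaining:
            remaining.remove(signature)  # each candidate sub-tree matches once
            matched += 1
    return matched / len(ref)


print(ast_match_score("x = a + b", "x = a + b"))  # identical code
print(ast_match_score("x = a - b", "x = a + b"))  # one operator differs
print(ast_match_score("x = (a + b", "x = a + b"))  # does not parse
```

Unlike a pass/fail compiler signal, this score changes gradually as the candidate's structure approaches the reference, which is what makes it usable as a dense reward.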

Architecture

Figure: The RL training loop of PPOCoder, an Actor-Critic setup in which the Policy (Actor) generates code, receives a multi-component reward (Compiler, AST, DFG, KL), and is updated via PPO.

Evaluation Highlights

Increases Code Completion compilation rate from 52.14% (CodeT5) to 97.68% on CodeSearchNet Python
Achieves 82.80% compilation rate on Java Code Translation (XLCoST), improving over the CodeT5 baseline (59.81%) by 22.99 percentage points
Outperforms CodeRL on MBPP zero-shot program synthesis (68.2% vs 63.0% pass@80), showing better generalization

Breakthrough Assessment

8/10

Large improvement in compilability across multiple languages and tasks. Successfully integrates multiple structural signals into RL, addressing a key weakness of pre-trained code models: generated code that fails to compile or run correctly.

⚙️ Technical Details

Problem Definition

Setting: Sequential discrete finite-horizon Markov Decision Process (MDP) for code generation

Inputs: Source data x (Natural Language description or Source Code)

Outputs: Generated code sequence y (target programming language)

Pipeline Flow

Policy Network (Actor) samples code
Environment (Compiler/Tests + Structure Analyzer) calculates reward
Critic Network estimates value
PPO Update
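The four steps above can be wired together roughly as follows. This is only a skeleton under stated assumptions: `toy_actor`, `toy_reward`, and `toy_critic` are hypothetical stand-ins for the policy network, the compiler-based environment, and the value network, and the PPO update itself is left as a comment.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Rollout:
    source: str    # input x
    code: str      # sampled output y
    reward: float  # score from the environment
    value: float   # critic's estimate of the expected return

    @property
    def advantage(self) -> float:
        return self.reward - self.value


def collect_rollout(source: str,
                    actor: Callable[[str], str],
                    reward_fn: Callable[[str, str], float],
                    critic: Callable[[str, str], float]) -> Rollout:
    code = actor(source)              # 1. policy (actor) samples code
    reward = reward_fn(source, code)  # 2. environment calculates the reward
    value = critic(source, code)      # 3. critic estimates the value
    return Rollout(source, code, reward, value)


# Hypothetical stand-ins so the skeleton runs end to end.
def toy_actor(source: str) -> str:
    return "print('hello')"


def toy_reward(source: str, code: str) -> float:
    try:
        compile(code, "<generated>", "exec")
        return 1.0   # assumption: +1 when the code compiles
    except SyntaxError:
        return -1.0  # assumption: -1 when it does not


def toy_critic(source: str, code: str) -> float:
    return 0.25


batch: List[Rollout] = [
    collect_rollout(s, toy_actor, toy_reward, toy_critic)
    for s in ["print a greeting", "say hello"]
]
# 4. A PPO update would consume these advantages to adjust the actor and critic.
print([r.advantage for r in batch])
```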

System Modules

Actor (Policy Network)

Generates code tokens sequentially conditioned on input

Model or implementation: Initialized from pre-trained CodeT5

Reward Computer

Calculates composite reward based on execution and structure

Model or implementation: External tools (Compiler, AST parser, DFG extractor)
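One way to picture the composite reward is as a sum of the compiler signal and the structural scores, minus a scaled KL term. The sketch below is an assumption-laden illustration: the ±1 compiler mapping, the equal weighting of the AST and DFG scores, and `beta=0.1` are hypothetical values, since this summary does not report the coefficients.

```python
def composite_reward(compiled: bool,
                     ast_score: float,
                     dfg_score: float,
                     kl_estimate: float,
                     beta: float = 0.1) -> float:
    """Combine a sparse compiler signal with dense structural scores.

    All weights here are illustrative assumptions, not reported values.
    """
    compiler_term = 1.0 if compiled else -1.0  # assumption: symmetric pass/fail
    return compiler_term + ast_score + dfg_score - beta * kl_estimate


# A compilable sample with good structural overlap and moderate drift.
print(composite_reward(True, ast_score=0.8, dfg_score=0.5, kl_estimate=2.0))
# The same structural scores, but the sample fails to compile.
print(composite_reward(False, ast_score=0.8, dfg_score=0.5, kl_estimate=2.0))
```

The structural terms keep the reward informative even when two samples share the same pass/fail outcome.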

Critic (Value Network)

Estimates expected return to compute advantage

Model or implementation: Dense value head on top of PL model hidden states
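To make the critic's role concrete, the sketch below computes per-token advantages as the discounted return minus the critic's value estimate. It is a simplified illustration rather than the paper's estimator; `gamma=1.0` and the example numbers are assumptions, since the discount rate is not reported.

```python
from typing import List


def discounted_returns(rewards: List[float], gamma: float = 1.0) -> List[float]:
    """Return-to-go at each step, accumulated from the end of the sequence."""
    returns: List[float] = []
    running = 0.0
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


def advantages(rewards: List[float], values: List[float],
               gamma: float = 1.0) -> List[float]:
    """Advantage = observed return minus the critic's predicted return."""
    returns = discounted_returns(rewards, gamma)
    return [g - v for g, v in zip(returns, values)]


# Hypothetical three-token episode with a reward only at the final token.
rewards = [0.0, 0.0, 1.0]
values = [0.5, 0.6, 0.9]  # critic estimates at each token
print(discounted_returns(rewards))
print(advantages(rewards, values))
```

A positive advantage means the sampled continuation did better than the critic expected, so the PPO update raises its probability.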

Novel Architectural Elements

Integration of structural rewards (AST sub-tree matching and DFG edge matching) directly into the PPO reward function
Reward formulation combining discrete compiler signals (sparse) with dense structural matching scores and KL-penalty
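DFG edge matching can be sketched in the same spirit. The example below extracts a very coarse set of data-flow edges (which variables each assignment reads) using Python's `ast` module and scores the overlap with a reference. It is a deliberately simplified assumption: a real DFG extractor tracks variable positions and handles far more statement types, and the function names here are hypothetical.

```python
import ast
from typing import Set, Tuple


def dataflow_edges(source: str) -> Set[Tuple[str, str]]:
    """Collect (target, read) pairs: `target` is assigned from a value using `read`."""
    edges: Set[Tuple[str, str]] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Assign):
            reads = {n.id for n in ast.walk(node.value) if isinstance(n, ast.Name)}
            for target in node.targets:
                if isinstance(target, ast.Name):
                    edges.update((target.id, read) for read in reads)
    return edges


def dfg_match_score(candidate: str, reference: str) -> float:
    """Fraction of reference data-flow edges that the candidate reproduces."""
    ref_edges = dataflow_edges(reference)
    if not ref_edges:
        return 0.0
    try:
        cand_edges = dataflow_edges(candidate)
    except SyntaxError:
        return 0.0  # assumption: unparseable code earns no data-flow reward
    return len(ref_edges & cand_edges) / len(ref_edges)


reference = "total = price * qty\ntax = total * rate\n"
candidate = "total = price * qty\ntax = price * rate\n"  # tax reads the wrong variable
print(sorted(dataflow_edges(reference)))
print(dfg_match_score(candidate, reference))
```

Both snippets compile and have the same tree shape, so only the data-flow comparison notices that `tax` no longer depends on `total`.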

Modeling

Base Model: CodeT5 (220M and 770M variants used)

Training Method: Proximal Policy Optimization (PPO)

Objective Functions:

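Purpose: Maximize expected reward while staying close to the old policy.

Formally: L_CPI = E[min(ratio * A, clip(ratio, 1-eps, 1+eps) * A)], where ratio is the probability ratio between the new and old policy, A is the advantage estimate, and eps is the clip range.

Purpose: Minimize error in value estimation.

Formally: L_VF = (V - V_target)^2, where V is the critic's value estimate.

Purpose: Prevent deviation from pre-trained priors.

Formally: R_kl = log(pi) - log(rho), where pi is the policy being fine-tuned and rho is the pre-trained model.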
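The three formulas can be checked on single-sample numbers with a small sketch. It evaluates each expression for one token rather than taking an expectation over a batch, and `eps=0.2` is an assumed value, since the clip range is not reported.

```python
import math


def clipped_surrogate(logp_new: float, logp_old: float,
                      advantage: float, eps: float = 0.2) -> float:
    """min(ratio * A, clip(ratio, 1-eps, 1+eps) * A) for a single sample."""
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)


def value_loss(value: float, value_target: float) -> float:
    """Squared error between the critic's estimate and its target."""
    return (value - value_target) ** 2


def kl_term(logp_policy: float, logp_reference: float) -> float:
    """Per-token log-ratio between the tuned policy and the pre-trained model."""
    return logp_policy - logp_reference


# The new policy raises a token's probability from 0.4 to 0.6 (ratio 1.5).
logp_old, logp_new = math.log(0.4), math.log(0.6)
print(clipped_surrogate(logp_new, logp_old, advantage=2.0))   # gain capped by the clip
print(clipped_surrogate(logp_new, logp_old, advantage=-2.0))  # penalty not capped
print(value_loss(0.5, 1.0))
print(kl_term(logp_new, logp_old))
```

With a positive advantage the clip caps how much credit a large probability increase can earn, while with a negative advantage the `min` keeps the full penalty; that asymmetry is what discourages large policy steps.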

Training Data:

CodeSearchNet (Python) for Code Completion
XLCoST (6 languages) for Code Translation
APPS (Python) for Program Synthesis

Key Hyperparameters:

clip_range_epsilon: Not explicitly reported in the paper
kl_penalty_beta: Not explicitly reported in the paper
discount_rate_gamma: Not explicitly reported in the paper
top_k_sampling: 5
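For reference, top-k sampling with k = 5 restricts each decoding step to the five highest-scoring tokens and samples among them in proportion to their softmax weights. The sketch below is a generic illustration over a hypothetical toy vocabulary, not the paper's decoding code.

```python
import math
import random
from typing import Dict, Optional


def top_k_sample(logits: Dict[str, float], k: int = 5,
                 rng: Optional[random.Random] = None) -> str:
    """Sample one token from the k highest-logit candidates."""
    chooser = rng if rng is not None else random
    top = sorted(logits.items(), key=lambda item: item[1], reverse=True)[:k]
    max_logit = top[0][1]
    # Softmax weights over the surviving tokens (shifted for numerical stability).
    weights = [math.exp(logit - max_logit) for _, logit in top]
    tokens = [token for token, _ in top]
    return chooser.choices(tokens, weights=weights, k=1)[0]


# Hypothetical next-token logits over a seven-token vocabulary.
logits = {"return": 2.0, "print": 1.5, "x": 1.0, "if": 0.5,
          "for": 0.0, "while": -3.0, "del": -5.0}
rng = random.Random(0)
samples = [top_k_sample(logits, k=5, rng=rng) for _ in range(200)]
print(sorted(set(samples)))
```

The two lowest-scoring tokens can never be drawn, which trims the low-probability tail while still leaving the policy room to explore during rollouts.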

Compute: Not reported in the paper

Comparison to Prior Work

vs. CodeRL: Uses PPO (stable trust region) vs. REINFORCE; includes AST/DFG structural rewards vs. only unit tests; uses KL penalty vs. Cross-Entropy regularization
vs. CompCoder: Uses RL for optimization vs. compiler feedback for iterative repair/filtering [not cited in paper]
vs. CodeT5 (fine-tuned): Adds execution-based RL optimization vs. supervised fine-tuning only

Limitations

Computationally expensive due to RL optimization loop (sampling + execution + updates)
Requires compilable/executable environment and test cases (or parallel executable targets) for rewards
May not improve metrics not directly targeted by RL (e.g., readability) unless correlated with structure

Reproducibility

Code: https://github.com/reddy-lab-code-research/PPOCoder

Specific hyperparameters such as learning rate, clip epsilon, and KL beta are not explicitly detailed in the text.

📊 Experiments & Results

Evaluation Setup

Code generation tasks across multiple languages

Benchmarks:

CodeSearchNet: Code Completion (Python)
XLCoST: Code Translation (7 language pairs)
APPS: Program Synthesis (Python)
MBPP: Program Synthesis (zero-shot transfer)

Metrics:

Compilation Rate
Exact Match (xMatch)
CodeBLEU
pass@k (Functional Correctness)
Statistical methodology: Not explicitly reported in the paper
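Two of these metrics are simple to compute once per-sample outcomes are known. The sketch below shows compilation rate as a percentage and the commonly used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), for n samples of which c pass all tests. Whether the paper uses exactly this estimator is an assumption; the summary does not say.

```python
from math import comb
from typing import List


def compilation_rate(compiled_flags: List[bool]) -> float:
    """Percentage of generated samples that compile."""
    return 100.0 * sum(compiled_flags) / len(compiled_flags)


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples passes all tests.

    n: samples generated per problem, c: samples that pass, k: budget evaluated.
    """
    if n - c < k:
        return 1.0  # every size-k subset must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)


print(compilation_rate([True, True, False, True]))
print(pass_at_k(n=10, c=1, k=1))
print(pass_at_k(n=10, c=1, k=5))
```

The per-problem estimates are then averaged over the benchmark to give the reported pass@k.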

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
CodeSearchNet (Completion)	Compilation Rate	52.14	97.68	+45.54
XLCoST (Translation to Java)	Compilation Rate	59.81	82.80	+22.99
XLCoST (Translation to Python)	Compilation Rate	74.11	88.72	+14.61
XLCoST (Translation to C++)	Compilation Rate	80.17	81.14	+0.97
APPS (Synthesis)	pass@1 (All)	1.30	1.74	+0.44
MBPP	pass@80	63.0	68.2	+5.2

Δ is the absolute difference (This Paper minus Baseline). The baseline is CodeT5 for the CodeSearchNet and XLCoST-to-Java rows and CodeRL for the MBPP row.

Code Translation results on XLCoST show consistent improvements in compilation rate across multiple target languages.

Main Takeaways

PPOCoder achieves higher compilation rates than supervised baselines across C++, Java, Python, C#, PHP, and C; in the rows reported above, the gain ranges from +0.97 points (translation to C++) to +45.54 points (Python completion).
Structural rewards (AST and DFG) are crucial; ablation shows combining them yields the best performance compared to using compiler signal alone.
Using KL-divergence penalty instead of cross-entropy loss during RL fine-tuning reduces memorization and improves zero-shot transfer (demonstrated on MBPP).
PPO optimization proves more stable and effective than REINFORCE-style actor-critic training without a trust region in ablation studies.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (PPO, Actor-Critic)
Code representations (AST, DFG)
Transformer-based Language Models (CodeT5)

Key Terms

PPO: Proximal Policy Optimization—an RL algorithm that optimizes a policy using a clipped objective function to ensure stability

AST: Abstract Syntax Tree—a tree representation of the abstract syntactic structure of source code

DFG: Data Flow Graph—a graph representation showing dependencies between variables in code

pass@k: Metric calculating the probability that at least one of k generated code samples passes all unit tests

CodeBLEU: A metric for code evaluation that combines BLEU (n-gram match) with weighted n-grams, AST match, and data-flow match

KL divergence: Kullback-Leibler divergence—used here as a penalty to prevent the RL-tuned model from deviating too far from the pre-trained base model

Actor-Critic: RL architecture with an Actor (policy) network that generates actions and a Critic (value) network that estimates expected returns