RLEF: Reinforcement Learning with Execution Feedback—the proposed method, which trains LLMs with RL to use execution results (error messages, test outputs) to iteratively repair their code
PPO: Proximal Policy Optimization—an RL algorithm used here to fine-tune the LLM policy
pass@k: A metric measuring the probability that at least one of k generated samples is correct
n@k: Average solve rate: the probability that at least one of n solutions, selected from k generated samples, is correct
CodeContests: A challenging competitive programming dataset with private test cases used for evaluation
public tests: Test cases visible to the model during the iterative repair loop; their results provide the execution feedback
private tests: Held-out test cases used only for final reward calculation and evaluation, ensuring the model doesn't just overfit to specific inputs
KL penalty: A regularization term preventing the RL-tuned model from deviating too far from the original reference model distribution
SFT: Supervised Fine-Tuning—fine-tuning a model on demonstration data with a standard supervised (next-token) loss
Instruct model: An LLM fine-tuned to follow instructions, used here as the initialization for RLEF
rollout: One complete episode of interaction (generating code, getting feedback, generating again) up to the turn limit
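The pass@k metric above is usually computed with the standard unbiased estimator: generate n samples, count the c correct ones, and estimate the probability that a random subset of k samples contains at least one correct solution. A minimal sketch (function name is illustrative):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: the probability that at least one of k
    samples, drawn without replacement from n generations of which c are
    correct, is correct. Computed as 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # fewer incorrect samples than k: some draw must be correct
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 3 of 10 samples correct, pass@1 is simply 3/10.
print(pass_at_k(10, 3, 1))  # → 0.3
```

Averaging this quantity over all problems in the benchmark gives the reported pass@k score.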
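The KL penalty entry can be made concrete: a common way to apply it is to subtract a scaled estimate of the KL divergence between the policy and the reference model from the task reward. The sketch below uses the simple token-level approximation log π - log π_ref; the function name and the beta value are illustrative, not values from the paper:

```python
def kl_penalized_reward(task_reward: float,
                        logp_policy: list[float],
                        logp_ref: list[float],
                        beta: float = 0.05) -> float:
    """Sequence-level reward with a KL penalty toward the reference model.
    Each list holds per-token log-probabilities of the sampled sequence
    under the RL-tuned policy and the frozen reference model, respectively."""
    # Monte-Carlo estimate of sequence KL: sum of (log pi - log pi_ref) per token.
    kl = sum(lp - lr for lp, lr in zip(logp_policy, logp_ref))
    return task_reward - beta * kl
```

If the policy drifts toward tokens the reference model finds unlikely, the KL term grows and the effective reward shrinks, which keeps the tuned model close to the original distribution.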
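The rollout structure ties several of these terms together: the model generates code, public tests produce feedback, and private tests determine the final reward. A minimal sketch, assuming a binary reward and a `generate` callable that stands in for the LLM (the helper names and the turn limit are illustrative):

```python
import subprocess
import sys

def run_tests(code: str, tests: list[tuple[str, str]]) -> list[str]:
    """Execute code on (stdin, expected stdout) pairs; return failure messages."""
    failures = []
    for stdin, expected in tests:
        proc = subprocess.run([sys.executable, "-c", code], input=stdin,
                              capture_output=True, text=True, timeout=10)
        if proc.returncode != 0:
            failures.append(f"input {stdin!r}: runtime error: {proc.stderr.strip()}")
        elif proc.stdout.strip() != expected.strip():
            failures.append(f"input {stdin!r}: expected {expected!r}, "
                            f"got {proc.stdout.strip()!r}")
    return failures

def rollout(generate, problem: str, public_tests, private_tests, max_turns: int = 3):
    """One episode: generate code, check it on public tests, feed failures
    back to the model, and repeat up to the turn limit; the scalar reward
    is computed from the held-out private tests only."""
    history = [problem]
    code = ""
    for _ in range(max_turns):
        code = generate(history)  # `generate` stands in for the LLM policy
        failures = run_tests(code, public_tests)
        if not failures:
            break  # public tests pass; stop iterating early
        history.append("Public tests failed:\n" + "\n".join(failures))
    reward = 1.0 if not run_tests(code, private_tests) else 0.0
    return code, reward
```

For example, a toy "print the double of an integer" problem with public test `("3", "6")` and private tests `("5", "10"), ("0", "0")` yields reward 1.0 as soon as the model emits `print(int(input()) * 2)`. Keeping the reward tied to private tests is what prevents the policy from merely hard-coding the visible inputs.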