AceCode-87K: The curated dataset of 87K coding questions and 1.38M validated test cases created by this paper
Bradley-Terry loss: A probabilistic model used to train reward models by predicting the probability that one response is preferred over another based on their score difference
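A minimal sketch of how the Bradley-Terry loss turns a score difference into a preference probability (function names are illustrative, not from the paper):

```python
import math

def bradley_terry_loss(score_chosen: float, score_rejected: float) -> float:
    """Negative log-likelihood that the chosen response beats the rejected one.

    The Bradley-Terry model sets P(chosen > rejected) =
    sigmoid(score_chosen - score_rejected), so the training loss is
    -log sigmoid(score_chosen - score_rejected).
    """
    diff = score_chosen - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-diff)))
```

A wider positive score gap yields a smaller loss, pushing the reward model to score preferred responses higher.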
Best-of-N sampling: A test-time inference strategy where N solutions are generated, and a reward model selects the best one
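Best-of-N reduces to "generate N, keep the highest-scoring"; a minimal sketch with placeholder callables for the generator and reward model:

```python
def best_of_n(generate, reward_model, prompt, n=8):
    """Sample n candidate solutions for a prompt, then return the one
    the reward model scores highest. `generate` and `reward_model` are
    assumed callables standing in for the policy and the trained RM."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=reward_model)
```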
Reinforce++: A variant of the REINFORCE algorithm that eliminates the need for a separate value model during RL, using KL-divergence and rewards directly for advantage estimation
PPO: Proximal Policy Optimization—an RL algorithm that updates policies within a trust region to ensure stability
Pass@k: A metric measuring the fraction of problems for which at least one of k sampled solutions is correct
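Pass@k is usually computed with the standard unbiased estimator: from n samples of which c pass, it gives the probability that a random subset of k contains at least one correct solution (a sketch; this estimator is the conventional one, not stated in the glossary itself):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k),
    the chance that k samples drawn from n (with c correct)
    include at least one correct solution."""
    if n - c < k:
        # Every size-k subset must contain a correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```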
SFT: Supervised Fine-Tuning—training a model on labeled (question, code) pairs
KL-divergence: A statistical measure of how one probability distribution differs from a second, reference probability distribution
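For discrete distributions, the definition above reduces to a short sum; a minimal sketch (in nats, over probability vectors):

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) = sum_x p(x) * log(p(x) / q(x)).

    Zero when the distributions match; grows as P places mass
    where Q does not. Terms with p(x) = 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```

In RL fine-tuning, this quantity is typically used as a penalty keeping the updated policy close to the reference model.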