ACECode: A Reinforcement Learning Framework for Aligning Code Efficiency and Correctness in Code Language Models

📝 Paper Summary

Code Generation, LLM Alignment

ACECode aligns CodeLLMs to generate both efficient and correct code using a reinforcement learning framework driven by a training-free reward signal derived directly from code execution runtimes and test cases.

Core Problem

While CodeLLMs generate functionally correct code, the resulting solutions are often highly inefficient (3-13x slower than human code), and existing optimization methods either require complex execution environments during inference or sacrifice correctness for speed.

Why it matters:

Inefficient code hinders software performance and competitiveness, especially in resource-constrained environments like mobile devices and IoT systems.
Optimizing code efficiency contributes to environmental sustainability by reducing energy consumption and carbon footprints of software products.
Existing solutions like SOAP double inference time due to iterative execution, while PIE sacrifices correctness (functional accuracy) to achieve efficiency gains.

Concrete Example: A CodeLLM might generate a functionally correct sorting algorithm that is significantly slower (e.g., O(n^2)) than a human-written reference (e.g., O(n log n)). Previous methods like PIE might tune the model to produce the faster algorithm but introduce bugs that fail edge cases, whereas SOAP would require running the slow code first to get feedback.

Key Novelty

ACECode (Aligning Code Correctness and Efficiency)

Introduces a training-free reward mechanism that uses actual execution feedback (compiler status, test pass rate, and runtime comparison vs. reference) instead of a learned reward model.
Uses a step-function reward design that penalizes incorrect code while adaptively rewarding correct code based on how much faster it is compared to a human-written reference.
Optimizes the CodeLLM via PPO (Proximal Policy Optimization) to simultaneously maximize correctness and efficiency without requiring test cases during the final inference stage.
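
The step-function idea above can be illustrated with a minimal sketch. The summary does not give the exact formula, so the function name `step_reward`, the penalty values (-1.0 and -0.5), the base reward of 1.0, and the way `k` scales the efficiency gain are all assumptions made for illustration, not the paper's definition.

```python
def step_reward(compiled: bool, pass_rate: float, runtime: float,
                ref_runtime: float, k: float = 0.5) -> float:
    """Hypothetical step-function reward; every constant here is an assumption."""
    if not compiled:
        return -1.0  # assumed penalty: code does not compile or cannot run
    if pass_rate < 1.0:
        return -0.5  # assumed penalty: code runs but fails at least one test
    if runtime <= 0 or ref_runtime <= 0:
        raise ValueError("runtimes must be positive")
    # Fraction of the reference runtime that was saved; negative if slower.
    gain = (ref_runtime - runtime) / ref_runtime
    # Clamp so a very slow solution cannot drop below the incorrect-code tiers.
    gain = max(-1.0, min(1.0, gain))
    return 1.0 + k * gain
```

With these illustrative constants, any correct solution scores at least 0.5, so it always outranks an incorrect one, and faster solutions score higher within the correct tier.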

Architecture

The architecture of ACECode, illustrating the interaction between the Actor LLM, Critic LLM, and the Rewarder.

Evaluation Highlights

Improves pass@1 (correctness) by 1.84% to 14.51% compared to original CodeLLMs across four state-of-the-art models.
Reduces runtime in 65% to 72% of generated solutions compared to original CodeLLMs.
Outperforms the PIE baseline (instruction tuning for efficiency) by up to 14.41% in pass@1 and 11.45% in average execution time.

Breakthrough Assessment

8/10

Significantly advances code generation by optimizing two competing objectives (speed and correctness) together, without requiring inference-time execution or labeled preference datasets.

⚙️ Technical Details

Problem Definition

Setting: Sequence-to-sequence generation in which a natural language instruction I is mapped to code C

Inputs: Natural language instruction I

Outputs: Code snippets C that are both functionally correct (Gc) and efficient (Ge)

Pipeline Flow

Code Generation: Actor LLM generates N code solutions for a prompt
Execution & Evaluation: Execute code on test cases to get correctness status and runtime
Reward Calculation: Compute scalar reward based on correctness and runtime ratio vs. reference
Policy Optimization: Update Actor LLM using PPO based on rewards
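
As a rough illustration of the Execution & Evaluation step, the sketch below runs one generated solution against input/output test cases and reports whether it loaded and what fraction of tests passed. The function name `evaluate_candidate`, the `entry_point` argument, and the test-case layout are hypothetical. A real training setup would presumably isolate execution in a sandboxed subprocess with timeouts, which this sketch omits.

```python
from typing import Any, List, Tuple


def evaluate_candidate(source: str, entry_point: str,
                       tests: List[Tuple[tuple, Any]]) -> Tuple[bool, float]:
    """Return (compiled, pass_rate) for one generated solution (illustrative)."""
    namespace: dict = {}
    try:
        # Compile and load the candidate; syntax or import errors land here.
        exec(compile(source, "<candidate>", "exec"), namespace)
        func = namespace[entry_point]
    except Exception:
        return False, 0.0

    passed = 0
    for args, expected in tests:
        try:
            if func(*args) == expected:
                passed += 1
        except Exception:
            pass  # a crash on one test counts as a failure for that test only

    pass_rate = passed / len(tests) if tests else 0.0
    return True, pass_rate
```

The two returned values are the kind of correctness signal a reward function would consume; runtime is measured separately.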

System Modules

Actor LLM

Generate code snippets based on natural language instructions

Model or implementation: Various CodeLLMs (e.g., Code Alpaca, WizardCoder)

Rewarder

Calculate reward signal based on execution feedback

Model or implementation: Deterministic function (Training-free)

Critic LLM

Estimate value function to stabilize PPO training

Model or implementation: Initialized from Actor LLM

Novel Architectural Elements

Integration of a training-free, execution-based reward function directly into the PPO loop for CodeLLMs
Step-function reward design explicitly coupling runtime efficiency ratios with functional correctness checks

Modeling

Base Model: Four state-of-the-art CodeLLMs (e.g., Code Alpaca, WizardCoder); the full list is not given here

Training Method: Reinforcement Learning (PPO)

Objective Functions:

Purpose: Maximize expected reward of generating correct and efficient code.

Formally: Maximize E[R(Ge(C), Gc(C))]

Purpose: PPO Loss to update policy while preventing large deviations.

Formally: L_PPO(theta) = E[min(r_t(theta)A_t, clip(...)A_t)]
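
For readers unfamiliar with the clipped objective, here is a small sketch of the standard PPO clipped surrogate computed over a batch of per-token values. It follows the general PPO formulation rather than anything specific to ACECode. The clipping range `epsilon=0.2` is a common default and an assumption here, and the function and argument names are hypothetical.

```python
import math
from typing import Sequence


def ppo_clipped_objective(new_logprobs: Sequence[float],
                          old_logprobs: Sequence[float],
                          advantages: Sequence[float],
                          epsilon: float = 0.2) -> float:
    """Mean clipped surrogate objective (to be maximized)."""
    if not advantages:
        raise ValueError("need at least one sample")
    total = 0.0
    for lp_new, lp_old, adv in zip(new_logprobs, old_logprobs, advantages):
        # Probability ratio r_t(theta) between the new and old policy.
        ratio = math.exp(lp_new - lp_old)
        # Restrict the ratio to [1 - epsilon, 1 + epsilon].
        clipped = max(1.0 - epsilon, min(1.0 + epsilon, ratio))
        # Take the more pessimistic of the unclipped and clipped terms.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)
```

Taking the minimum means the policy gains nothing from moving the ratio beyond the clip range in the direction the advantage favors, which is what keeps each update bounded.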

Key Hyperparameters:

temperature: 0.85 (during generation for diversity)
penalty_factor_k: Controls sensitivity to efficiency gain (formula parameter)
n_min: Minimum executions for runtime stability
t_max: Maximum allowable accumulated runtime
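
The roles of n_min and t_max can be sketched with Python's `timeit`. The summary does not spell out how the two interact, so the loop below encodes one plausible reading as an assumption: always execute at least `n_min` times, and keep repeating until the accumulated runtime reaches `t_max` seconds. The function name `measure_runtime` and the default values are hypothetical.

```python
import timeit
from typing import Callable


def measure_runtime(func: Callable[[], object],
                    n_min: int = 5, t_max: float = 1.0) -> float:
    """Mean wall-clock seconds per call under the assumed stopping rule."""
    if n_min < 1:
        raise ValueError("n_min must be at least 1")
    runs = 0
    total = 0.0
    # Assumed rule: at least n_min runs, then stop once t_max seconds accumulate.
    while runs < n_min or total < t_max:
        total += timeit.timeit(func, number=1)  # time a single execution
        runs += 1
    return total / runs
```

Under this reading, fast solutions are averaged over many repetitions while slow ones stop soon after the minimum count, which reduces measurement noise without letting any single evaluation run much longer than needed.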

Compute: Not reported in the paper

Comparison to Prior Work

vs. SOAP: ACECode removes the need for execution/test cases during inference; SOAP requires them.
vs. PIE: ACECode optimizes for *both* correctness and efficiency via RL; PIE optimizes efficiency via instruction tuning and often degrades correctness.
vs. RLHF (Standard): ACECode uses a training-free, objective execution reward; Standard RLHF requires training a reward model on human-labeled data.

Limitations

Depends on the availability of ground-truth test cases and reference solutions during training.
Runtime measurements can still be noisy despite using 'timeit' and repeated execution.
Requires a compilable/executable environment during the training phase (though not inference).

Reproducibility

Code availability is not stated in the text. The paper extends EffiBench with 10 additional ground-truth solutions per task and uses Python's 'timeit' for runtime measurement.

📊 Experiments & Results

Evaluation Setup

Code generation on efficiency benchmarks

Benchmarks:

EffiBench (Extended) (Efficient Code Generation) [New]

Metrics:

pass@1 (Functional Correctness)
Execution Runtime (Efficiency)
Runtime Reduction Rate
Statistical methodology: Not explicitly reported in the paper
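
The two headline metrics can be computed as in the sketch below. pass@1 follows the definition given under Key Terms. The reading of Runtime Reduction Rate as the share of solutions that run faster than the original model's solution is an assumption based on the evaluation highlights, and both function names are hypothetical.

```python
from typing import Sequence


def pass_at_1(first_sample_correct: Sequence[bool]) -> float:
    """Percentage of problems whose first generated solution is correct."""
    return 100.0 * sum(first_sample_correct) / len(first_sample_correct)


def runtime_reduction_rate(baseline_runtimes: Sequence[float],
                           tuned_runtimes: Sequence[float]) -> float:
    """Percentage of paired solutions where the tuned model is strictly faster."""
    pairs = list(zip(baseline_runtimes, tuned_runtimes))
    faster = sum(1 for base, tuned in pairs if tuned < base)
    return 100.0 * faster / len(pairs)


# Example: 3 of 4 first samples are correct; 2 of 4 solutions got faster.
print(pass_at_1([True, False, True, True]))
print(runtime_reduction_rate([1.0, 2.0, 3.0, 4.0], [0.5, 2.5, 2.0, 4.0]))
```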

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
ACECode consistently improves functional correctness (pass@1) compared to original, instruction-tuned, and PIE-tuned baselines.
EffiBench (Extended)	pass@1 improvement vs. Original (%)	0.0	14.51	+14.51
EffiBench (Extended)	pass@1 improvement vs. Instruction Tuning (%)	0.0	51.15	+51.15
EffiBench (Extended)	pass@1 improvement vs. PIE (%)	0.0	14.41	+14.41
ACECode significantly reduces the execution time of generated code compared to baselines.
EffiBench (Extended)	Runtime Reduction Frequency (% of generated solutions)	0	72	+72
EffiBench (Extended)	Average Execution Time improvement vs. Instruction Tuning (%)	0.0	23.18	+23.18
EffiBench (Extended)	Average Execution Time improvement vs. PIE (%)	0.0	11.45	+11.45

Main Takeaways

ACECode addresses the trade-off between correctness and efficiency by improving both simultaneously, unlike PIE, which sacrifices correctness.
The training-free reward mechanism successfully guides the model without needing expensive human annotation.
In-context learning with formatted examples helps the model produce output in the form that the reward calculation steps expect.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning from Human Feedback (RLHF)
Proximal Policy Optimization (PPO)
Code generation benchmarks (HumanEval, EffiBench)
LLM fine-tuning techniques

Key Terms

CodeLLMs: Large Language Models specifically fine-tuned on code datasets to perform programming tasks

RLHF: Reinforcement Learning from Human Feedback, a method that aligns LLM outputs with human preferences by optimizing against a reward model trained on human-labeled data

PPO: Proximal Policy Optimization—a reinforcement learning algorithm that updates model policies in stable, bounded steps

ACECode: Aligning Code Correctness and Efficiency—the proposed framework using RL and execution feedback

EffiBench: A benchmark dataset designed to evaluate the execution efficiency of code generated by LLMs

pass@1: A metric measuring the percentage of problems where the first generated code solution is functionally correct

Actor-Critic: An RL architecture where the 'Actor' generates actions (code) and the 'Critic' estimates the value of those actions to guide training

PIE: A baseline method that improves code efficiency via instruction tuning on a dataset of efficient code snippets

SOAP: A baseline method that uses a two-stage inference process with execution feedback to optimize code

Instruction Tuning: Fine-tuning LLMs on datasets of (instruction, response) pairs to improve their ability to follow tasks