RLVR: Reinforcement Learning from Verifiable Rewards—using binary pass/fail signals from code execution as rewards.
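A minimal sketch of such a verifiable reward, assuming the candidate solution and its unit tests are plain Python strings (real systems run this in a sandbox; `verifiable_reward` and the toy `add` examples are illustrative, not from the source):

```python
def verifiable_reward(solution_code: str, test_code: str) -> float:
    """Return 1.0 if the solution passes all tests, else 0.0."""
    namespace: dict = {}
    try:
        exec(solution_code, namespace)  # define the candidate's functions
        exec(test_code, namespace)      # assertions raise on failure
        return 1.0
    except Exception:
        return 0.0

good = "def add(a, b):\n    return a + b"
bad = "def add(a, b):\n    return a - b"
tests = "assert add(2, 3) == 5"
print(verifiable_reward(good, tests))  # 1.0
print(verifiable_reward(bad, tests))   # 0.0
```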
Best-of-N (BoN): A test-time scaling strategy where N solutions are generated and the best one is selected based on a ranking method.
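The selection step can be sketched as an argmax over N samples; here `generate` and `score` are hypothetical stand-ins for a sampled model call and a ranking method such as a reward model:

```python
def best_of_n(generate, score, n: int):
    """Draw n candidates and keep the one the scorer ranks highest."""
    candidates = [generate() for _ in range(n)]
    return max(candidates, key=score)

# Toy example: "generation" yields numbers; the scorer prefers larger ones.
vals = iter([3, 9, 1])
print(best_of_n(lambda: next(vals), lambda x: x, n=3))  # 9
```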
GRPO: Group Relative Policy Optimization—an RL algorithm that normalizes advantages within a group of samples for the same input to reduce variance.
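The group normalization can be sketched numerically: for one prompt, each sampled completion's reward is centered by the group mean and scaled by the group standard deviation (a common formulation; the helper name is mine):

```python
def group_relative_advantages(rewards, eps=1e-8):
    """Normalize a group's rewards to zero-mean, unit-variance advantages."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Binary pass/fail rewards for 4 samples of the same prompt:
advs = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
# mean 0.5, std 0.5 → advantages ≈ [1, -1, 1, -1]
```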
Bradley-Terry loss: A loss function used to train reward models by maximizing the likelihood of the preferred response having a higher score than the rejected one.
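In its standard pairwise form this is -log σ(s_chosen − s_rejected), minimized when the preferred response scores higher; a small sketch:

```python
import math

def bradley_terry_loss(score_chosen: float, score_rejected: float) -> float:
    """Negative log-likelihood that the chosen response beats the rejected one."""
    margin = score_chosen - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# A larger margin in favor of the chosen response gives a smaller loss:
print(bradley_terry_loss(2.0, 0.0))  # ~0.127
print(bradley_terry_loss(0.0, 2.0))  # ~2.127
```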
Reward Hacking: When an RL policy exploits flaws in the reward model to get high scores without actually improving task performance.
AST: Abstract Syntax Tree—a tree representation of the syntactic structure of source code, used here to verify code validity.
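A validity check of this kind can be as simple as attempting a parse, since `ast.parse` succeeds only on syntactically valid Python (the helper name is illustrative):

```python
import ast

def is_valid_python(source: str) -> bool:
    """True iff the source parses into an AST without a SyntaxError."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False

print(is_valid_python("def f(x): return x + 1"))  # True
print(is_valid_python("def f(x) return x + 1"))   # False (missing colon)
```

Note this checks syntax only, not whether the code runs or is correct.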
On-policy rollouts: Data generated by the current version of the policy model during training, as opposed to static offline data.
Test-Time Scaling (TTS): Techniques applied during inference (like generating multiple samples) to improve performance without retraining the model.