RLEF: Reinforcement Learning with Execution Feedback—the proposed method, which trains LLMs with RL to use execution results (error messages, test outputs) to iteratively repair their code
PPO: Proximal Policy Optimization—an RL algorithm used here to fine-tune the LLM policy
pass@k: A metric measuring the probability that at least one of k generated samples is correct
n@k: Average solve rate: the probability that at least one of n solutions, selected from k generated samples, is correct
CodeContests: A challenging competitive programming dataset with private test cases used for evaluation
public tests: Test cases visible to the model during the iterative repair loop; their results provide the execution feedback
private tests: Held-out test cases used only for final reward calculation and evaluation, ensuring the model doesn't just overfit to specific inputs
KL penalty: A regularization term preventing the RL-tuned model from deviating too far from the original reference model distribution
SFT: Supervised Fine-Tuning—fine-tuning a model on demonstration data with a standard supervised (next-token) loss
Instruct model: An LLM fine-tuned to follow instructions, used here as the initialization for RLEF
rollout: One complete episode of interaction (generating code, getting feedback, generating again) up to the turn limit
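The pass@k metric above is usually computed with the standard unbiased estimator: generate n samples, count the c correct ones, and estimate the probability that a random subset of k samples contains at least one correct solution. A minimal sketch (function name is illustrative):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: the probability that at least one of k
    samples, drawn without replacement from n generations of which c are
    correct, is correct. Computed as 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # fewer incorrect samples than k: some draw must be correct
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 3 of 10 samples correct, pass@1 is simply 3/10.
print(pass_at_k(10, 3, 1))  # → 0.3
```

Averaging this quantity over all problems in the benchmark gives the reported pass@k score.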
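The KL penalty entry can be made concrete: a common way to apply it is to subtract a scaled estimate of the KL divergence between the policy and the reference model from the task reward. The sketch below uses the simple token-level approximation log π - log π_ref; the function name and the beta value are illustrative, not values from the paper:

```python
def kl_penalized_reward(task_reward: float,
                        logp_policy: list[float],
                        logp_ref: list[float],
                        beta: float = 0.05) -> float:
    """Sequence-level reward with a KL penalty toward the reference model.
    Each list holds per-token log-probabilities of the sampled sequence
    under the RL-tuned policy and the frozen reference model, respectively."""
    # Monte-Carlo estimate of sequence KL: sum of (log pi - log pi_ref) per token.
    kl = sum(lp - lr for lp, lr in zip(logp_policy, logp_ref))
    return task_reward - beta * kl
```

If the policy drifts toward tokens the reference model finds unlikely, the KL term grows and the effective reward shrinks, which keeps the tuned model close to the original distribution.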
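The rollout structure ties several of these terms together: the model generates code, public tests produce feedback, and private tests determine the final reward. A minimal sketch, assuming a binary reward and a `generate` callable that stands in for the LLM (the helper names and the turn limit are illustrative):

```python
import subprocess
import sys

def run_tests(code: str, tests: list[tuple[str, str]]) -> list[str]:
    """Execute code on (stdin, expected stdout) pairs; return failure messages."""
    failures = []
    for stdin, expected in tests:
        proc = subprocess.run([sys.executable, "-c", code], input=stdin,
                              capture_output=True, text=True, timeout=10)
        if proc.returncode != 0:
            failures.append(f"input {stdin!r}: runtime error: {proc.stderr.strip()}")
        elif proc.stdout.strip() != expected.strip():
            failures.append(f"input {stdin!r}: expected {expected!r}, "
                            f"got {proc.stdout.strip()!r}")
    return failures

def rollout(generate, problem: str, public_tests, private_tests, max_turns: int = 3):
    """One episode: generate code, check it on public tests, feed failures
    back to the model, and repeat up to the turn limit; the scalar reward
    is computed from the held-out private tests only."""
    history = [problem]
    code = ""
    for _ in range(max_turns):
        code = generate(history)  # `generate` stands in for the LLM policy
        failures = run_tests(code, public_tests)
        if not failures:
            break  # public tests pass; stop iterating early
        history.append("Public tests failed:\n" + "\n".join(failures))
    reward = 1.0 if not run_tests(code, private_tests) else 0.0
    return code, reward
```

For example, a toy "print the double of an integer" problem with public test `("3", "6")` and private tests `("5", "10"), ("0", "0")` yields reward 1.0 as soon as the model emits `print(int(input()) * 2)`. Keeping the reward tied to private tests is what prevents the policy from merely hard-coding the visible inputs.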