GRPO: Group Relative Policy Optimization—an RL algorithm that scores a group of sampled outputs relative to one another (normalizing each reward by the group's mean and standard deviation) to update the policy, removing the need for a separate critic model
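The group-relative scoring at the heart of GRPO can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes each sampled solution has already been assigned a scalar reward (e.g. by the execution sandbox), and the function name is ours.

```python
import statistics

def group_advantages(rewards):
    """Normalize each reward against its group's mean and standard
    deviation, replacing the advantage estimate a critic would give."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against all-equal rewards
    return [(r - mean) / std for r in rewards]

# Four sampled solutions to the same problem, scored 1.0 if correct:
print(group_advantages([1.0, 0.0, 1.0, 0.0]))  # → [1.0, -1.0, 1.0, -1.0]
```

Solutions above the group mean get positive advantages and are reinforced; those below are pushed down, all without training a value model.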
DPO: Direct Preference Optimization—a method to align models to preferences (like efficiency) using static pairs of preferred and dispreferred outputs
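For one preference pair, the DPO objective reduces to a logistic loss over log-probability margins. A minimal sketch, assuming the per-sequence log-probabilities from the policy and a frozen reference model are already available; the variable names and the `beta=0.1` default are illustrative:

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """-log sigmoid(beta * margin), where the margin is how much more
    the policy prefers the winner over the loser than the reference does."""
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return math.log1p(math.exp(-beta * margin))  # numerically stable form

# Policy agrees exactly with the reference: margin 0, loss = log 2 ≈ 0.693
print(dpo_loss(-1.0, -2.0, -1.0, -2.0))
```

Minimizing this pushes the policy to assign relatively more probability to the preferred (here, more efficient) solution, with no reward model or sampling loop needed.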
SFT: Supervised Fine-Tuning—training the model to mimic high-quality examples (efficient code) given inputs (inefficient code)
IOF: Iterative Optimization Framework—the paper's proposed loop where code is generated, executed, and refined in cycles
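The generate-execute-refine cycle can be sketched as a short loop. Everything here is a placeholder standing in for the paper's components: `generate` for the model, `run_sandbox` for the execution environment, and the stopping condition for whatever acceptance criteria the framework uses.

```python
def iof(problem, generate, run_sandbox, max_rounds=3):
    """Generate code, execute it for feedback, and regenerate with that
    feedback until it passes or the round budget runs out."""
    code = generate(problem, feedback=None)
    for _ in range(max_rounds):
        feedback = run_sandbox(code)  # e.g. correctness, time, memory
        if feedback["correct"] and feedback["fast_enough"]:
            break
        code = generate(problem, feedback=feedback)
    return code
```

The key design point is that execution feedback flows back into the next generation call, so each round conditions on measured behavior rather than on the prompt alone.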
Monolith: The paper's execution sandbox that provides feedback on correctness, execution time, and memory usage
Pass@1: The percentage of problems where the model's first generated solution is functionally correct
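Computing Pass@1 is a straightforward average. A sketch, assuming each benchmark problem records whether the model's first sampled solution passed all of its tests:

```python
def pass_at_1(first_sample_passed):
    """Fraction of problems solved by the first generated solution."""
    return sum(first_sample_passed) / len(first_sample_passed)

print(pass_at_1([True, False, True, True]))  # → 0.75
```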
Beyond-I: A metric measuring how often the model's generated code is more efficient than human-submitted reference solutions
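Beyond-I can be sketched similarly, assuming each problem records correctness plus measured runtimes for the generated and reference solutions. The field names, and the choice to count an incorrect solution as not beating the reference, are our assumptions rather than the paper's exact specification:

```python
def beyond_i(results):
    """Fraction of problems where the generated code is both correct
    and strictly faster than the human-submitted reference."""
    wins = sum(1 for r in results
               if r["correct"] and r["gen_time"] < r["ref_time"])
    return wins / len(results)

runs = [
    {"correct": True,  "gen_time": 0.8, "ref_time": 1.0},  # faster: counts
    {"correct": True,  "gen_time": 1.2, "ref_time": 1.0},  # slower
    {"correct": False, "gen_time": 0.5, "ref_time": 1.0},  # wrong answer
    {"correct": True,  "gen_time": 0.4, "ref_time": 1.0},  # faster: counts
]
print(beyond_i(runs))  # → 0.5
```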