SFT: Supervised Fine-Tuning—training a model on labeled examples ((problem, solution) pairs) to learn the desired behavior
RL: Reinforcement Learning—training a model to maximize a reward signal, here based on passing test cases
CoT: Chain-of-Thought—a prompting strategy where the model generates intermediate reasoning steps before the final answer
GRPO: Group Relative Policy Optimization—an RL algorithm that normalizes rewards within a group of outputs for the same input to reduce variance
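The group normalization in GRPO can be sketched as follows: rewards for several sampled outputs of the same prompt are standardized against the group's own mean and standard deviation, so no learned value baseline is needed. This is a minimal illustration of the advantage computation only, not a full training loop.

```python
# Minimal sketch of GRPO's group-relative advantage computation.
# Rewards here are scalar (e.g., fraction of test cases passed).
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its group's mean and std,
    reducing variance across outputs for the same input."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: 4 sampled solutions for one problem; two pass, two fail.
advs = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

Outputs with above-average reward get positive advantage and are reinforced; below-average outputs are penalized, all relative to siblings from the same prompt.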
AST: Abstract Syntax Tree—a tree representation of the abstract syntactic structure of source code, used here to check for syntax errors
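For Python code, the AST-based syntax check amounts to attempting a parse and rejecting candidates that fail. A minimal sketch (the function name is illustrative):

```python
# Sketch of an AST-based syntax filter: parse the candidate source
# and discard it if the parser raises a SyntaxError.
import ast

def is_syntactically_valid(code: str) -> bool:
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False
```

This catches malformed generations cheaply before any test execution, but says nothing about runtime correctness.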
Dual-verification: A strategy proposed in this paper that cross-checks synthetic solutions against synthetic test cases to filter out incorrect data
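The dual-verification idea can be sketched as executing each synthetic solution against its synthetic test cases and keeping only pairs that agree. The function below is a hypothetical illustration of this filtering step, not the paper's actual implementation.

```python
# Hedged sketch of dual-verification filtering: a (solution, tests)
# pair survives only if the solution passes every synthetic test.
# Names and the (args, expected) test format are illustrative.

def dual_verify(solution_fn, test_cases) -> bool:
    """Return True if solution_fn passes all (args, expected) pairs."""
    for args, expected in test_cases:
        try:
            if solution_fn(*args) != expected:
                return False
        except Exception:
            return False  # crashes also disqualify the pair
    return True

# Example: a synthetic add() checked against its synthetic tests.
keep = dual_verify(lambda a, b: a + b, [((1, 2), 3), ((0, 0), 0)])
```

Because both the solutions and the tests are model-generated, requiring them to agree filters out cases where either side is wrong, at the cost of occasionally keeping pairs that are consistently wrong together.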
Pass@k: A metric estimating the probability that at least one of the top k generated solutions is correct
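Pass@k is commonly computed with the unbiased estimator introduced with Codex: given n samples per problem of which c are correct, pass@k = 1 - C(n-c, k)/C(n, k), averaged over problems. A minimal sketch for a single problem:

```python
# Unbiased pass@k estimator for one problem:
# n = total samples, c = number of correct samples.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0  # every size-k subset contains a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 samples, 3 correct -> pass@1 = 0.3
p = pass_at_k(10, 3, 1)
```

Sampling n > k solutions and averaging this estimator gives lower variance than literally drawing k samples per problem.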
LiveCodeBench: A benchmark for code generation that focuses on recent competitive programming problems to avoid data contamination
TACO: A large-scale dataset of competitive programming problems used here as a source for feature extraction