GRPO: Group Relative Policy Optimization, an RL algorithm that normalizes advantages within a group of sampled outputs to optimize the policy without a learned critic
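A minimal sketch of the group-relative advantage computation described above, assuming z-score normalization over the group (the function name and the division-by-zero guard are illustrative choices, not from the source):

```python
import statistics

def group_relative_advantages(rewards):
    """Normalize rewards within one group of sampled outputs.

    Each output's advantage is its reward minus the group mean, divided
    by the group standard deviation -- no learned value function needed.
    """
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against a constant group
    return [(r - mean) / std for r in rewards]

# e.g. two passing and two failing samples in a group of four:
advs = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Passing samples get positive advantages and failing ones negative, so the policy gradient pushes toward the better outputs in each group.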
Thompson Sampling: A heuristic for choosing actions that addresses the exploration-exploitation dilemma by sampling from a probability distribution describing the expected reward of each action
CodeBLEU: A metric for code evaluation that considers syntactic and semantic similarity (data flow, structure) rather than just n-gram matching
pass@k: A metric measuring the probability that at least one of the top k generated code samples passes all unit tests
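pass@k is usually computed with the unbiased combinatorial estimator from the HumanEval evaluation (Chen et al., 2021) rather than by literally resampling k outputs; a small sketch:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: the probability that at least one of
    k samples, drawn without replacement from n total generations of
    which c are correct, passes all unit tests.
    """
    if n - c < k:
        # fewer than k incorrect samples exist, so any draw of k
        # must include a correct one
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# with 2 generations, 1 correct, drawing 1: pass@1 = 0.5
p = pass_at_k(2, 1, 1)
```

Generating n > k samples and plugging the counts into this formula gives a lower-variance estimate than a single draw of k.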
Beta distribution: A continuous probability distribution bounded between 0 and 1, often used in Bayesian inference to model the probability of success (used here for Thompson Sampling)
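The two entries above combine naturally: with binary (pass/fail) rewards, each action's success probability gets a Beta posterior, and Thompson Sampling draws one sample per arm and plays the argmax. A sketch assuming a uniform Beta(1, 1) prior (the function name is illustrative):

```python
import random

def thompson_select(successes, failures):
    """Choose an arm by sampling from each arm's Beta posterior.

    successes[i] and failures[i] count observed outcomes for arm i;
    under a uniform prior the posterior is Beta(successes + 1, failures + 1).
    """
    samples = [random.betavariate(s + 1, f + 1)
               for s, f in zip(successes, failures)]
    return max(range(len(samples)), key=samples.__getitem__)

# arm 0 has a strong track record, so it is selected almost always,
# yet arm 1 still gets occasional exploratory pulls
choice = thompson_select([50, 0], [0, 50])
```

Sampling from the posterior (rather than taking its mean) is what balances exploration and exploitation: uncertain arms occasionally draw a high sample and get tried.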
data augmentation: The process of artificially increasing the diversity and size of training data; here, the tree search generates diverse debugging paths for the model to learn from
AdamW: A stochastic optimization method that modifies the typical implementation of weight decay in Adam, decoupling it from the gradient update
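A scalar sketch of one AdamW update, showing the decoupling: the weight-decay term shrinks the parameter directly instead of being folded into the gradient that feeds the moment estimates (hyperparameter defaults here mirror common library settings, not values from the source):

```python
def adamw_step(params, grads, state, lr=1e-3, betas=(0.9, 0.999),
               eps=1e-8, weight_decay=0.01):
    """One AdamW step over a list of scalar parameters.

    state holds the step count t and per-parameter moment estimates m, v.
    """
    state["t"] += 1
    t = state["t"]
    b1, b2 = betas
    new_params = []
    for i, (p, g) in enumerate(zip(params, grads)):
        m = state["m"][i] = b1 * state["m"][i] + (1 - b1) * g
        v = state["v"][i] = b2 * state["v"][i] + (1 - b2) * g * g
        m_hat = m / (1 - b1 ** t)  # bias-corrected first moment
        v_hat = v / (1 - b2 ** t)  # bias-corrected second moment
        # decoupled weight decay: applied to the weight itself,
        # independent of the adaptive gradient update
        p = p - lr * weight_decay * p
        p = p - lr * m_hat / (v_hat ** 0.5 + eps)
        new_params.append(p)
    return new_params
```

In plain Adam, weight decay is added to the gradient and is therefore rescaled by the adaptive denominator; decoupling it restores the intended uniform shrinkage toward zero.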