PRM: Process-Supervised Reward Model—a model trained to score individual steps of a reasoning process (positive/neutral/negative) rather than just the final outcome
HGS-PRM: Heuristic Greedy Search with PRM—the authors' proposed algorithm that uses PRM scores to decide whether to keep expanding a reasoning path or backtrack
CoT: Chain of Thought—a prompting technique that encourages LLMs to generate intermediate reasoning steps
AST: Abstract Syntax Tree—a tree representation of the abstract syntactic structure of source code, used here to programmatically mutate code for data generation
Mutation Testing: A software testing technique where 'mutants' (deliberately modified versions of code) are created to evaluate the adequacy of a test suite, which should detect them; used here to generate 'negative' code examples
pass@1: A metric for code generation measuring the fraction of problems for which a single generated solution passes all unit tests
RLHF: Reinforcement Learning from Human Feedback—training method to align models using reward models derived from human preferences
GSM8K: Grade School Math 8K—a dataset of grade school math word problems
MBPP: Mostly Basic Python Problems—a benchmark dataset for code generation
SFT: Supervised Fine-Tuning—training a pre-trained model on specific task data (e.g., math instructions)
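The HGS-PRM entry above can be made concrete with a minimal sketch. This is not the authors' implementation: `generate_step` and `prm_score` are hypothetical stand-ins for the generator LLM and the trained PRM, and the retry/backtrack policy shown is one plausible reading of "keep expanding or backtrack" based on step scores.

```python
def generate_step(path, attempt):
    # Placeholder: an LLM would propose the next reasoning step here.
    return f"step-{len(path)}-try-{attempt}"

def prm_score(path, step):
    # Placeholder: a PRM would score the step as +1 / 0 / -1 here.
    return 1

def hgs_prm(max_steps=5, max_retries=3):
    """Greedily extend a reasoning path; backtrack on negative PRM scores."""
    path = []
    while len(path) < max_steps:
        for attempt in range(max_retries):
            step = generate_step(path, attempt)
            if prm_score(path, step) >= 0:  # keep neutral/positive steps
                path.append(step)
                break
        else:
            if not path:   # nothing left to backtrack to; give up
                return path
            path.pop()     # backtrack: discard the last kept step
    return path
```

With the stub scorer always returning +1, the search simply extends the path to `max_steps`; a real PRM would occasionally trigger the retry and backtrack branches.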
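The AST and Mutation Testing entries combine in the paper's data-generation pipeline: parse correct code into an AST, mutate a node, and emit the result as a 'negative' example. A minimal illustration using Python's standard `ast` module (the specific operator swap is just one example of a mutation):

```python
import ast

class FlipComparison(ast.NodeTransformer):
    """Swap `<` for `>=` to create a buggy mutant of the source code."""
    def visit_Compare(self, node):
        self.generic_visit(node)
        node.ops = [ast.GtE() if isinstance(op, ast.Lt) else op
                    for op in node.ops]
        return node

src = "def is_minor(age):\n    return age < 18\n"
tree = ast.parse(src)
mutant = ast.unparse(FlipComparison().visit(tree))  # Python 3.9+
print(mutant)  # the mutant returns `age >= 18`, inverting the logic
```

Because the mutation is applied programmatically, large numbers of labeled negative steps can be produced without human annotation.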