Long CoT: An extended chain-of-thought reasoning trace generated by models such as o1 or DeepSeek-R1, often spanning thousands of tokens and containing internal reflections
Process Reward Model (PRM): A model trained to evaluate the correctness of intermediate reasoning steps rather than just the final answer
o1-like models: Large Language Models designed to 'think' for extended periods before answering, producing long, complex reasoning traces
Macro-F1: A metric that computes an F1 score for each sample independently and then averages them; used here to handle the imbalance between error and non-error sections
HitRate@k: A metric measuring the proportion of samples where at least one true error section is found within the top-k sections ranked by the model
PCB: Abbreviation for the Physics, Chemistry, and Biology domains
Section-level Segmentation: Dividing a long response into semantic sub-tasks (using logic or delimiters) rather than line-by-line steps
Z-Score: The number of standard deviations by which a score deviates from the mean of its group; used here for outlier detection in PRM rewards
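The per-sample Macro-F1 described above can be sketched as follows. This is a minimal illustration, not the benchmark's exact scoring script; the helper names and the set-based section representation are assumptions.

```python
def f1(pred_sections, true_sections):
    """F1 between predicted and true error-section sets for ONE sample."""
    pred, truth = set(pred_sections), set(true_sections)
    tp = len(pred & truth)  # correctly flagged sections
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(truth)
    return 2 * precision * recall / (precision + recall)

def macro_f1(samples):
    """Average the per-sample F1 scores over all (pred, truth) pairs."""
    return sum(f1(p, t) for p, t in samples) / len(samples)
```

Averaging per sample keeps a single long response with many sections from dominating the score.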
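HitRate@k admits a similarly compact sketch, again with assumed function names: given sections ranked by the model's error score, a sample counts as a hit if any true error section appears in the top k.

```python
def hit_rate_at_k(ranked_sections, true_error_sections, k):
    """1 if any true error section appears in the top-k ranked sections, else 0."""
    return int(any(s in true_error_sections for s in ranked_sections[:k]))

def mean_hit_rate_at_k(samples, k):
    """Average the per-sample hit indicator over (ranking, truth) pairs."""
    return sum(hit_rate_at_k(r, t, k) for r, t in samples) / len(samples)
```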
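A delimiter-based variant of section-level segmentation can be approximated in a few lines. Splitting on blank lines is an assumption for illustration; in practice the boundaries may come from logical sub-task detection rather than a fixed pattern.

```python
import re

def segment_response(response, delimiter=r"\n\s*\n"):
    """Split a long response into sections on blank lines (assumed delimiter)."""
    sections = [s.strip() for s in re.split(delimiter, response)]
    return [s for s in sections if s]  # drop empty fragments
```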
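Z-score outlier detection over a sequence of PRM step rewards can be sketched as below; the threshold value is an assumption, chosen only to illustrate flagging unusually low-reward steps.

```python
import statistics

def z_score_outliers(rewards, threshold=-1.5):
    """Return indices of rewards whose z-score falls below the threshold
    (i.e., steps scored unusually low relative to the sample mean)."""
    mu = statistics.mean(rewards)
    sigma = statistics.stdev(rewards)
    return [i for i, r in enumerate(rewards)
            if (r - mu) / sigma < threshold]
```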