PNS: Probability of Necessity and Sufficiency—a causal metric measuring how likely a specific step is both required for the outcome and capable of producing it
CoT: Chain-of-Thought—a prompting technique where models generate intermediate reasoning steps before the final answer
Rollout: A process where the model generates the remainder of a sequence from a specific intervention point (e.g., after deleting a step) to see how the outcome changes
Intervention: Deliberately changing a variable (in this case, a reasoning step) to observe causal effects, denoted as do(S)
ICL: In-Context Learning—providing examples in the prompt to guide model behavior without updating weights
SFT: Supervised Fine-Tuning—updating model weights on a labeled dataset
GSM-8k: A benchmark dataset of grade school math word problems
AIME: American Invitational Mathematics Examination—a benchmark of difficult math competition problems
PN: Probability of Necessity—probability that the correct answer would not have occurred had the step been removed or changed, given that the step was present and the answer was correct
PS: Probability of Sufficiency—probability that including the step would have produced the correct answer, given that the step was absent and the answer was incorrect
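
For reference, the three causal quantities above (PN, PS, PNS) can be written in the standard counterfactual notation for binary variables. This is a sketch under assumptions, not necessarily the exact formulation used in this work: X = x is assumed to mean the reasoning step is present and X = x' that it is removed or changed; Y = y is assumed to mean the final answer is correct and Y = y' that it is incorrect; Y_x denotes the outcome under the intervention do(X = x).

```latex
\begin{align}
\mathrm{PN}  &= P\left(Y_{x'} = y' \mid X = x,\; Y = y\right) \\
\mathrm{PS}  &= P\left(Y_{x} = y \mid X = x',\; Y = y'\right) \\
\mathrm{PNS} &= P\left(Y_{x} = y,\; Y_{x'} = y'\right) \\
\mathrm{PNS} &= P(x, y)\,\mathrm{PN} + P(x', y')\,\mathrm{PS}
\end{align}
```

The last line is the standard identity linking the three quantities: PNS weights PN by how often the step and the correct answer co-occur, and PS by how often both are absent.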