Teacher Forcing: A training method where the model receives ground-truth intermediate outputs as input for the next step, rather than its own generated outputs
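The distinction can be sketched with a toy step-by-step predictor (the names, the recurrence, and the 0.1 per-step error are illustrative, not from the source):

```python
def model_step(prev):
    """Stand-in for one model step: the true recurrence is prev + 1,
    but this 'model' overshoots by 0.1 (an imperfect learner)."""
    return prev + 1 + 0.1

def free_running(T):
    """No teacher forcing: the model consumes its own outputs,
    so the 0.1-per-step error compounds over T steps."""
    prev, outputs = 0.0, []
    for _ in range(T):
        prev = model_step(prev)
        outputs.append(prev)
    return outputs

def teacher_forced(T):
    """Teacher forcing: each step receives the ground-truth previous
    value (here simply t), so the error stays bounded at 0.1."""
    return [model_step(float(t)) for t in range(T)]
```

After 5 steps the free-running output has drifted by roughly 0.5 from the true value, while each teacher-forced step is off by only 0.1, illustrating why ground-truth intermediate inputs prevent error accumulation.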
Self-consistency: A technique where the model generates multiple reasoning paths or uses auxiliary data to verify that intermediate steps are consistent before proceeding
Parity Problem: A classic hard learning problem where the label is the sum modulo 2 of a subset of input bits (equivalently, the product of the corresponding ±1 values); known to be hard for gradient descent
k-parity: The specific version of the parity problem where the target depends on exactly k bits of the input
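A minimal sketch of the two equivalent formulations above (the index set {0, 2, 5} and the 8-bit input are hypothetical examples, not from the source):

```python
import math

def k_parity(x, S):
    """k-parity label: sum modulo 2 of the bits of x indexed by S."""
    return sum(x[i] for i in S) % 2

# Hypothetical example: a 3-parity on bits {0, 2, 5} of an 8-bit input.
S = [0, 2, 5]
x = [1, 0, 1, 1, 0, 1, 0, 0]
label = k_parity(x, S)

# Equivalent +/-1 formulation: map bit b -> (-1)**b; the product over S
# is -1 exactly when the parity label is 1.
pm = [1 - 2 * b for b in x]
prod = math.prod(pm[i] for i in S)
```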
SQ (Statistical Query) hardness: A lower-bound framework implying that algorithms which access data only through statistical expectations (such as gradient estimates) need exponentially many queries to learn certain function classes, with parities as the canonical example
Process supervision: Training signals provided on the intermediate steps of reasoning (the 'process') rather than just the final answer
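For the parity setting, the contrast with outcome supervision can be sketched as follows (toy code, not the paper's training setup; assumes the input length is a power of two): process supervision provides a target for every intermediate 2-parity, while outcome supervision labels only the final answer.

```python
def intermediate_targets(bits):
    """All intermediate 2-parities of a balanced reduction of `bits`
    (len(bits) assumed a power of two). The last element is the final
    answer: process supervision trains on the whole list, outcome
    supervision only on that last element."""
    level, targets = list(bits), []
    while len(level) > 1:
        level = [(level[i] + level[i + 1]) % 2
                 for i in range(0, len(level), 2)]
        targets.extend(level)
    return targets

bits = [1, 0, 1, 1]
process_targets = intermediate_targets(bits)  # [1, 0, 1]
outcome_target = process_targets[-1]          # 1 == sum(bits) % 2
```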
One-layer transformer: A simplified transformer model with a single attention head and feedforward layer, applied autoregressively so that each generated token is fed back as input for the next step
Task decomposition: Breaking a complex problem (k-parity) into a hierarchy of simpler sub-problems (2-parity) arranged in a tree structure
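A minimal sketch of this tree decomposition (illustrative, and assuming k is a power of two): each level halves the problem, reducing a k-parity to k/2 independent 2-parities.

```python
def two_parity(a, b):
    """The base sub-problem: parity of just two bits."""
    return (a + b) % 2

def tree_parity(bits):
    """Evaluate a k-parity as a balanced binary tree of 2-parities
    (len(bits) assumed a power of two). Returns the root label and
    the per-level trace of sub-problem results."""
    level = list(bits)
    trace = [level]
    while len(level) > 1:
        level = [two_parity(level[i], level[i + 1])
                 for i in range(0, len(level), 2)]
        trace.append(level)
    return level[0], trace
```

For example, `tree_parity([1, 0, 1, 1])` reduces a 4-parity to two 2-parities and then one more, giving the trace `[[1, 0, 1, 1], [1, 0], [1]]` and root label 1.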