GRPO: Group Relative Policy Optimization—a policy gradient method that computes each output's advantage by normalizing its reward against the mean (and standard deviation) of a group of outputs sampled from the same prompt
Fork-Join: A parallel programming model where execution splits into parallel branches (fork) and merges back (join) at a synchronization point
Trie: A prefix tree data structure; here used to merge multiple reasoning branches into a single training sequence with shared prefixes
KV cache: Key-Value cache—cached attention keys and values from already-processed tokens, reused to avoid recomputation during autoregressive generation
SFT: Supervised Fine-Tuning—training a pre-trained model on labeled examples
Ancestor-only attention: An attention mask where a token can only attend to itself and its ancestors in the dependency tree (trie), preventing cross-branch information leakage
Critical path: The longest sequence of dependent operations in a parallel execution graph, determining the minimum total latency
Pareto frontier: The set of optimal trade-offs between two conflicting objectives (here, speed vs. accuracy) where improving one requires sacrificing the other
Self-consistency: A method where an LLM generates multiple reasoning paths and aggregates the results (e.g., via majority vote) to improve accuracy
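The trie and ancestor-only attention entries fit together: reasoning branches generated from the same prompt are merged into a trie so shared prefixes are stored once, and a mask restricts each token to its own root-to-leaf path. A minimal sketch, using toy string tokens and illustrative function names (not any particular implementation):

```python
def build_trie_sequence(branches):
    """Merge token branches into one flat sequence with shared prefixes stored once.

    Returns (tokens, parent), where parent[i] is the index in `tokens` of
    token i's parent in the trie (-1 for a child of the virtual root).
    """
    tokens, parent = [], []
    children = [{}]  # children[n]: token -> child node index; node 0 is a virtual root
    node_pos = [-1]  # node_pos[n]: position of node n's token in `tokens` (-1 for root)
    for branch in branches:
        cur = 0
        for tok in branch:
            if tok in children[cur]:
                cur = children[cur][tok]          # shared prefix: reuse existing node
            else:
                tokens.append(tok)                # new node: append token once
                parent.append(node_pos[cur])
                children.append({})
                node_pos.append(len(tokens) - 1)
                children[cur][tok] = len(children) - 1
                cur = children[cur][tok]
    return tokens, parent

def ancestor_mask(parent):
    """mask[i][j] is True iff token i may attend to token j (j == i or an ancestor of i)."""
    n = len(parent)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        j = i
        while j != -1:                            # walk up the trie to the root
            mask[i][j] = True
            j = parent[j]
    return mask
```

For branches `["a","b"]` and `["a","c"]`, the shared token `a` is stored once, and the mask lets `c` attend to `a` but not to `b`, so the two branches stay mutually invisible while sharing one prefix computation.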