HICRA (Hierarchy-Aware Credit Assignment): The proposed RL algorithm, which weights updates for strategic planning tokens more heavily than updates for procedural tokens
Strategic Grams: n-grams (phrases) that function as semantic units for high-level reasoning moves (deduction, branching, backtracing), identified via clustering and frequency analysis
GRPO (Group Relative Policy Optimization): A prevailing RL algorithm that normalizes rewards within a group of outputs; used here as a baseline that applies optimization pressure uniformly across all tokens, regardless of their role
Semantic Entropy: The entropy of the frequency distribution of Strategic Grams; a metric used to quantify the diversity of high-level strategic plans
Cluster Document Frequency: The frequency of unique solutions containing at least one n-gram from a specific semantic cluster; used to identify reusable strategic scaffolds
Relative Perplexity: Perplexity normalized by its initial value, used to track the rate of improvement for different token types (execution vs. planning)
Pass@K: A metric measuring the probability that at least one of K generated solutions is correct
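
Several of the definitions above reduce to short formulas. The sketch below is a minimal illustration under stated assumptions, not the paper's implementation. All function names are hypothetical. Strategic Grams are assumed to be given as plain strings and matched by substring. GRPO normalization is assumed to be the usual (reward - group mean) / group standard deviation. Pass@K is computed with the standard unbiased estimator from n samples of which c are correct, which may differ from how the paper computes it.

```python
import math
from collections import Counter


def group_normalized_advantages(rewards, eps=1e-8):
    """GRPO-style normalization: center and scale rewards within one group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    # eps guards against division by zero when all rewards are equal.
    return [(r - mean) / (std + eps) for r in rewards]


def semantic_entropy(gram_occurrences):
    """Shannon entropy (in nats) of the Strategic Gram frequency distribution."""
    counts = Counter(gram_occurrences)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def cluster_document_frequency(solutions, cluster_grams):
    """Count distinct solutions containing at least one n-gram of a cluster.

    Assumption: n-grams are matched as plain substrings.
    """
    return sum(
        1 for s in set(solutions) if any(g in s for g in cluster_grams)
    )


def relative_perplexity(perplexities):
    """Divide each perplexity by the initial value, so the series starts at 1."""
    initial = perplexities[0]
    return [p / initial for p in perplexities]


def pass_at_k(n, c, k):
    """Unbiased Pass@K estimate: 1 - C(n - c, k) / C(n, k)."""
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset has a correct one.
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)
```

For example, `semantic_entropy(["let's try", "alternatively"])` gives log 2, the maximum for two equally frequent grams, while a model that always reuses one gram scores 0.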