GRPO: Group Relative Policy Optimization, an RL algorithm that samples a group of outputs for the same input and estimates each output's advantage by comparing its reward against the group's mean reward, removing the need for a separate critic (value) model
SFT: Supervised Fine-Tuning—training a model to mimic expert demonstrations
Honest Deductive Reasoning: The ability of a model to produce a conclusion only when it is logically entailed by premises, and to abstain (output 'Unknown') otherwise
DAH: Directed Acyclic Hypergraph, a graph structure in which each hyperedge connects a set of premise nodes to a single conclusion node; used here to model reasoning chains
Rollout: A single complete sequence generated by the model during the exploration phase of Reinforcement Learning
Reasoning Depth (k): The number of steps (edges) in the derivation path from premises to the query conclusion
Cut Depth (d): In unanswerable instances, the distance from the query node at which the reasoning chain is broken by removing an edge
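
The group-relative advantage in the GRPO entry can be illustrated with a minimal sketch. The function name, the example rewards, and the `eps` term are hypothetical, and dividing by the group standard deviation is a common normalization assumed here; the glossary itself mentions only the group mean.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Return one advantage per rollout in a group sampled for the same input.

    Each advantage is the rollout's reward minus the group mean, divided by
    the group standard deviation (assumed normalization). No critic model is
    involved; the group itself serves as the baseline.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    # eps (hypothetical) avoids division by zero when all rewards are equal
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four rollouts of the same prompt
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Rollouts that score above the group mean receive positive advantages and those below receive negative ones. If every rollout in a group earns the same reward, all advantages are zero and that group contributes no learning signal.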