fs1: Factual Simple Test-time Scaling—the proposed method of fine-tuning models on reasoning traces grounded by Knowledge Graph paths.
rt: Raw Reasoning Traces—traces extracted directly from large reasoning models (like DeepSeek-R1) without external grounding.
pass@k: A metric measuring the probability that at least one of k generated samples contains a correct answer.
KG path: A sequence of entities and relations from a Knowledge Graph (e.g., Wikidata) connecting the question subject to the answer.
test-time scaling: Improving model performance by increasing computation during inference, often by generating multiple samples and selecting the best one.
LLM-as-a-judge: Using a strong LLM (e.g., Llama-3.3-70B) to evaluate whether a generated answer is semantically equivalent to the gold-standard answer.
SFT: Supervised Fine-Tuning—training a pre-trained model on a specific dataset of inputs and targets.
linearized graph: Representing a graph structure (nodes and edges) as a sequence of text tokens so an LLM can process it.
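
Two of the entries above, pass@k and linearized graph, can be made concrete with a short sketch. The pass@k function below uses the common combinatorial estimator 1 - C(n-c, k) / C(n, k), computed from n samples of which c are correct; the glossary does not say which estimator is actually used, so treat that choice as an assumption. The linearize_path helper, its " -> " separator, and the example triples are hypothetical and only illustrate the idea of turning a KG path into text, not the format used for fs1.

```python
from math import comb
from typing import List, Tuple

# A KG edge as (head entity, relation, tail entity). Illustrative type only.
Triple = Tuple[str, str, str]


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n generated samples, c of which are correct.

    Assumption: the standard combinatorial estimator, i.e. one minus the
    probability that k samples drawn without replacement are all incorrect.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: any draw of k includes a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def linearize_path(path: List[Triple], sep: str = " -> ") -> str:
    """Flatten a connected KG path into one text string (hypothetical format)."""
    if not path:
        raise ValueError("path must contain at least one triple")
    tokens = [path[0][0]]  # start from the question subject
    for head, relation, tail in path:
        if head != tokens[-1]:
            # Each hop must start where the previous hop ended.
            raise ValueError(f"path is not connected at entity {head!r}")
        tokens.extend([relation, tail])
    return sep.join(tokens)


# Illustrative two-hop path from a question subject to an answer entity.
example_path = [
    ("Marie Curie", "place of birth", "Warsaw"),
    ("Warsaw", "country", "Poland"),
]
print(linearize_path(example_path))
print(pass_at_k(n=10, c=3, k=1))
```

With k = 1 the estimate reduces to the fraction of correct samples, and it rises toward 1 as k grows, which is the effect that sampling-based test-time scaling relies on.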