ReAct: Reason+Act, a prompting framework in which the model alternates between generating reasoning text and executing actions (tool calls)
Gold tools: The specific set of tools required to solve a given problem correctly, derived from the ground-truth solution
Distractors: Tools included in the environment that are not needed for the solution but may be semantically similar to the required tools
Logical hop: One step in a chain of dependent tool calls; the number of logical hops measures the depth of dependent tool use a problem requires (the length of the necessary tool-call sequence)
Distractors-only: An evaluation setting in which the required (gold) tools are removed, testing the model's ability to abstain or fall back rather than hallucinate
Tool-wise validation: Checking whether a tool's Python implementation matches its natural language description using test cases and an LLM judge
Question-wise validation: Verifying that a problem is empirically solvable by at least one model using the provided tool set
DFSDT: Depth-First Search-based Decision Tree, a tool-use protocol that explores solution paths via branching and backtracking
SFT: Supervised Fine-Tuning, further training of a pretrained model on labeled input-output examples
Plan+ReAct: A variant of ReAct where the model first generates a high-level plan before entering the reasoning/action loop