HLE: Humanity's Last Exam, a challenging benchmark developed by experts to evaluate AI on frontier scientific knowledge
Inference-time computation: Spending more computational resources during the generation phase (e.g., through multiple drafts, verification steps, or search) rather than just during training
Code as Interaction Language: A design paradigm where the agent uses executable programming code (Python) to interface with tools, offering higher precision than natural language or JSON
Scattered-and-Stacked: A workflow strategy alternating between parallel generation of diverse solutions (scattering) and aggregating/refining them (stacking)
Initial Reasoning Guidance: A prompting technique that injects first-person instructions into the model's context to steer a non-agentic model into behaving like an agent
Rollouts: In reinforcement learning, simulating multiple future trajectories to estimate the value of a current state; used here as an analogy for parallel solution generation
DeepSeek-R1: An open-source reasoning model used as the backbone for the agents in this paper
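
To make the Scattered-and-Stacked entry above concrete, here is a minimal sketch of one way the alternation could look. It is not the paper's implementation: the function names, the majority-vote aggregation, and the `toy_generate` stand-in for a model call are all assumptions made for illustration.

```python
from collections import Counter
from typing import Callable, List

# Hypothetical signature for a model call: (prompt, seed) -> answer text.
Generator = Callable[[str, int], str]


def scatter(generate: Generator, prompt: str, n: int) -> List[str]:
    """Scattering: produce n independent candidate answers for one prompt."""
    return [generate(prompt, seed) for seed in range(n)]


def stack(candidates: List[str]) -> str:
    """Stacking: aggregate candidates. A majority vote is assumed here;
    a real system might instead ask a model to merge or refine them."""
    return Counter(candidates).most_common(1)[0][0]


def scattered_and_stacked(generate: Generator, prompt: str,
                          n: int = 4, rounds: int = 2) -> str:
    """Alternate scatter and stack for a fixed number of rounds."""
    answer = ""
    for _ in range(rounds):
        # Feed the previous aggregate back in so later drafts can build on it.
        context = prompt if not answer else f"{prompt}\nPrevious answer: {answer}"
        answer = stack(scatter(generate, context, n))
    return answer


def toy_generate(prompt: str, seed: int) -> str:
    """Deterministic stand-in for a model: one draft in four disagrees."""
    return "42" if seed % 4 else "41"


result = scattered_and_stacked(toy_generate, "What is 6 * 7?")
```

Each call to `scatter` plays the role of a batch of parallel rollouts, and `stack` is the step where extra inference-time computation is spent reconciling them into a single answer.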