SciMaster: Towards General-Purpose Scientific AI Agents, Part I. X-Master as Foundation: Can We Lead on Humanity's Last Exam?

📝 Paper Summary

Multi-call tool use with flexible plan · Multi-agent · Agentic reasoning

X-Masters, the system introduced in Part I of the SciMaster series, reaches state-of-the-art accuracy on Humanity's Last Exam (32.1%) by using a scattered-and-stacked multi-agent workflow in which models write and execute Python code to interact with external tools.

Core Problem

Existing strong reasoning models are either non-agentic (limited tool use) or closed-source (OpenAI o3, Google Deep Research), limiting community progress in applying AI to complex scientific discovery.

Why it matters:

Closed-source nature of leading models (OpenAI, Google) prevents researchers from understanding or building upon the mechanisms for scientific problem solving
Standard LLMs lack the ability to dynamically verify facts or perform complex calculations required for frontier scientific questions
Accelerating scientific discovery requires agents that can autonomously navigate the internet and use libraries, mimicking human research workflows

Concrete Example: When faced with a frontier scientific question from Humanity's Last Exam (HLE), a standard model might hallucinate or fail to retrieve up-to-date data. X-Master writes Python code to search the web, parse specific papers from ar5iv, and calculate answers using NumPy, correcting itself based on execution feedback.

Key Novelty

Scattered-and-Stacked Agentic Workflow (X-Masters)

Concept: Scales inference-time compute by alternating between broad exploration ('scattering' via parallel solvers) and deep refinement ('stacking' via aggregation and selection)
Mechanism: Uses 'Code as Interaction Language', where the model generates Python scripts to interact with tools (web search, paper parsing) rather than JSON or special tokens, allowing flexible feedback loops
Guidance: Instead of training, uses 'Initial Reasoning Guidance' (injecting first-person self-statements into the context) to trick non-agentic models into adopting agentic behaviors
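
As a rough illustration of Initial Reasoning Guidance, the sketch below places a first-person self-statement at the start of the model's reasoning so that the model continues from it as if it were its own thought. The guidance wording, the prompt layout, and the name `build_guided_prompt` are illustrative assumptions, not the paper's actual template.

```python
# Hypothetical wording; the paper's actual self-statement may differ.
GUIDANCE = (
    "I can write Python code to search the web, parse papers, and run "
    "calculations, and I will use the execution results to check my reasoning."
)


def build_guided_prompt(question: str, guidance: str = GUIDANCE) -> str:
    """Return a prompt whose reasoning section already opens with a first-person statement."""
    # Generation resumes right after the injected statement, so the model
    # treats it as something it has already "thought" (assumed layout).
    return f"Question: {question}\n<think>\n{guidance}\n"


prompt = build_guided_prompt("What is the half-life of carbon-14?")
print(prompt)
```

No weights are changed in this approach; only the context the model continues from is altered.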

Architecture

The X-Masters scattered-and-stacked workflow

Evaluation Highlights

Achieves 32.1% on Humanity's Last Exam (HLE), setting a new state-of-the-art record
Surpasses OpenAI's best result (26.6%) by 5.5 percentage points on HLE
Outperforms Google's Deep Research (26.9%) by 5.2 percentage points on HLE

Breakthrough Assessment

9/10

First open-source model to beat OpenAI and Google on the extremely difficult HLE benchmark (passing the 30% threshold), demonstrating that inference-time scaling strategies can outperform proprietary black-box models.

⚙️ Technical Details

Problem Definition

Setting: General-purpose scientific question answering requiring external knowledge and reasoning

Inputs: Natural language query q (potentially time-sensitive/knowledge-intensive)

Outputs: Final answer Ans_final derived from multi-step reasoning and tool use

Pipeline Flow

Scattering Phase 1: Multiple Solvers generate initial solutions in parallel
Scattering Phase 2: Critics evaluate and amend Solver outputs
Stacking Phase 1: Rewriters synthesize/rewrite solutions based on Critic feedback
Stacking Phase 2 (Selection): Selector chooses the single best final answer
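
The four phases above can be sketched as a small orchestration function. This is a minimal sketch under stated assumptions: each agent is modeled as a plain function from prompt to text, the prompt layouts are invented for illustration, and each Rewriter is assumed to see all critiqued drafts. The toy agents at the bottom stand in for real model calls and exist only so the sketch runs.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

# An agent is modeled as a function from a prompt string to a response string.
Agent = Callable[[str], str]


def scatter(agent: Agent, prompts: List[str]) -> List[str]:
    """Run one agent role on several prompts in parallel, preserving order."""
    with ThreadPoolExecutor(max_workers=len(prompts)) as pool:
        return list(pool.map(agent, prompts))


def scattered_and_stacked(query: str, solver: Agent, critic: Agent,
                          rewriter: Agent, selector: Agent, width: int = 5) -> str:
    # Scattering: `width` Solvers answer the same query independently.
    drafts = scatter(solver, [query] * width)
    # Each Critic reviews and amends one Solver draft.
    revised = scatter(critic, [f"{query}\n\nDraft:\n{d}" for d in drafts])
    # Stacking: each Rewriter sees all revised drafts (assumption) and writes a new one.
    pooled = "\n---\n".join(revised)
    rewritten = scatter(rewriter, [f"{query}\n\nCandidates:\n{pooled}"] * width)
    # A single Selector picks the final answer from the rewritten candidates.
    return selector(f"{query}\n\nCandidates:\n" + "\n---\n".join(rewritten))


# Toy stand-ins for model calls, used only to make the sketch runnable.
def toy_solver(prompt: str) -> str:
    return "answer: 42"


def toy_critic(prompt: str) -> str:
    return prompt.split("Draft:\n")[-1] + " (checked)"


def toy_rewriter(prompt: str) -> str:
    return "answer: 42 (checked, merged)"


def toy_selector(prompt: str) -> str:
    return prompt.split("\n---\n")[-1]


final = scattered_and_stacked("What is 6 * 7?", toy_solver, toy_critic,
                              toy_rewriter, toy_selector)
print(final)
```

Instances of the same role run concurrently, while the four roles run one after another, so the wall-clock depth is four stages regardless of `width`.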

System Modules

Solver

Generates initial solutions using tool-augmented reasoning

Model or implementation: DeepSeek-R1

Critic

Diagnoses flaws in Solver solutions and provides corrections

Model or implementation: DeepSeek-R1

Rewriter

Synthesizes preceding outputs into superior solutions

Model or implementation: DeepSeek-R1

Selector

Adjudicates the single best answer from the Rewriter candidates

Model or implementation: DeepSeek-R1

Novel Architectural Elements

Integration of 'Code as Interaction Language' mechanism directly into the reasoning loop (between <think> tags) via string matching
Scattered-and-Stacked workflow topology: specifically the sequence of Parallel Solvers -> Parallel Critics -> Parallel Rewriters -> Single Selector
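
A minimal sketch of the string-matching step might look like the following. The `<code>`/`</code>` delimiters and the `<execution_results>` wrapper are assumptions chosen for illustration, since the summary only states that code inside the reasoning is detected by string matching. The snippet runs code in-process with `exec`, which is acceptable for a toy demo but would need a sandbox in any real system.

```python
import contextlib
import io
import re

# Assumed delimiters; the summary only says code is detected by string matching.
CODE_PATTERN = re.compile(r"<code>(.*?)</code>", re.DOTALL)


def run_snippet(source: str) -> str:
    """Execute a Python snippet and return what it printed, or the error text."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(source, {})  # Unsafe outside a sandbox; toy demo only.
    except Exception as exc:
        return f"{type(exc).__name__}: {exc}"
    return buffer.getvalue()


def append_execution_feedback(reasoning: str) -> str:
    """Run the last code block found in the reasoning text and append its output."""
    matches = CODE_PATTERN.findall(reasoning)
    if not matches:
        return reasoning  # Nothing to execute; the model keeps reasoning.
    result = run_snippet(matches[-1].strip()).rstrip("\n")
    return f"{reasoning}\n<execution_results>\n{result}\n</execution_results>\n"


partial = "<think>\nLet me compute this.\n<code>\nprint(2 ** 10)\n</code>"
print(append_execution_feedback(partial))
```

Because errors are returned as text rather than raised, a failed snippet still yields feedback that the model can read and correct on its next step.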

Modeling

Base Model: DeepSeek-R1

Comparison to Prior Work

vs. OpenAI/Google: X-Masters is open-source and relies on inference-time workflow engineering rather than proprietary model training/infrastructure
vs. DeepSeek-R1 (base): X-Masters adds tool use (web search, Python execution) and a multi-agent workflow (scattered-and-stacked) to the base model
vs. ReAct [not cited in paper]: ReAct interleaves thought and action in a linear chain; X-Masters uses a parallelized (scattered) approach with explicit critique and rewriting phases

Limitations

High computational cost due to multiple parallel agent instances (5 Solvers, 5 Critics, 5 Rewriters)
Dependency on the underlying reasoning capability of the base model (DeepSeek-R1)
Latency issues associated with synchronous web searches and paper parsing
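
To make the cost concrete, the short sketch below counts agent runs per question from the instance counts listed above, adding the single Selector from the pipeline description. It assumes one run per agent instance and ignores the extra generation caused by tool calls inside a run, so it is a lower bound rather than a measured figure.

```python
# Agent instances per question: counts from the limitation above,
# plus the single Selector from the pipeline description.
AGENTS_PER_ROLE = {"Solver": 5, "Critic": 5, "Rewriter": 5, "Selector": 1}


def agent_runs_per_question(counts: dict) -> int:
    """Total agent runs needed to answer one question."""
    return sum(counts.values())


def sequential_stages(counts: dict) -> int:
    """Roles run one after another; instances within a role run in parallel."""
    return len(counts)


print(agent_runs_per_question(AGENTS_PER_ROLE))  # 16
print(sequential_stages(AGENTS_PER_ROLE))  # 4
```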

Reproducibility

The paper states the solution is 'open-source' and 'related code will be publicly available'. It uses the open-source DeepSeek-R1 model. The specific prompt templates for Initial Reasoning Guidance are described in the text.

📊 Experiments & Results

Evaluation Setup

Evaluation on expert-level scientific problems

Benchmarks:

Humanity's Last Exam (HLE): multi-disciplinary scientific QA on frontier knowledge

Metrics:

Accuracy (%)
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	Baseline (%)	This Paper (%)	Δ (pp)
Humanity's Last Exam (HLE)	Accuracy	OpenAI (best reported result)	26.6	32.1	+5.5
Humanity's Last Exam (HLE)	Accuracy	Google Deep Research	26.9	32.1	+5.2
Humanity's Last Exam (HLE)	Accuracy	30% threshold	30.0	32.1	+2.1
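
The Δ column can be rechecked with a few lines of arithmetic. The sketch below recomputes the absolute gap in percentage points from the scores in the table and also derives the relative gain over each baseline, which the table does not report; the function name is illustrative.

```python
# Accuracy scores (%) copied from the table above.
X_MASTERS = 32.1
BASELINES = [26.6, 26.9, 30.0]


def gaps(score: float, baselines: list) -> list:
    """Return (baseline, absolute gap in pp, relative gain in %) for each baseline."""
    rows = []
    for baseline in baselines:
        absolute = round(score - baseline, 1)
        relative = round(100 * (score - baseline) / baseline, 1)
        rows.append((baseline, absolute, relative))
    return rows


for baseline, absolute, relative in gaps(X_MASTERS, BASELINES):
    print(f"baseline {baseline}%: +{absolute} pp ({relative}% relative)")
```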

Main Takeaways

Inference-time compute scaling (via scattering and stacking) allows open-source models to outperform proprietary models on complex reasoning tasks.
Tool augmentation (web search, Python execution) is critical for solving frontier scientific problems where internal knowledge is insufficient.
The 'Code as Interaction Language' paradigm effectively bridges reasoning models with external environments without requiring extensive retraining.

📚 Prerequisite Knowledge

Prerequisites

Foundational understanding of Large Language Models (LLMs) and prompting
Basics of Reinforcement Learning (specifically rollouts/exploration-exploitation)
Knowledge of agentic workflows (ReAct, multi-agent systems)

Key Terms

HLE: Humanity's Last Exam—a challenging benchmark developed by experts to evaluate AI on frontier scientific knowledge

Inference-time computation: Spending more computational resources during the generation phase (e.g., through multiple drafts, verification steps, or search) rather than just during training

Code as Interaction Language: A design paradigm where the agent uses executable programming code (Python) to interface with tools, offering higher precision than natural language or JSON

Scattered-and-Stacked: A workflow strategy alternating between parallel generation of diverse solutions (scattering) and aggregating/refining them (stacking)

Initial Reasoning Guidance: A prompting technique that injects first-person instructions into the model's context to steer a non-agentic model into behaving like an agent

Rollouts: In reinforcement learning, simulating multiple future trajectories to estimate the value of a current state; used here as an analogy for parallel solution generation

DeepSeek-R1: An open-source reasoning model used as the backbone for the agents in this paper