Physics Supernova: AI Agent Matches Elite Gold Medalists at IPhO 2025

📝 Paper Summary

Multi-call tool use with flexible plan
STEM problem solving

Physics Supernova is an agent system combining Gemini 2.5 Pro with specialized image analysis and self-review tools to achieve gold-medalist performance on the 2025 International Physics Olympiad.

Core Problem

Base LLMs struggle with the complex reasoning, precise figure measurement, and rigorous self-verification required for elite physics competitions like IPhO.

Why it matters:

Physics problems require interpreting visual data (schematics, plots) with precision that text-only models lack
Theoretical results must be physically meaningful; standard LLMs often fail to check whether their outputs violate physical constraints or established principles
Existing benchmarks often lack the novelty and fine-grained scoring of fresh Olympiad problems, and older problems risk data contamination

Concrete Example: In IPhO 2025 Theory Problem 1 Part C, a model must accurately read values from a figure to solve the problem. Reading the figure with the LLM alone gives a mean absolute error (MAE) of 0.015, whereas Physics Supernova's Image Analyzer reduces this error to 0.004.

Key Novelty

Physics-Oriented CodeAgent with Minimal Pre-definition

Adopts a flexible agent architecture (CodeAgent) where a Manager Agent autonomously plans and calls tools without hard-coded execution graphs
Integrates specialized physics tools: an Image Analyzer for precise data extraction from figures and an Answer Reviewer for checking physical validity (units, constraints)
Demonstrates that equipping a generalist LLM with domain-specific verification and vision tools bridges the gap to elite human performance

Architecture

The agent architecture of Physics Supernova, illustrating the interaction between the Manager Agent and its toolset.

Evaluation Highlights

Ranks 14th among 406 human contestants on IPhO 2025 Theory Problems, exceeding the median gold medalist score
Achieves 23.5/30 total score on IPhO 2025 theory problems, compared to the gold medalist median of 22.8
Reduces Mean Absolute Error (MAE) on figure reading tasks from 0.015 (LLM only) to 0.004 using the Image Analyzer tool
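The MAE figures above follow the standard definition: the average absolute difference between values read from a figure and their reference values. A minimal sketch follows; the function name and the sample readings are illustrative assumptions, not data from the paper.

```python
def mean_absolute_error(predicted, reference):
    """Average of |predicted - reference| over paired values."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - r) for p, r in zip(predicted, reference)) / len(predicted)


# Toy readings (hypothetical, NOT taken from the paper).
reference_values = [0.10, 0.20, 0.30]
figure_readings = [0.11, 0.19, 0.32]
mae = mean_absolute_error(figure_readings, reference_values)
print(f"MAE = {mae:.4f}")
```

A lower MAE means the values extracted from the figure sit closer to the reference values, which is the sense in which 0.004 improves on 0.015.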

Breakthrough Assessment

9/10

Achieving gold medal performance on a fresh, uncontaminated, elite physics benchmark (IPhO 2025) is a significant milestone, demonstrating that agents can match top human talent in specialized scientific reasoning.

⚙️ Technical Details

Problem Definition

Setting: Solving multi-part physics theory problems containing text and figures

Inputs: A physics problem set Q = {(q_j, s_j)} where q_j is a sub-question and s_j contains associated visual data

Outputs: Final answers for all subquestions, derived through a trajectory of reasoning and tool calls
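One way to picture the inputs and outputs above is as plain data containers. This is a sketch only: the class and field names are hypothetical and are not taken from the paper or its repository.

```python
from dataclasses import dataclass, field


@dataclass
class SubQuestion:
    """One pair (q_j, s_j): sub-question text plus its visual data."""
    text: str                                              # q_j
    figure_paths: list[str] = field(default_factory=list)  # s_j (may be empty)


@dataclass
class ProblemSet:
    """The problem set Q as an ordered list of sub-questions."""
    sub_questions: list[SubQuestion]


@dataclass
class Solution:
    """Final answers plus the trajectory that produced them."""
    answers: dict[int, str]   # sub-question index j -> final answer
    trajectory: list[str]     # log of reasoning steps and tool calls


# Example with placeholder content.
problem = ProblemSet([
    SubQuestion("Read the peak position from the plot.", ["fig_c1.png"]),
    SubQuestion("Derive the oscillation period."),
])
```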

Pipeline Flow

Manager Agent (receives problem)
Iterative Loop: Reason (Plan) → Code Generation → Tool Execution (Act) → Observation
Tools available: Image Analyzer, Answer Reviewer, Summarizer, WolframAlpha (optional)
Final Answer Generation
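The loop above can be sketched in a few lines. This is a simplification under stated assumptions: in the real system the Manager Agent is an LLM that writes code to call tools, whereas here a scripted planner (hypothetical) returns a tool name and argument directly, and the tools are stubs that make no model calls.

```python
from typing import Callable


# Hypothetical stand-ins for the real tools.
def image_analyzer(figure: str) -> str:
    return f"measured values from {figure}"


def answer_reviewer(draft: str) -> str:
    return f"review of '{draft}': no issues found"


TOOLS: dict[str, Callable[[str], str]] = {
    "image_analyzer": image_analyzer,
    "answer_reviewer": answer_reviewer,
}


def scripted_planner(history: list[str]) -> tuple[str, str]:
    """Stub for the planning step: returns (tool_name, argument) or ("final", answer)."""
    if not any(h.startswith("image_analyzer") for h in history):
        return "image_analyzer", "fig1.png"
    if not any(h.startswith("answer_reviewer") for h in history):
        return "answer_reviewer", "draft answer"
    return "final", "reviewed answer"


def run_manager(planner, max_steps: int = 10) -> tuple[str, list[str]]:
    """Iterate plan -> act -> observe until the planner emits a final answer."""
    history: list[str] = []
    for _ in range(max_steps):
        action, arg = planner(history)                 # Reason (plan)
        if action == "final":
            return arg, history                        # Final answer generation
        observation = TOOLS[action](arg)               # Tool execution (act)
        history.append(f"{action} -> {observation}")   # Observation
    raise RuntimeError("step budget exhausted without a final answer")


answer, trace = run_manager(scripted_planner)
```

The point of the sketch is that the order of tool calls is decided at run time by the planner from the accumulated history, not fixed in advance by the loop.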

System Modules

Manager Agent

Central planner that autonomously decides which tools to call based on problem progress

Model or implementation: Gemini 2.5 Pro

Image Analyzer

Extracts precise numeric values and measurements from problem figures

Model or implementation: Gemini 2.5 Pro (VLM capabilities)

Answer Reviewer

Critiques intermediate or final results for physical validity (e.g., unit consistency, limiting cases)

Model or implementation: Gemini 2.5 Pro
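To make "unit consistency" concrete, the sketch below shows a purely mechanical dimensional check. It is an illustration of the kind of test a reviewer can apply, not the paper's implementation: the actual Answer Reviewer is an LLM-based tool, and all names here are hypothetical.

```python
# Dimensions as exponent tuples over (mass, length, time).
Dim = tuple[int, int, int]

MASS: Dim = (1, 0, 0)
LENGTH: Dim = (0, 1, 0)
TIME: Dim = (0, 0, 1)


def mul(a: Dim, b: Dim) -> Dim:
    """Multiplying quantities adds their dimension exponents."""
    return tuple(x + y for x, y in zip(a, b))


def power(a: Dim, n: int) -> Dim:
    """Raising a quantity to the n-th power scales its exponents by n."""
    return tuple(x * n for x in a)


VELOCITY = mul(LENGTH, power(TIME, -1))   # L T^-1
ENERGY = mul(MASS, power(VELOCITY, 2))    # M L^2 T^-2
MOMENTUM = mul(MASS, VELOCITY)            # M L T^-1


def review_units(claimed: Dim, expected: Dim) -> str:
    """Flag an answer whose dimensions differ from what the question asks for."""
    if claimed == expected:
        return "ok: dimensions match"
    return f"flag: got {claimed}, expected {expected}"


print(review_units(ENERGY, ENERGY))     # a consistent answer
print(review_units(MOMENTUM, ENERGY))   # momentum offered where energy is expected
```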

Summarizer

Compresses the history of observations and actions

Model or implementation: Gemini 2.5 Pro
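A minimal sketch of history compression, assuming a simple policy of summarizing older entries and keeping the most recent ones verbatim. The policy, thresholds, and function names are assumptions for illustration; in the real system the summary is produced by an LLM, while the placeholder below only truncates text.

```python
def summarize(entries: list[str]) -> str:
    """Placeholder for an LLM summarization call; keeps only the head of each entry."""
    return "SUMMARY: " + " | ".join(e[:20] for e in entries)


def compress_history(history: list[str], keep_recent: int = 2,
                     max_entries: int = 4) -> list[str]:
    """Replace older entries with one summary once the history grows too long."""
    if len(history) <= max_entries:
        return history
    old, recent = history[:-keep_recent], history[-keep_recent:]
    return [summarize(old)] + recent


steps = [f"step {i}: observation text" for i in range(6)]
compressed = compress_history(steps)
```

The trade-off is the one noted under Limitations: whatever the summary omits is no longer available to later steps.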

Novel Architectural Elements

Integration of domain-specific 'physicist' tools (Answer Reviewer, Image Analyzer) into a flexible CodeAgent framework rather than a fixed chain
Self-planning capability where the agent writes code to orchestrate its own tool usage for physics derivation

Modeling

Base Model: Gemini 2.5 Pro

Compute: Not reported in the paper

Comparison to Prior Work

vs. ReAct: Physics Supernova is tailored to physics, with specialized vision and review tools, whereas ReAct is a general-purpose paradigm
vs. LLM-only baselines: Physics Supernova uses an agentic loop with external tools, enabling precise measurement and self-correction that plain LLMs lack
vs. Fixed-workflow solvers (e.g., Lean-based): Physics Supernova uses flexible self-planning rather than hard-coded execution graphs

Limitations

Relies on the closed-source Gemini 2.5 Pro model; performance with open-weight models is not explored
Variance in performance is higher on more difficult problems (Theory Problem 2)
Summarization memory is a lightweight workaround for context limits, potentially losing detail in very long derivations

Reproducibility

Code: https://github.com/CharlesQ9/Physics-Supernova

The code is publicly available at the repository above. The system relies on Gemini 2.5 Pro, which is accessible only through a closed-source API. Prompts for the Image Analyzer and Answer Reviewer are given in the paper's Appendix.

📊 Experiments & Results

Evaluation Setup

Zero-shot evaluation on IPhO 2025 Theory Problems

Benchmarks:

IPhO 2025 Theory Problems (Physics competition problem solving)

Metrics:

Total Score (out of 30)
Part-level scores
Rank among human contestants
Statistical methodology: Means and standard deviations reported over 5 independent runs
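Aggregating repeated runs into a mean and standard deviation can be done with the Python standard library, as sketched below. The scores are toy placeholders, not results from the paper, and the choice of the sample standard deviation (n - 1 denominator) is an assumption, since the summary does not say which variant is used.

```python
import statistics


def summarize_runs(scores):
    """Return (mean, sample standard deviation) over independent runs."""
    return statistics.mean(scores), statistics.stdev(scores)


# Toy scores for 5 runs (hypothetical, NOT the paper's numbers).
run_scores = [1.0, 2.0, 3.0, 4.0, 5.0]
mean_score, std_score = summarize_runs(run_scores)
print(f"{mean_score:.1f} ± {std_score:.2f}")
```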

Key Results

Physics Supernova significantly outperforms the base LLM and reaches gold medal standards on IPhO 2025.

Benchmark	Metric	Baseline	This Paper	Δ
IPhO 2025	Total Theory Score	15.6	23.5	+7.9
IPhO 2025	Rank (Lower is better)	30	14	-16

Ablation studies demonstrate the critical contribution of both the Image Analyzer and Answer Reviewer tools.

Benchmark	Metric	Baseline	This Paper	Δ
IPhO 2025	Total Theory Score	21.6	23.5	+1.9
IPhO 2025	Total Theory Score	18.8	23.5	+4.7
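The Δ column is simply "This Paper" minus "Baseline". The short check below recomputes it from the reported values. The row labels are descriptive placeholders, since the table does not say which tool is removed in each ablation row.

```python
# (label, baseline, this_paper) as reported in the Key Results table.
rows = [
    ("Total Theory Score, base LLM vs. agent", 15.6, 23.5),
    ("Rank, base LLM vs. agent", 30, 14),
    ("Total Theory Score, ablation row 1", 21.6, 23.5),
    ("Total Theory Score, ablation row 2", 18.8, 23.5),
]

# Round to one decimal to avoid floating-point noise in the differences.
deltas = [round(ours - base, 1) for _, base, ours in rows]

for (label, base, ours), delta in zip(rows, deltas):
    print(f"{label}: {base} -> {ours} (Δ = {delta:+})")
```

For the rank row a negative Δ is an improvement, because lower ranks are better.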

Experiment Figures

Comparison of figure reading accuracy between LLM Only and Image Analyzer on Theory Problem 1 Part C.

Main Takeaways

Physics Supernova achieves gold-medalist level performance, ranking 14th out of 406 human participants
The gap between the agent and the base LLM is largest on the most difficult problem (Theory Problem 2), suggesting that the agentic loop and tools help most where the base model struggles
Image Analyzer reduces measurement error significantly (MAE 0.015 -> 0.004), enabling success on figure-dependent sub-questions
Answer Reviewer acts as a critical safety check, preventing 'unphysical' answers that would otherwise lower the score

📚 Prerequisite Knowledge

Prerequisites

Understanding of agentic workflows (Reason-Act loops)
Familiarity with Large Language Models and tool use
Basic knowledge of physics problem types (theory vs. experimental)

Key Terms

IPhO: International Physics Olympiad—the most prestigious international physics competition for high school students

CodeAgent: An agent architecture from the smolagents framework where the agent writes and executes code to call tools

Reason-Act loop: An iterative process where an agent first generates a reasoning step (thought) and then performs an action (tool call)

MAE: Mean Absolute Error, the average of the absolute differences between predicted (here, figure-read) values and their reference values

WolframAlpha: A computational knowledge engine that answers factual queries by computing answers from externally sourced data

VLM: Vision Language Model—a model capable of understanding and processing both images and text

ReAct: Reasoning + Acting—a paradigm where LLMs generate reasoning traces and task-specific actions in an interleaved manner