
Physics Supernova: AI Agent Matches Elite Gold Medalists at IPhO 2025

Jiahao Qiu, Jingzhe Shi, Xinzhe Juan, Zelin Zhao, Jiayi Geng, Shilong Liu, Hongru Wang, Sanfeng Wu, Mengdi Wang
Princeton University, Tsinghua University, Shanghai Jiao Tong University, University of Michigan, King's College London
arXiv.org (2025)
Agent MM Reasoning Benchmark

📝 Paper Summary

Multi-call tool use with flexible planning for STEM problem solving
Physics Supernova is an agent system combining Gemini 2.5 Pro with specialized image analysis and self-review tools to achieve gold-medalist performance on the 2025 International Physics Olympiad.
Core Problem
Base LLMs struggle with the complex reasoning, precise figure measurement, and rigorous self-verification required for elite physics competitions like IPhO.
Why it matters:
  • Physics problems require interpreting visual data (schematics, plots) with precision that text-only models lack
  • Theoretical results must be physically meaningful, yet standard LLMs often fail to check whether their outputs violate physical constraints or established principles
  • Existing benchmarks often lack the novelty and fine-grained scoring of fresh Olympiad problems, risking data contamination
Concrete Example: IPhO 2025 Theory Problem 1 Part C can only be solved by reading values accurately from a figure. A standard LLM may hallucinate or approximate those values poorly, giving a mean absolute error of 0.015, whereas Physics Supernova's Image Analyzer reduces the error to 0.004.
Key Novelty
Physics-Oriented CodeAgent with Minimal Pre-definition
  • Adopts a flexible agent architecture (CodeAgent) where a Manager Agent autonomously plans and calls tools without hard-coded execution graphs
  • Integrates specialized physics tools: an Image Analyzer for precise data extraction from figures and an Answer Reviewer for checking physical validity (units, constraints)
  • Demonstrates that equipping a generalist LLM with domain-specific verification and vision tools bridges the gap to elite human performance
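The two ideas in the list above, tools selected by name at run time and a reviewer that checks units and physical constraints, can be illustrated with a minimal sketch. This is not the authors' implementation: the `Answer` type, `review_answer`, `TOOLS`, `call_tool`, and the specific rules are all hypothetical names and choices made for illustration, and a real reviewer would presumably rely on the LLM rather than fixed rules.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

C_LIGHT = 299_792_458.0  # speed of light in m/s (exact SI value)


@dataclass
class Answer:
    """Hypothetical container for one final answer produced by the agent."""
    quantity: str  # kind of quantity, e.g. "speed", "mass", "efficiency"
    value: float
    unit: str      # unit string; "" means dimensionless


# Hypothetical table of expected SI units per quantity kind.
EXPECTED_UNITS = {"speed": "m/s", "mass": "kg", "efficiency": ""}


def review_answer(ans: Answer) -> List[str]:
    """Return a list of detected problems; an empty list means the answer passes."""
    problems = []
    expected = EXPECTED_UNITS.get(ans.quantity)
    if expected is not None and ans.unit != expected:
        problems.append(f"unit {ans.unit!r} does not match {ans.quantity}")
    if ans.quantity == "speed" and abs(ans.value) >= C_LIGHT:
        problems.append("speed is not below the speed of light")
    if ans.quantity == "mass" and ans.value < 0:
        problems.append("mass is negative")
    if ans.quantity == "efficiency" and not 0.0 <= ans.value <= 1.0:
        problems.append("efficiency outside [0, 1]")
    return problems


# Tool registry: the caller picks a tool by name at run time,
# so there is no hard-coded execution graph.
TOOLS: Dict[str, Callable] = {"answer_reviewer": review_answer}


def call_tool(name: str, *args):
    """Dispatch to a registered tool, raising KeyError for unknown names."""
    if name not in TOOLS:
        raise KeyError(f"unknown tool: {name}")
    return TOOLS[name](*args)
```

In this sketch, `call_tool("answer_reviewer", Answer("speed", 3.5e8, "m/s"))` returns one problem, which a manager could feed back into its next planning step instead of accepting the answer.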
Architecture
Figure 1: The agent architecture of Physics Supernova, illustrating the interaction between the Manager Agent and its toolset.
Evaluation Highlights
  • Ranks 14th among 406 human contestants on IPhO 2025 Theory Problems, exceeding the median gold medalist score
  • Achieves 23.5/30 total score on IPhO 2025 theory problems, compared to the gold medalist median of 22.8
  • Reduces Mean Absolute Error (MAE) on figure reading tasks from 0.015 (LLM only) to 0.004 using the Image Analyzer tool
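For readers unfamiliar with the metric, the sketch below shows how a mean absolute error is computed and how the two reported values compare. The function names are hypothetical and the code is not taken from the paper; only the figures 0.015 and 0.004 come from the summary above.

```python
def mean_absolute_error(predicted, reference):
    """Average of |predicted - reference| over paired readings."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(abs(p - r) for p, r in zip(predicted, reference)) / len(predicted)


def relative_reduction(baseline, improved):
    """Fraction by which `improved` lowers the error relative to `baseline`."""
    return (baseline - improved) / baseline


# Reported figure-reading MAE: 0.015 (LLM only) vs 0.004 (with Image Analyzer).
reduction = relative_reduction(0.015, 0.004)
print(f"relative MAE reduction: {reduction:.1%}")
```

With the reported values, the Image Analyzer lowers the figure-reading error by roughly 73% relative to the LLM-only baseline.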
Breakthrough Assessment
9/10
Achieving gold medal performance on a fresh, uncontaminated, elite physics benchmark (IPhO 2025) is a significant milestone, demonstrating that agents can match top human talent in specialized scientific reasoning.