PhyCritic: Multimodal Critic Models for Physical AI

Tianyi Xiong, Shihao Wang, Guilin Liu, Yi Dong, Ming Li, Heng Huang, Jan Kautz, Zhiding Yu
NVIDIA, University of Maryland
arXiv (2026)
MM RL Reasoning Benchmark

📝 Paper Summary

Multimodal Large Language Models (MLLMs) | Physical AI / Embodied AI | Reward Modeling / Critic Models
PhyCritic is a multimodal critic model optimized for physical AI via a two-stage reinforcement learning pipeline that grounds judgments in the model's own internal physical reasoning.
Core Problem
Existing multimodal critic models are trained on general domains (captioning, QA) and lack the physical awareness to evaluate whether reasoning is causally valid or spatially correct in embodied scenarios.
Why it matters:
  • Current critics fail to flag reasoning that is visually coherent but physically impossible, which is dangerous in safety-critical domains such as autonomous driving and robotics
  • Reliable evaluation is essential for scaling physical AI, but existing benchmarks overlook affordance reasoning and causal dynamics
  • Standard critics do not ground decisions in their own physical understanding, leading to inconsistent or superficial verdicts
Concrete Example: A standard critic might approve a robot's plan to 'pick up the mug' based on visual captioning alignment, failing to notice the mug is physically obstructed or out of reach, whereas a physics-aware critic would identify the affordance violation.
Key Novelty
Self-Referential Critic Finetuning for Physical AI
  • Treats the critic like an expert human judge: the model must first solve the physical reasoning problem itself before evaluating other models' answers
  • Uses a two-stage RLVR (reinforcement learning with verifiable rewards) pipeline: the first stage warms up the model's physical reasoning skills, and the second trains the critic to generate an internal reference prediction and ground its critique in that prediction
  • Optimizes using GRPO (Group Relative Policy Optimization) with a composite reward function that incentivizes both correct self-prediction and accurate preference ranking
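
The composite reward and the group-relative step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names, the equal weights `w_self` and `w_pref`, the exact-match answer check, and the use of the population standard deviation are all hypothetical choices, since the summary does not give these details.

```python
from statistics import mean, pstdev


def composite_reward(self_pred, gold_answer, verdict, gold_preference,
                     w_self=0.5, w_pref=0.5):
    """Hypothetical composite reward for one critic rollout.

    self_pred:       the critic's own answer to the physical question
    verdict:         which candidate response the critic preferred ("A" or "B")
    w_self, w_pref:  assumed weights; the summary does not state the real ones
    """
    # Verifiable reward 1: did the critic solve the problem itself?
    r_self = 1.0 if self_pred.strip().lower() == gold_answer.strip().lower() else 0.0
    # Verifiable reward 2: did the critic pick the labeled better response?
    r_pref = 1.0 if verdict == gold_preference else 0.0
    return w_self * r_self + w_pref * r_pref


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: normalize each reward against its own group.

    All rewards come from rollouts sampled for the same prompt, so no
    learned value function is needed as a baseline.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations vary
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled rollouts for one prompt: (self-prediction, preference verdict).
rollouts = [("B", "A"), ("B", "B"), ("C", "A"), ("C", "B")]
rewards = [composite_reward(p, "B", v, "A") for p, v in rollouts]
advantages = group_relative_advantages(rewards)
```

In this toy group, the rollout that both answers correctly and ranks correctly receives a positive advantage, the rollout that fails both receives a negative one, and the two partially correct rollouts sit at the group mean.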
Evaluation Highlights
  • Outperforms open-source baselines on the Cosmos-Reason1 validation set with a +4.1% accuracy gain
  • Achieves the highest performance among 7B/8B models on the new PhyCritic-Bench, surpassing InternVL2.5-8B-MPO by +8.6%
  • Demonstrates strong generalization to general reward benchmarks, exceeding Qwen2-VL-7B-Instruct by +10.0% on VL-RewardBench
Breakthrough Assessment
8/10
Significantly advances multimodal evaluation by introducing self-referential grounding for physical tasks. Addresses a critical gap in embodied AI evaluation with a novel training paradigm and benchmark.