Evaluation Setup
Pairwise preference prediction across physical and general domains
Benchmarks:
- PhyCritic-Bench (Physical AI Reasoning & Planning (Robotics + Autonomous Driving)) [New]
- Cosmos-Reason1 (Val) (Physical Common Sense & Causal Reasoning)
- VL-RewardBench (General Multimodal Reward Modeling)
- Multimodal RewardBench (General Multimodal Reward Modeling)
Metrics:
- Accuracy (Preference Prediction)
- Accuracy (as Policy Model)
- Statistical methodology: Not explicitly reported in the paper
Key Results
| Benchmark |
Metric |
Baseline |
This Paper |
Δ |
| PhyCritic achieves superior performance on the newly proposed physical AI benchmark compared to open-source baselines. |
| PhyCritic-Bench |
Accuracy |
73.3 |
81.9 |
+8.6
|
| PhyCritic-Bench |
Accuracy |
73.6 |
81.9 |
+8.3
|
| The model also generalizes well to general-purpose multimodal reward benchmarks. |
| VL-RewardBench |
Overall Accuracy |
76.4 |
86.4 |
+10.0
|
| When used as a policy model (answering questions directly), PhyCritic improves over its base model. |
| Cosmos-Reason1 (Val) |
Accuracy |
57.3 |
61.4 |
+4.1
|
Main Takeaways
- Self-referential training significantly boosts critic accuracy by grounding judgments in internal reasoning.
- Physical domain training transfers positively to general multimodal reward tasks, suggesting physical reasoning is a core capability.
- The two-stage pipeline (Skill Warmup + Critic Finetuning) effectively transforms a standard VLM into a specialized physical critic.
- PhyCritic outperforms much larger or more specialized baselines (like InternVL2.5-MPO) on physical benchmarks despite being 7B parameters.