| Benchmark | Metric | Baseline | This Paper | Δ (This Paper − Baseline) |
|---|---|---|---|---|
| Outcome evaluation shows a large gap between SOTA LMMs and human students (48.6 accuracy points), and only a small gap between Text-Only and Multimodal inputs (2.7 points). | ||||
| MM-MATH | Accuracy | 80.4 | 31.8 | -48.6 |
| MM-MATH | Accuracy | 11.6 | 31.8 | +20.2 |
| MM-MATH | Accuracy | 27.6 | 31.8 | +4.2 |
| MM-MATH | Accuracy | 23.2 | 25.9 | +2.7 |
| Performance degrades significantly as problem difficulty increases. | ||||
| MM-MATH (Hard Subset) | Accuracy | 45.8 | 10.9 | -34.9 |
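
The Δ column appears to be the plain difference between the "This Paper" and "Baseline" accuracies, in accuracy points. A minimal sketch that recomputes it from the rows above, assuming that reading of Δ (the `rows` list and the `delta` helper are illustrative names, not from the paper):

```python
# Rows copied from the table: (benchmark, baseline, this_paper, reported_delta).
rows = [
    ("MM-MATH", 80.4, 31.8, -48.6),
    ("MM-MATH", 11.6, 31.8, 20.2),
    ("MM-MATH", 27.6, 31.8, 4.2),
    ("MM-MATH", 23.2, 25.9, 2.7),
    ("MM-MATH (Hard Subset)", 45.8, 10.9, -34.9),
]


def delta(baseline: float, this_paper: float) -> float:
    """Accuracy difference (this_paper - baseline), rounded to one decimal."""
    return round(this_paper - baseline, 1)


# Check that every reported delta matches the recomputed one.
for benchmark, baseline, this_paper, reported in rows:
    assert delta(baseline, this_paper) == reported, (benchmark, baseline)
```

Rounding to one decimal place matches the precision of the table and avoids spurious mismatches from floating-point subtraction.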