Evaluation Setup
Multiple-choice question answering on the constructed Image Screening Dataset.
Benchmarks:
- Image Screening Dataset (Ours) (Visual Aesthetic Reasoning / Flaw Detection) [New]
- Public Benchmarks (MMBench, MME, etc.) (General Multimodal Understanding)
Metrics:
- Accuracy (Score)
- Pass@1 (implied by single score reporting)
- Statistical methodology: Not explicitly reported in the paper
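As a minimal sketch of how a single accuracy score over multiple-choice questions is typically computed (the function name, the option-letter format, and the toy data are assumptions for illustration, not the paper's evaluation code):

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Return the percentage of questions whose predicted option matches the key."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("cannot score an empty evaluation set")
    # Compare option letters case-insensitively, ignoring surrounding whitespace.
    correct = sum(
        p.strip().upper() == a.strip().upper()
        for p, a in zip(predictions, answers)
    )
    return 100.0 * correct / len(answers)


# Hypothetical toy data: one answer per question, as in Pass@1-style scoring.
preds = ["A", "c", "B", "D"]
gold = ["A", "C", "D", "D"]
score = accuracy(preds, gold)
print(f"accuracy = {score:.2f}")
```

Because only one answer is scored per question, this number coincides with Pass@1 under the assumption that each question is attempted exactly once.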
Key Results
Main comparison on the proposed Image Screening Dataset showing the superiority of HCM-GRPO-2B over much larger models.

| Benchmark | Metric | Baseline | This Paper | Δ |
| --- | --- | --- | --- | --- |
| Image Screening Dataset | Score | 44.15 | 64.74 | +20.59 |
| Image Screening Dataset | Score | 45.20 | 64.74 | +19.54 |
| Image Screening Dataset | Score | 40.35 | 64.74 | +24.39 |
| Image Screening Dataset | Score | 57.75 | 64.74 | +6.99 |
| Image Screening Dataset | Score | 61.35 | 64.74 | +3.39 |
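The Δ values in the table are the absolute difference, in score points, between the reported score of 64.74 and each baseline score. The short check below reproduces that arithmetic; the variable names are illustrative, and `Decimal` is used only to avoid floating-point rounding noise.

```python
from decimal import Decimal

# Scores copied from the table; baseline rows are listed in table order.
baseline_scores = ["44.15", "45.20", "40.35", "57.75", "61.35"]
this_paper = Decimal("64.74")

# Absolute improvement in score points for each row.
deltas = [this_paper - Decimal(b) for b in baseline_scores]

for b, d in zip(baseline_scores, deltas):
    print(f"baseline {b} -> delta +{d}")
```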
Main Takeaways
- Existing SOTA models (GPT-4o, Qwen-VL-Max) perform poorly on image aesthetic reasoning, often close to random guessing.
- HCM-GRPO allows a small 2B model to substantially outperform 70B+ models and closed-source APIs on this specific task.
- The two-stage training (SFT cold start + HCM-GRPO) is critical for performance.
- Hard Cases Mining effectively forces the model to learn from its errors, boosting performance beyond standard GRPO.
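
To make the last takeaway concrete, here is a rough sketch of how hard-case selection could sit alongside GRPO-style group-relative advantages. This section does not give the paper's selection rule, so the `mine_hard_cases` function, the `max_pass_rate` threshold, the binary rewards, and the toy rollouts are all assumptions for illustration; only the idea of normalizing rewards within a group of sampled responses comes from standard GRPO.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the mean and std of its own sample group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group (an assumption)
    return [(r - mu) / (sigma + eps) for r in rewards]


def mine_hard_cases(group_rewards, max_pass_rate=0.5):
    """Return prompt ids whose sampled answers were mostly wrong.

    group_rewards maps a prompt id to the 0/1 rewards of its sampled responses.
    The threshold is hypothetical, not a value taken from the paper.
    """
    return [
        pid for pid, rewards in group_rewards.items()
        if mean(rewards) <= max_pass_rate
    ]


# Hypothetical rollouts: four sampled answers per question, reward 1 if correct.
rollouts = {
    "q1": [1, 1, 1, 1],  # always solved, carries no learning signal
    "q2": [0, 1, 0, 0],  # mostly wrong, kept as a hard case
    "q3": [0, 0, 0, 0],  # never solved, kept as a hard case
}

hard_cases = mine_hard_cases(rollouts)
advantages_q2 = group_relative_advantages(rollouts["q2"])
print(hard_cases)
print([round(a, 3) for a in advantages_q2])
```

In this sketch the mined questions would be fed back into a further training round, which is one plausible reading of "learning from its errors"; groups with identical rewards get zero advantage, so a filter of this kind also concentrates training on questions where the model's answers still vary or fail.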