Evaluation Setup
Visual perception tasks (REC, OVD, ISR) evaluated on standard benchmarks
Benchmarks:
- RefCOCOg: Referring Expression Comprehension (REC)
- COCO2017: Open-Vocabulary Object Detection (OVD)
- 3D-FRONT: Indoor Scene Refinement (ISR)
Metrics:
- Accuracy at an IoU threshold of 0.5 (Acc@0.5), for REC
- Mean Average Precision (mAP), for OVD
- Aesthetic Score, for ISR
- Statistical methodology: Not explicitly reported in the paper
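To make the Acc@0.5 metric concrete, here is a minimal Python sketch of how such a score is commonly computed for referring expression comprehension: a prediction counts as correct when its intersection-over-union (IoU) with the ground-truth box reaches the threshold. The function names `iou` and `acc_at_iou`, the `(x1, y1, x2, y2)` corner format, and the use of `>=` rather than `>` at the threshold are assumptions made for illustration; they are not taken from the paper's evaluation code.

```python
from typing import Sequence

# Assumed box format: (x1, y1, x2, y2) corners with x2 >= x1 and y2 >= y1.
Box = Sequence[float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Clamp at zero so disjoint boxes contribute no intersection area.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def acc_at_iou(preds: Sequence[Box], gts: Sequence[Box],
               threshold: float = 0.5) -> float:
    """Percentage of predictions whose IoU with the paired ground truth
    meets the threshold (one predicted box per expression is assumed)."""
    if len(preds) != len(gts):
        raise ValueError("preds and gts must have the same length")
    if not preds:
        return 0.0
    hits = sum(iou(p, g) >= threshold for p, g in zip(preds, gts))
    return 100.0 * hits / len(preds)


if __name__ == "__main__":
    preds = [(0, 0, 10, 10), (0, 0, 10, 10)]
    gts = [(0, 0, 10, 10), (20, 20, 30, 30)]
    print(acc_at_iou(preds, gts))  # one exact match, one miss
```

The mAP used for OVD builds on the same IoU matching but additionally ranks detections by confidence and averages precision over classes, so it is not reproduced here.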
Key Results
Main results comparing Syn-GRPO against SFT and Visual-RFT baselines across three tasks.
| Benchmark | Metric | Baseline | This Paper | Δ |
| --- | --- | --- | --- | --- |
| RefCOCOg (REC) | Acc@0.5 | 83.2 | 86.6 | +3.4 |
| COCO2017 (OVD) | mAP | 50.1 | 54.0 | +3.9 |
| 3D-FRONT (ISR) | Aesthetic Score | 5.62 | 5.71 | +0.09 |
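The Δ column is a plain absolute difference, and the three metrics live on different scales (percentage points for Acc@0.5 and mAP, a roughly single-digit scale for the Aesthetic Score), so the deltas are not directly comparable across rows. The short Python sketch below recomputes the absolute deltas from the reported numbers and, as an added convenience that is not reported in the source, derives the relative gain over the baseline. The `RESULTS` list and the `summarize` helper are illustrative names.

```python
# (benchmark, metric, baseline, this_paper) as reported in the results table.
RESULTS = [
    ("RefCOCOg (REC)", "Acc@0.5", 83.2, 86.6),
    ("COCO2017 (OVD)", "mAP", 50.1, 54.0),
    ("3D-FRONT (ISR)", "Aesthetic Score", 5.62, 5.71),
]


def summarize(rows):
    """Return (benchmark, absolute delta, relative gain in %) per row.

    The relative gain is derived here for illustration only; it is not
    a figure taken from the paper.
    """
    out = []
    for benchmark, _metric, base, new in rows:
        delta = round(new - base, 2)                 # rounding hides float noise
        rel = round(100.0 * (new - base) / base, 1)  # percent over baseline
        out.append((benchmark, delta, rel))
    return out


if __name__ == "__main__":
    for benchmark, delta, rel in summarize(RESULTS):
        print(f"{benchmark}: {delta:+} absolute, {rel:+}% relative")
```

Read this way, the small-looking +0.09 on the Aesthetic Score is a modest relative change on its own scale, while the REC and OVD gains are several percentage points.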
Main Takeaways
- Syn-GRPO outperforms standard GRPO (Visual-RFT) on all three tested visual perception tasks (REC, OVD, ISR).
- The method prevents diversity collapse: its diversity metrics remain stable or improve during training, whereas those of the baselines drop sharply.
- Generated data becomes increasingly complex and diverse over training iterations, suggesting true self-evolution.
- Scalability: The performance gap between Syn-GRPO and baselines widens as the amount of initial training data increases.