Evaluation Setup
Chart question answering is evaluated across diverse benchmarks that cover extraction, reasoning, and robustness checks.
Benchmarks:
- MultiChartQA (Multi-hop reasoning across multiple charts)
- ChartInsights (Fine-grained analytics across 7 chart types)
- RobustCQA (Robustness to visual perturbations)
- MathVerse (Visual mathematical problem solving; out-of-domain)
Metrics:
- Accuracy (Exact Match or Relaxed Accuracy depending on dataset)
- Statistical methodology: Statistical significance is reported at alpha=0.05 for the main results.
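The two accuracy variants named above can be sketched as follows. This is a minimal illustration, not the paper's evaluation code: the function names are hypothetical, and the 5% numeric tolerance is an assumption borrowed from the convention commonly used for relaxed accuracy on chart QA datasets, since the summary does not state the tolerance each benchmark applies.

```python
def _to_float(text):
    """Parse a numeric answer, ignoring thousands separators and a trailing '%'."""
    cleaned = text.strip().replace(",", "").rstrip("%")
    try:
        return float(cleaned)
    except ValueError:
        return None


def exact_match(prediction, target):
    """Case-insensitive string equality after trimming whitespace."""
    return prediction.strip().lower() == target.strip().lower()


def relaxed_match(prediction, target, tolerance=0.05):
    """Numeric answers may deviate by `tolerance` (relative); others need exact match.

    The default tolerance of 0.05 is an assumption, not a value from the paper.
    """
    pred_num, target_num = _to_float(prediction), _to_float(target)
    if pred_num is not None and target_num is not None:
        if target_num == 0.0:
            # Relative error is undefined for a zero target; require equality.
            return pred_num == 0.0
        return abs(pred_num - target_num) / abs(target_num) <= tolerance
    return exact_match(prediction, target)


def accuracy(predictions, targets, match=relaxed_match):
    """Percentage of predictions accepted by the chosen matching rule."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        return 0.0
    hits = sum(match(p, t) for p, t in zip(predictions, targets))
    return 100.0 * hits / len(targets)
```

Passing `match=exact_match` to `accuracy` gives the stricter metric; the default accepts a prediction such as "104" for a target of "100" because the relative error (4%) falls within the assumed tolerance.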
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Main comparison results demonstrating Chart-RL superiority over SFT baselines on chart comprehension benchmarks. | | | | |
| MultiChartQA | Accuracy | 44.1 | 58.1 | +14.0 |
| ChartInsights | Accuracy | 48.2 | 53.7 | +5.5 |
| Robustness analysis shows Chart-RL improves consistency across visual perturbations. | | | | |
| RobustCQA | Category Improvement Count | 2 | 18 | +16 |
| Out-of-domain generalization to visual math problems. | | | | |
| MathVerse | Accuracy | 28.8 | 44.8 | +16.0 |
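The Δ column for the accuracy rows is the absolute difference in percentage points between the two score columns. The short sketch below recomputes it from the reported scores and also derives the gain relative to the baseline, which is not reported in the summary and is shown here only as an illustrative derived quantity. The names `RESULTS` and `gains` are hypothetical; RobustCQA is left out because its metric is a count rather than an accuracy.

```python
# Baseline and Chart-RL accuracies copied from the results table.
RESULTS = {
    "MultiChartQA": (44.1, 58.1),
    "ChartInsights": (48.2, 53.7),
    "MathVerse": (28.8, 44.8),
}


def gains(baseline, score):
    """Return (absolute gain in points, gain as a percentage of the baseline)."""
    absolute = round(score - baseline, 1)  # rounded to match the table's precision
    relative = 100.0 * (score - baseline) / baseline
    return absolute, relative


for name, (baseline, score) in RESULTS.items():
    absolute, relative = gains(baseline, score)
    print(f"{name}: {absolute:+.1f} points ({relative:+.1f}% relative to baseline)")
```

The absolute gains reproduce the Δ values in the table. The relative gains differ across benchmarks because the baselines differ, so a similar point gain on MathVerse and MultiChartQA corresponds to a larger proportional improvement on the benchmark with the lower baseline.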
Main Takeaways
- Task complexity is more critical than data quantity: training on 10 complex examples outperformed training on 6,000 simple examples.
- Chart-RL improves robustness to visual variations (layout, style) more than SFT does.
- RL training on charts facilitates transfer to out-of-domain tasks like visual mathematics (MathVerse), suggesting learned reasoning skills are generalizable.
- SFT often leads to regression compared to the baseline VLM on complex chart tasks, whereas RL consistently improves performance.