Evaluation Setup
Medical Visual Question Answering (VQA) across diverse datasets, evaluating both reasoning quality and final-answer accuracy
Benchmarks:
- Silvar-Med (Subset) (Reasoning-focused Medical VQA) [New]
- VQA-RAD (Radiology VQA)
- SLAKE (English) (Bilingual Medical VQA)
- VQA-Med 2019 (Medical VQA)
- Path-VQA (Pathology VQA)
Metrics:
- Reasoning Accuracy (Human Eval / LLM-as-Judge)
- Final Answer Accuracy (Exact Match / BERTScore / LLM-as-Judge)
- Statistical methodology: Not explicitly reported in the paper
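As a rough illustration of the Exact Match metric listed above, the sketch below scores predictions against references after light normalization. The normalization rules (lowercasing, dropping punctuation and English articles) and the function names are assumptions for illustration; the paper's actual scoring script is not described here.

```python
import re
import string


def normalize(answer: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace.

    These normalization rules are an assumption, not the paper's procedure.
    """
    text = answer.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match_accuracy(predictions, references):
    """Return the percentage of predictions equal to their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return 100.0 * hits / len(predictions)


# Made-up example answers, not taken from any benchmark.
preds = ["Yes.", "The left lung", "pleural effusion is present"]
refs = ["yes", "left lung", "pleural effusion"]
print(round(exact_match_accuracy(preds, refs), 2))  # 66.67
```

The third prediction is semantically right but fails the string comparison, which is one reason Exact Match is typically reported alongside softer scorers such as BERTScore or an LLM judge.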
Key Results
RARL significantly improves performance on reasoning-focused tasks compared to standard Supervised Fine-Tuning (SFT).

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Silvar-Med (Curated Test Set) | Reasoning Accuracy (Human Eval) | 57.08 | 64.86 | +7.78 |

Ablation study showing the impact of RARL combined with Diversity Prompting on unseen benchmarks (Generalization).

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| VQA-RAD | Accuracy (GPT-4o mini) | Not reported in the paper | Not reported in the paper | +4.17 |
| SLAKE | Accuracy (GPT-4o mini) | Not reported in the paper | Not reported in the paper | +9.18 |
| Path-VQA | Accuracy (GPT-4o mini) | Not reported in the paper | Not reported in the paper | +4.41 |
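The Δ column for Silvar-Med matches an absolute difference in percentage points (64.86 − 57.08). The short sketch below reproduces that arithmetic; the helper names are hypothetical, and the relative-gain helper is included only for comparison, since the summary reports absolute gains. Because no baselines are reported for the generalization rows, this interpretation cannot be checked for those rows.

```python
def absolute_gain(baseline: float, result: float) -> float:
    """Difference in percentage points, rounded to two decimals."""
    return round(result - baseline, 2)


def relative_gain(baseline: float, result: float) -> float:
    """Improvement as a percentage of the baseline score, rounded to two decimals."""
    return round(100.0 * (result - baseline) / baseline, 2)


# Silvar-Med reasoning accuracy (human evaluation), values from the table.
sft_baseline, rarl_score = 57.08, 64.86
print(absolute_gain(sft_baseline, rarl_score))  # 7.78
print(relative_gain(sft_baseline, rarl_score))
```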
Main Takeaways
- Reasoning-Aware RL (RARL) outperforms Supervised Fine-Tuning (SFT) across both in-domain reasoning tasks and unseen benchmarks.
- Diversity prompting (mixing explanation-required, short-form, and open-ended prompts) is crucial for generalization.
- Training on small datasets (500-1000 samples) with RL + LoRA is more effective than SFT, which makes the approach suitable for data-scarce medical domains.
- A gap persists between reasoning quality and final answer accuracy; models may reason correctly but fail to output the exact ground truth format, or vice versa.
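To make the diversity-prompting idea concrete, here is a minimal sketch that assigns each question one of the three prompt styles named above. The template wording, the balanced round-robin assignment, and the function name are all assumptions made for illustration; the paper's actual prompts and mixing ratios are not given in this summary.

```python
import random

# Hypothetical templates: the wording is illustrative, not the paper's prompts.
PROMPT_STYLES = {
    "explanation_required": (
        "Question: {question}\n"
        "Explain your reasoning step by step, then state the final answer."
    ),
    "short_form": "Question: {question}\nAnswer with a single word or short phrase.",
    "open_ended": (
        "Question: {question}\n"
        "Describe what you observe in the image and answer the question."
    ),
}


def build_diverse_prompts(questions, seed=0):
    """Pair each question with a prompt style, keeping the style mix balanced."""
    rng = random.Random(seed)
    styles = sorted(PROMPT_STYLES)
    # Round-robin assignment balances the styles; shuffling removes ordering patterns.
    assigned = [styles[i % len(styles)] for i in range(len(questions))]
    rng.shuffle(assigned)
    return [
        {"style": style, "prompt": PROMPT_STYLES[style].format(question=question)}
        for question, style in zip(questions, assigned)
    ]


# Made-up questions for demonstration only.
questions = ["Is there a fracture?", "Which organ is shown?", "What modality is this?"]
for sample in build_diverse_prompts(questions, seed=1):
    print(sample["style"], "->", sample["prompt"].splitlines()[-1])
```

A balanced mix is only one possible design; sampling styles with unequal weights would be an equally plausible reading of "mixing" and would need only a change to the assignment step.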