Evaluation Setup
Multimodal diagnostic question answering and reasoning evaluation across diverse clinical domains.
Benchmarks:
- 8 Clinical Vision Modalities (Diagnostic QA)
- Reasoning Trace Evaluation (salient region grounding, measured by IoU)
Metrics:
- Macro-F1 score (diagnosis)
- Intersection over Union (IoU) (grounding/interpretability)
- Statistical methodology: Not explicitly reported in the paper
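For reference, the two metrics above can be sketched in a few lines of Python. This is a minimal illustration of the standard definitions, not the paper's evaluation code. The function names `macro_f1` and `box_iou` and the `(x1, y1, x2, y2)` box format are assumptions made for this example.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    if not labels:
        return 0.0
    pairs = list(zip(y_true, y_pred))
    f1_scores = []
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2) (assumed format)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

Because macro-F1 weights every class equally, a rare diagnosis counts as much as a common one, which is why it suits skewed clinical label distributions.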
Key Results
DRPO training significantly outperforms GRPO in diagnostic accuracy across diverse modalities.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Average across 8 clinical vision modalities | Macro-F1 | Not reported in the paper | Not reported in the paper | +43% (relative improvement) |
| Salient Region Highlighting | IoU | Not reported in the paper | Not reported in the paper | 10x higher |
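The Δ values are relative figures, so they cannot be read as absolute score differences without the underlying baselines, which the summary does not give. A small sketch of the arithmetic follows. The numbers in the usage lines are hypothetical and chosen only to illustrate the calculation; they are not results from the paper.

```python
def relative_improvement(baseline, new):
    """Fractional gain over the baseline: (new - baseline) / baseline."""
    if baseline == 0:
        raise ValueError("relative improvement is undefined for a zero baseline")
    return (new - baseline) / baseline


def improvement_ratio(baseline, new):
    """Multiplicative gain, as in an 'N x higher' claim: new / baseline."""
    if baseline == 0:
        raise ValueError("ratio is undefined for a zero baseline")
    return new / baseline


# Hypothetical scores, for illustration only.
rel = relative_improvement(0.40, 0.572)   # a +43% relative gain is +0.172 absolute here
ratio = improvement_ratio(0.05, 0.50)     # a 10x ratio from a low starting IoU
```

The same +43% relative gain corresponds to a smaller absolute gain when the baseline is lower, which is why the missing baseline columns matter for interpreting the table.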
Main Takeaways
- DRPO effectively mitigates the performance imbalance caused by skewed clinical data distributions, preventing the model from overfitting to easy/abundant domains.
- QoQ-Med successfully integrates 1D ECG data with standard 2D/3D imaging, a capability missing in prior models like LLaVA-Med or Med-Flamingo.
- The model achieves high interpretability by accurately bounding salient regions (high IoU), matching proprietary models like OpenAI o4-mini in this specific capability.
- Hierarchical scaling in DRPO allows the model to prioritize learning from scarce and hard domains without the computational cost of a critic network.
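The idea of critic-free, domain-aware scaling can be sketched as follows. This is an illustrative approximation under stated assumptions, not the paper's DRPO formulation: the within-group standardization follows the general GRPO recipe, while the domain weight `1 / sqrt(n * mean_reward)` and its normalization are hypothetical choices made here to show how scarce (few samples) and hard (low mean reward) domains could be up-weighted. Rewards are assumed to lie in [0, 1], and all function names are invented for this example.

```python
import math
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Critic-free advantages: standardize rewards within one prompt's sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def domain_scales(groups_by_domain, eps=1e-6):
    """Hypothetical weights: larger for domains with fewer samples and lower mean reward."""
    raw = {}
    for domain, groups in groups_by_domain.items():
        flat = [r for group in groups for r in group]
        difficulty = max(mean(flat), eps)  # low mean reward means a harder domain
        raw[domain] = 1.0 / math.sqrt(len(flat) * difficulty)
    avg = mean(list(raw.values()))
    # Normalize so the weights average to 1 across domains.
    return {domain: w / avg for domain, w in raw.items()}


def scaled_advantages(groups_by_domain):
    """Apply the domain-level weight on top of the within-group advantages."""
    scales = domain_scales(groups_by_domain)
    return {
        domain: [[scales[domain] * a for a in group_advantages(group)]
                 for group in groups]
        for domain, groups in groups_by_domain.items()
    }


# Toy data: an abundant, easy domain and a scarce, hard one (made-up rewards).
toy = {
    "chest_xray": [[1, 1, 0, 1], [1, 0, 1, 1]],
    "ecg": [[0, 0, 1, 0]],
}
weights = domain_scales(toy)
advantages = scaled_advantages(toy)
```

In this toy setup the scarce, low-reward domain receives the larger weight, and no value network is needed because every quantity is computed from the sampled rewards themselves.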