Evaluation Setup
Multimodal QA and Captioning under constrained modality access (uni-, bi-, and tri-modal)
Benchmarks:
- MAPLE-QA: multiple-choice QA (discriminative) [New]
- MAPLE-Caption: open-ended captioning (generative) [New]
Metrics:
- Pass@1 Accuracy
- Modality Gap (performance scaling across RMT groups)
- Training Efficiency (wall-clock time)
- Fusion Gain
- Statistical methodology: Not explicitly reported in the paper
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| MAPLE-QA | Convergence Speed | 1.0 | 3.18 | +2.18 |
| MAPLE-QA | Uni/Multi-modal Accuracy Gap Reduction | 0.0 | 30.24 | +30.24 |
| MAPLE-QA | Pass@1 Accuracy | 58.58 | 58.68 | +0.10 |
| MAPLE-QA | Policy Gradient Variance Reduction | 0.0 | 12.89 | +12.89 |
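The Δ column appears to be the plain difference between the "This Paper" and "Baseline" values, in each metric's own units. The short check below recomputes it from the reported numbers; the variable and function names are illustrative and not from the paper.

```python
# Rows copied from the Key Results table:
# (metric, baseline, this_paper, reported_delta)
rows = [
    ("Convergence Speed", 1.0, 3.18, 2.18),
    ("Uni/Multi-modal Accuracy Gap Reduction", 0.0, 30.24, 30.24),
    ("Pass@1 Accuracy", 58.58, 58.68, 0.10),
    ("Policy Gradient Variance Reduction", 0.0, 12.89, 12.89),
]


def delta(baseline: float, this_paper: float) -> float:
    """Absolute difference, rounded to the table's two decimal places."""
    return round(this_paper - baseline, 2)


# Recompute every delta and compare against the reported value.
deltas = {name: delta(b, p) for name, b, p, _ in rows}
consistent = all(abs(deltas[name] - reported) < 1e-9
                 for name, _, _, reported in rows)
```

Under this reading, the Convergence Speed entry of +2.18 is the difference between two speed factors (1.0 and 3.18), and the Pass@1 entry is a change of 0.10 in the same units as the accuracy figures.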
Main Takeaways
- Stratifying batches by required modality reduces policy gradient variance by 12.89% (~13%), because each reward is normalized against samples of comparable difficulty.
- Modality-aware training (MAPO) converges 3.18x faster than training that treats all data as a single uniform distribution.
- The approach improves robustness, narrowing the performance gap between tasks requiring single vs. multiple modalities, which is critical for real-world deployments where sensors may fail.
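The stratification idea in the first takeaway can be illustrated with a minimal sketch. This is not the paper's MAPO implementation, whose exact update rule is not given in this summary. It assumes a simple standardization (reward minus mean, divided by standard deviation) applied either within each required-modality stratum or across the whole batch. The function names, the group labels `uni`/`bi`/`tri`, and the reward values are all hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def stratified_advantages(rewards, groups, eps=1e-8):
    """Standardize each reward against the rewards in its own stratum."""
    by_group = defaultdict(list)
    for r, g in zip(rewards, groups):
        by_group[g].append(r)
    # Per-stratum (mean, population std) used as the normalization baseline.
    stats = {g: (mean(rs), pstdev(rs)) for g, rs in by_group.items()}
    return [(r - stats[g][0]) / (stats[g][1] + eps)
            for r, g in zip(rewards, groups)]


def global_advantages(rewards, eps=1e-8):
    """Standardize every reward against the whole batch (no stratification)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical batch: tasks needing more modalities earn lower raw rewards.
rewards = [0.9, 0.7, 0.5, 0.3, 0.2, 0.0]
groups = ["uni", "uni", "bi", "bi", "tri", "tri"]

strat = stratified_advantages(rewards, groups)
glob = global_advantages(rewards)
```

With global normalization in this toy batch, both `uni` samples receive positive advantages and both `tri` samples negative ones, so the signal mostly reflects task difficulty. With per-stratum normalization, each group contributes one above-average and one below-average sample, so the signal reflects relative quality among comparable tasks.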