Evaluation Setup
Visual Question Answering on Ultra-High-Resolution (UHR) images under a fixed inference budget
Benchmarks:
- XLRS-Bench (Ultra-High-Resolution Remote Sensing VQA)
Metrics:
- Pass@1 (Average Performance)
- Pass@32 (Reasoning Boundary)
- Statistical methodology: Not explicitly reported in the paper
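The summary does not say how Pass@k is estimated. As a minimal sketch, assuming the widely used unbiased estimator (n samples per question, c of them correct, pass@k = 1 - C(n-c, k) / C(n, k)), the two metrics could be computed as follows. The function names and the per-question correct-count input are hypothetical and not taken from the paper.

```python
from math import comb
from typing import Sequence


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one question: n samples drawn, c correct.

    Assumption: this is the standard estimator, not necessarily the
    one used in the paper.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fewer than k incorrect samples: every size-k subset contains a hit.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(correct_counts: Sequence[int], n: int, k: int) -> float:
    """Mean pass@k over all questions, as a percentage."""
    scores = [pass_at_k(n, c, k) for c in correct_counts]
    return 100.0 * sum(scores) / len(scores)


# Hypothetical correct counts for three questions, 32 samples each.
counts = [0, 8, 32]
print(benchmark_pass_at_k(counts, n=32, k=1))   # average performance
print(benchmark_pass_at_k(counts, n=32, k=32))  # reasoning boundary
```

With k = 1 the estimator reduces to the fraction of correct samples per question, which is why Pass@1 reads as average performance. With k = n it is 1 for any question solved at least once, which is why Pass@32 reads as a coverage or boundary measure.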
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| XLRS-Bench | Pass@1 | Not reported in the paper | 60.40 | Not reported in the paper |
| XLRS-Bench | Pass@1 | 60.40 | 54.49 | -5.91 |
| XLRS-Bench | Pass@32 | Not reported in the paper | Not reported in the paper | -0.50 |
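The Δ column appears to follow the convention Δ = This Paper - Baseline, in the same units as the metric. The short check below, with a hypothetical helper name, confirms that the one row reporting both values is consistent with that reading. Rows where either value is not reported cannot be checked this way.

```python
def delta(baseline: float, this_paper: float) -> float:
    """Difference in metric points, assuming Δ = This Paper - Baseline."""
    return round(this_paper - baseline, 2)


# Second Pass@1 row of the table: baseline 60.40, this paper 54.49.
print(delta(60.40, 54.49))  # -5.91
```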
Main Takeaways
- High-quality Earth-science text-only QA is a primary driver of visual reasoning gains in UHR scenarios, even without images.
- Reasoning boundary (Pass@32) is driven by domain-prior coverage, while average performance (Pass@1) is driven by reasoning structure (CoT) and agentic tuning.
- Agentic RLVR is unstable without sufficient domain supervision; 'pre-warming' with hard image-text pairs is essential.