GRPO: Group Relative Policy Optimisation—a critic-free RL algorithm that normalizes rewards within a sampled group to estimate advantages.
DPO: Direct Preference Optimization—an algorithm that aligns models to preferences by increasing the likelihood of chosen responses over rejected ones.
RRG: Radiology Report Generation—the task of automatically writing medical reports from imaging.
ZRR: Zero-Reward Rate—a metric introduced in this paper to quantify how often the model produces a group of outputs with no valid reward signal.
FactS: FactScore-based Reward—a proposed reward function that extracts atomic facts from generated text and checks their truthfulness against ground-truth labels.
Entailment: In this context, checking if a generated clinical fact is logically consistent with or supported by the ground-truth diagnostic labels.
Qwen2.5-VL: The specific Vision-Language Model backbone used as the policy in this paper.
SFT: Supervised Fine-Tuning—standard training on labeled image-text pairs.