FGVR: Fine-Grained Visual Recognition—distinguishing between very similar sub-categories (e.g., different species of birds).
CoT SFT: Chain-of-Thought Supervised Fine-tuning—training a model on examples that include intermediate reasoning steps before the final answer.
TAPO: Triplet Augmented Policy Optimization—the proposed RL algorithm that uses triplets of images (anchor, positive, negative) to optimize the model's policy.
Intra-class variance: Visual differences between images of the same category (e.g., same bird in different poses/lighting).
Inter-class variance: Visual differences between images of different categories (often very subtle in FGVR).
DAPO: Decoupled Clip and Dynamic Sampling Policy Optimization—an RL algorithm, originally proposed for LLM reasoning, that Fine-R1 builds upon.
GRPO: Group Relative Policy Optimization—an RL algorithm that estimates baselines from groups of outputs rather than a separate value network.
KL divergence: Kullback-Leibler divergence—a measure of how much one probability distribution differs from another (asymmetric, so not a true distance), used here to force the model's predictions to differ significantly when the input image changes to a different sub-category.
Information Bottleneck: A method to extract the most relevant information while discarding noise, used here to select visual concepts for CoT data.
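
Two of the entries above (GRPO and KL divergence) are easier to follow with a small numeric sketch. The Python below is illustrative only and is not taken from Fine-R1: the function names, the `eps` smoothing constants, and the use of the population standard deviation are assumptions made for this example.

```python
import math
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style baseline: score each sampled output against its own group.

    The group mean stands in for a learned value network. Dividing by the
    population standard deviation is an assumption of this sketch.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same classes.

    eps (an assumption of this sketch) guards against log(0) and division
    by zero; terms with p_i == 0 contribute nothing and are skipped.
    """
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


# Four sampled answers to one prompt: two rewarded, two not.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# Hypothetical class probabilities for an anchor image and a negative image.
anchor_probs = [0.9, 0.1]
negative_probs = [0.5, 0.5]
forward = kl_divergence(anchor_probs, negative_probs)
reverse = kl_divergence(negative_probs, anchor_probs)
```

In this toy case the rewarded answers receive positive advantages and the unrewarded ones negative, with no value network involved. `forward` and `reverse` differ, which shows the asymmetry of KL divergence; a larger value means the two predicted distributions are further apart.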