GCoT: Grounded Chain-of-Thought, a reasoning process in which multimodal large language models (MLLMs) output bounding box coordinates for relevant objects alongside text steps before answering.
Answer-Grounding Consistency: A metric measuring the percentage of samples where the model predicts *both* the correct text answer and the correct bounding box evidence.
Visual Hallucination: In this context, the case where an MLLM generates a correct answer from language priors (bias) rather than from actual visual perception.
IoU: Intersection over Union, a standard metric for the overlap between a predicted bounding box and the ground truth box: the area of their intersection divided by the area of their union.
SFT: Supervised Fine-Tuning—training a pre-trained model on a specific labeled dataset to adapt its behavior.
Visual Genome: A large-scale dataset providing detailed scene graphs, object bounding boxes, and attribute annotations used here to construct MM-GCoT.
Acc@0.5: Accuracy where a prediction is considered correct only if the Intersection over Union (IoU) with the ground truth is greater than 0.5.
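
The three metric entries above (IoU, Acc@0.5, and Answer-Grounding Consistency) can be illustrated with a minimal sketch. It assumes boxes are given as (x_min, y_min, x_max, y_max) tuples and that a bounding box counts as "correct" when its IoU exceeds the threshold; the function names and the box format are illustrative assumptions, not taken from the source.

```python
from typing import Sequence, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in pixel or normalized units.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection area divided by union area of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def acc_at_iou(preds: Sequence[Box], gts: Sequence[Box],
               threshold: float = 0.5) -> float:
    """Fraction of predictions whose IoU with the ground truth exceeds threshold."""
    if not gts:
        return 0.0
    hits = sum(iou(p, g) > threshold for p, g in zip(preds, gts))
    return hits / len(gts)


def answer_grounding_consistency(answer_correct: Sequence[bool],
                                 preds: Sequence[Box], gts: Sequence[Box],
                                 threshold: float = 0.5) -> float:
    """Fraction of samples with BOTH a correct answer and a correct box.

    Assumption: box correctness is judged with the same IoU threshold
    as Acc@0.5; the glossary does not state this explicitly.
    """
    if not gts:
        return 0.0
    both = sum(ok and iou(p, g) > threshold
               for ok, p, g in zip(answer_correct, preds, gts))
    return both / len(gts)
```

Under this reading, a sample with a right answer but a poorly localized box does not count toward consistency, which is the situation the Visual Hallucination entry describes.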