VLM: Vision-Language Model, an AI model that processes both images and text to perform tasks such as captioning or visual question answering
Elo rating: A rating system calculated from pairwise win/loss records to estimate the relative skill levels of competitors (originally from chess)
Bradley-Terry model: A statistical model that assigns each competitor a latent strength and predicts the probability that one beats another in a pairwise comparison; used here to convert win/loss data into model scores
VLM-as-a-Judge: Using a strong VLM (like GPT-4o) to evaluate and rank the outputs of other models, essentially automating the role of a human judge
DOCCI: Descriptions of Connected and Contrasting Images, a dataset of images paired with long, high-quality, human-annotated descriptions, used here as the source for evaluation
Hallucination: When a model generates descriptions of objects or details that are not actually present in the image
CLIPScore: A metric that measures the semantic similarity between an image and a caption using embeddings from the CLIP model; found here to be ineffective for detailed captions
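METEOR: Metric for Evaluation of Translation with Explicit ORdering, a rule-based metric that scores a candidate text against a reference using a recall-weighted harmonic mean of unigram precision and recall, with a penalty for fragmented word order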
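
The Elo rating and Bradley-Terry entries above are closely related, and a small example may make the connection concrete. The sketch below fits Bradley-Terry strengths to a table of pairwise wins with a standard iterative (minorization-maximization) update, then maps each strength onto an Elo-style scale via 400 * log10(strength). It is only an illustration under stated assumptions: the function names, the base rating of 1000, and the toy win counts are hypothetical and are not taken from the source, and the source's actual fitting procedure may differ.

```python
import math

def fit_bradley_terry(wins, n_iter=200):
    """Estimate Bradley-Terry strengths from pairwise win counts.

    wins[i][j] is the number of times model i beat model j.
    Assumes every model has at least one win and one loss.
    """
    models = sorted(wins)
    strength = {m: 1.0 for m in models}
    for _ in range(n_iter):
        updated = {}
        for i in models:
            total_wins = sum(wins[i].get(j, 0) for j in models if j != i)
            # Each opponent contributes (games played) / (sum of strengths).
            denom = sum(
                (wins[i].get(j, 0) + wins[j].get(i, 0))
                / (strength[i] + strength[j])
                for j in models if j != i
            )
            # Small floor keeps the logarithm below well defined.
            updated[i] = max(total_wins / denom, 1e-12) if denom > 0 else strength[i]
        # Strengths are only defined up to scale; fix the geometric mean at 1.
        log_mean = sum(math.log(s) for s in updated.values()) / len(updated)
        strength = {m: s / math.exp(log_mean) for m, s in updated.items()}
    return strength

def win_probability(strength, a, b):
    """Bradley-Terry probability that a beats b."""
    return strength[a] / (strength[a] + strength[b])

def to_elo_scale(strength, base=1000.0):
    """Map strengths to Elo-style ratings (base of 1000 is an assumption)."""
    return {m: base + 400.0 * math.log10(s) for m, s in strength.items()}

# Hypothetical toy data: invented win counts among three placeholder models.
toy_wins = {
    "model_a": {"model_b": 7, "model_c": 8},
    "model_b": {"model_a": 3, "model_c": 6},
    "model_c": {"model_a": 2, "model_b": 4},
}
toy_strength = fit_bradley_terry(toy_wins)
toy_ratings = to_elo_scale(toy_strength)
for name in sorted(toy_ratings, key=toy_ratings.get, reverse=True):
    print(f"{name}: {toy_ratings[name]:.1f}")
```

With this mapping, a 400-point rating gap corresponds to 10:1 odds of winning a comparison, which is the usual Elo convention. Only rating differences are meaningful; the base value simply shifts every rating by the same amount.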