VLM: Vision-Language Model—a model that can process and reason about both images and text
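LPIPS: Learned Perceptual Image Patch Similarity, a metric that measures the perceptual distance between two images using deep network features (lower values mean the images look more alike)
CLIP: Contrastive Language-Image Pre-training, a model trained to embed images and text in a shared space so that matching pairs score highly, often used for semantic scoring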
Likert scale: A rating scale used in questionnaires (e.g., 1 to 4) to quantify subjective opinions
Krippendorff’s Alpha: A chance-corrected statistical measure of agreement (reliability) among annotators, where 1 indicates perfect agreement and 0 indicates agreement no better than chance
SFT: Supervised Fine-Tuning—training a model on a labeled dataset
OOD: Out-of-Distribution—data that differs significantly from the data seen during training
Instruction Following (IF): How accurately the edited image reflects the text instructions
Visual Quality (VQ): How realistic and artifact-free the edited image looks
HPSv3: Human Preference Score v3—a prior text-to-image reward model that introduced uncertainty-aware ranking