OVD: Open-Vocabulary Detection—detecting objects described by arbitrary text, including categories not seen during training.
REC: Referring Expression Comprehension—locating a specific object in an image described by a natural language expression (e.g., 'the man in the red shirt').
PG: Phrase Grounding—linking multiple phrases in a caption to their corresponding object bounding boxes.
Grounding-DINO: An open-set object detector that fuses text and image features in a Transformer-based architecture, so that it can detect objects specified by free-form text.
MMDetection: An open-source object detection toolbox based on PyTorch, part of the OpenMMLab project.
Zero-shot: The ability of a model to perform a task, such as detecting a given category, without having seen examples of that category during training.
Contrastive Embedding: A training technique in which the model pulls representations of matching image-text pairs closer together and pushes non-matching pairs apart.
Bi-Attention: A mechanism in which attention flows in both directions, text-to-image and image-to-text, so that features from the two modalities are fused.
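
The Contrastive Embedding and Bi-Attention entries above can be made concrete with a small sketch. The code below is a toy illustration in plain Python, not the implementation used by Grounding-DINO or MMDetection. It assumes unprojected features, a single attention head, a residual sum as the fusion step, and a generic symmetric InfoNCE-style loss. All function names (`cross_attend`, `bi_attention`, `contrastive_loss`) and the `temperature` default are hypothetical and chosen only for this example.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(row):
    # Subtract the max for numerical stability.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cross_attend(queries, keys_values):
    """Replace each query by a softmax-weighted average of keys_values."""
    scale = math.sqrt(len(queries[0]))
    dim = len(keys_values[0])
    out = []
    for q in queries:
        weights = softmax([dot(q, k) / scale for k in keys_values])
        out.append([sum(w * kv[d] for w, kv in zip(weights, keys_values))
                    for d in range(dim)])
    return out


def bi_attention(image_feats, text_feats):
    """Toy bi-attention: each modality attends over the other.

    Both directions read the same unfused inputs; the attended context is
    added back residually (an assumption made for this sketch).
    """
    img_ctx = cross_attend(image_feats, text_feats)  # image queries, text keys
    txt_ctx = cross_attend(text_feats, image_feats)  # text queries, image keys
    fused_img = [[a + b for a, b in zip(f, c)]
                 for f, c in zip(image_feats, img_ctx)]
    fused_txt = [[a + b for a, b in zip(f, c)]
                 for f, c in zip(text_feats, txt_ctx)]
    return fused_img, fused_txt


def contrastive_loss(image_embs, text_embs, temperature=0.1):
    """Symmetric InfoNCE-style loss; row i of each list is a matching pair."""
    img = [normalize(v) for v in image_embs]
    txt = [normalize(v) for v in text_embs]
    n = len(img)
    # Cosine similarities scaled by the temperature.
    logits = [[dot(i, t) / temperature for t in txt] for i in img]
    cols = [[logits[r][c] for r in range(n)] for c in range(n)]
    image_to_text = -sum(math.log(softmax(logits[k])[k]) for k in range(n)) / n
    text_to_image = -sum(math.log(softmax(cols[k])[k]) for k in range(n)) / n
    return (image_to_text + text_to_image) / 2


if __name__ == "__main__":
    image_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    text_feats = [[1.0, 0.0], [0.0, 1.0]]
    fused_img, fused_txt = bi_attention(image_feats, text_feats)
    print(len(fused_img), len(fused_txt))
    print(contrastive_loss(text_feats, text_feats))        # matched pairs
    print(contrastive_loss(text_feats, text_feats[::-1]))  # mismatched pairs
```

In this sketch the loss is small when each image embedding lines up with its own text embedding and large when the pairs are swapped, which is the "pull together, push apart" behavior described in the Contrastive Embedding entry. Real detectors typically add learned projections, multiple heads, and a task-specific loss on top of this basic idea.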