Codex: A large language model fine-tuned on code, capable of translating natural language instructions into executable programming code
API: Application Programming Interface—here, a set of defined Python functions (like find() or compute_depth()) that the LLM calls to use vision tools
Zero-shot: The ability of a model to perform a task without having been explicitly trained on examples of that specific task
GLIP: Grounded Language-Image Pre-training—a model used here for detecting objects specified by text (e.g., finding 'muffins')
BLIP-2: A vision-language model used here for answering simple visual questions about image patches
MiDaS: A model used for estimating depth (distance from camera) for every pixel in an image
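IoU: Intersection over Union, a metric that scores a predicted bounding box by dividing the area of its overlap with the ground-truth box by the area of their combined region (0 means no overlap, 1 means a perfect match)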
Visual Grounding: The task of locating the specific region or bounding box in an image that corresponds to a textual description
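
To make the IoU entry concrete, the following is a minimal sketch of the computation for two axis-aligned boxes. The function name box_iou and the (x_min, y_min, x_max, y_max) box layout are assumptions chosen for illustration; they are not taken from any particular library or from the system described here.

```python
def box_iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes.

    Each box is assumed to be (x_min, y_min, x_max, y_max); this layout
    is an illustrative convention, not one mandated by the glossary.
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlap rectangle, clamped at zero
    # so that disjoint boxes contribute no intersection area.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - intersection

    # Degenerate (zero-area) inputs are treated as having no overlap.
    return intersection / union if union > 0 else 0.0


if __name__ == "__main__":
    predicted = (0, 0, 2, 2)
    ground_truth = (1, 1, 3, 3)
    print(box_iou(predicted, ground_truth))
```

In visual grounding evaluations, a prediction is commonly counted as correct when its IoU with the ground-truth box exceeds a fixed threshold; the exact threshold depends on the benchmark.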