Demonstrations (demos): Few-shot examples of <input, tool-plan> pairs provided in the prompt to teach the model how to use tools
Documentation (docs): Textual descriptions of a tool's functionality, inputs, and parameters (similar to a README or man page) provided in the prompt
VisProg: Visual Programming, a framework in which LLMs generate Python-like modular programs to solve visual tasks
GroundingDINO: An open-set object detector that localizes objects described by arbitrary text
SAM: Segment Anything Model, a promptable image segmentation model capable of generating masks for any object
XMem: A video object segmentation model used for tracking objects across video frames
TF-IDF: Term Frequency-Inverse Document Frequency, a term-weighting statistic used here to retrieve the tool documentation most relevant to the input query
Zero-shot: Evaluating the model's ability to solve a task without seeing any specific examples (demonstrations) of that task in the prompt
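
As a rough illustration of the TF-IDF entry above, the following is a minimal sketch of retrieving tool documentation for a query. The documentation snippets, the function names (`tokenize`, `build_index`, `retrieve`), the smoothed IDF formula, and the cosine-similarity ranking are all assumptions made for this example, not details taken from the source.

```python
import math
import re
from collections import Counter

# Hypothetical one-line tool docs, written for this example only.
TOOL_DOCS = {
    "GroundingDINO": "open-set object detector that finds objects matching a text description",
    "SAM": "promptable image segmentation model that generates masks for objects",
    "XMem": "video object segmentation model for tracking objects across video frames",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vector(tokens, idf):
    """Map each known term to (raw term count) * (inverse document frequency)."""
    counts = Counter(tokens)
    return {term: count * idf[term] for term, count in counts.items() if term in idf}


def build_index(docs):
    """Compute IDF weights and one TF-IDF vector per tool doc."""
    tokens = {name: tokenize(text) for name, text in docs.items()}
    n_docs = len(tokens)
    # Document frequency: number of docs that contain each term.
    df = Counter(term for toks in tokens.values() for term in set(toks))
    # Smoothed IDF (an assumed, commonly used variant).
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1.0 for term, count in df.items()}
    vectors = {name: tfidf_vector(toks, idf) for name, toks in tokens.items()}
    return idf, vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, idf, vectors, k=1):
    """Return the names of the k tool docs most similar to the query."""
    q = tfidf_vector(tokenize(query), idf)
    ranked = sorted(vectors, key=lambda name: cosine(q, vectors[name]), reverse=True)
    return ranked[:k]


idf, vectors = build_index(TOOL_DOCS)
print(retrieve("track the objects across the video frames", idf, vectors))  # ['XMem']
```

In a docs-based prompt, the retrieved entries would presumably be placed in the prompt in place of, or alongside, demonstrations; how many entries to retrieve (`k`) is left open here.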