OCR: Optical Character Recognition—technology that converts different types of documents, such as scanned paper documents, PDF files or images captured by a digital camera, into editable and searchable data
ReAct: Reasoning and Acting—a paradigm where LLMs generate both reasoning traces (thoughts) and task-specific actions (tool calls) in an interleaved manner
Zero-shot: The ability of a model to perform a task without having seen any specific training examples for that task (here achieved via prompting)
Prompting: The process of structuring text input to an LLM to guide it to generate a desired output without updating model weights
Dense Captioning: A computer vision task that generates natural language descriptions for multiple specific regions of interest within an image
PaLM-E: A large embodied multimodal language model developed by Google that integrates vision and language through joint training
Regex: Regular Expression—a sequence of characters that specifies a search pattern, used here to parse tool calls from ChatGPT's text output
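
As an illustration of the ReAct and Regex entries above, here is a minimal sketch of pulling tool calls out of interleaved thought/action text with a regular expression. The `Action: tool[argument]` line format, the tool names, and the helper `parse_tool_calls` are hypothetical and chosen only for this example; the actual output format the system parses is not specified here.

```python
import re

# Assumed (hypothetical) action format: "Action: <tool_name>[<argument>]".
# Each action sits on its own line, so MULTILINE lets ^ and $ match per line.
ACTION_RE = re.compile(
    r"^Action:\s*(?P<tool>\w+)\[(?P<arg>[^\]]*)\]\s*$",
    re.MULTILINE,
)


def parse_tool_calls(text):
    """Return (tool, argument) pairs in the order they appear in the text."""
    return [(m.group("tool"), m.group("arg")) for m in ACTION_RE.finditer(text)]


# A made-up ReAct-style trace: reasoning lines interleaved with actions.
sample = """Thought: The image contains text, so I should read it.
Action: ocr[receipt.png]
Thought: I also want region-level descriptions.
Action: dense_captioning[receipt.png]"""

calls = parse_tool_calls(sample)
for tool, arg in calls:
    print(f"{tool} <- {arg}")
```

Lines that do not match the pattern, such as the "Thought:" lines, are simply skipped, so the reasoning trace can stay in the output without interfering with tool dispatch.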