CoT: Chain of Thought—a prompting technique where the model generates intermediate reasoning steps before the final answer.
LLM: Large Language Model—a deep learning model trained on vast amounts of text to generate human-like language.
Auto-CoT: Automatic Chain of Thought—a baseline method that automatically generates reasoning demonstrations using the "Let's think step by step" prompt.
PEG: Parsing Expression Grammar—a type of analytic formal grammar, used here to define the structure of the reasoning programs.
Zero-shot: The ability of a model to perform a task without seeing any specific training examples for that exact task (though ART uses related task examples).
Few-shot: Providing the model with a small number of example input-output pairs in the prompt to guide its generation.
Self-consistency: A technique where the model generates multiple reasoning paths and selects the most frequent answer to improve reliability.
MMLU: Massive Multitask Language Understanding—a benchmark designed to measure knowledge acquired during pretraining.
BigBench: A diverse, collaborative benchmark for measuring the capabilities and limitations of large language models.
Codex: An LLM fine-tuned on code, used in this paper as the engine that generates the Python code snippets.
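To make the self-consistency entry above concrete, here is a minimal sketch of the sample-then-vote loop. The stubbed sampler and the canned reasoning paths are hypothetical stand-ins for stochastic LLM calls; the voting logic itself (majority vote over final answers across sampled reasoning paths) is the technique the entry describes.

```python
from collections import Counter

def self_consistency(sample_fn, question, n_samples=5):
    """Sample several reasoning paths and return the majority-vote answer.

    sample_fn(question) stands in for one stochastic LLM call; it returns
    a (reasoning, final_answer) pair. Only the final answers are voted on.
    """
    answers = [sample_fn(question)[1] for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]

# Hypothetical stub "model": cycles through canned reasoning paths so the
# example runs without an API. One path makes an arithmetic slip ("35"),
# but the majority of paths agree on "17", so voting recovers the answer.
_paths = iter([
    ("3*4 = 12, 12+5 = 17", "17"),
    ("3*4 = 12, then +5 gives 17", "17"),
    ("3+4 = 7, 7*5 = 35", "35"),   # faulty reasoning path
    ("12+5 = 17", "17"),
    ("answer is 17", "17"),
])

print(self_consistency(lambda q: next(_paths), "What is 3*4+5?"))  # prints "17"
```

The key design point is that votes are cast on the extracted final answer, not the reasoning text, so paths that phrase the reasoning differently still pool their votes.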