Chain-of-thought prompting: Prompting a language model to generate a series of short sentences that describe intermediate reasoning steps before it produces the final answer
Greedy decoding: A decoding strategy where the model always selects the token with the highest probability at each step
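Greedy decoding can be sketched in a few lines. Here `logits_fn` is a hypothetical stand-in for a model's next-token scorer; everything else is standard argmax selection.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def greedy_decode(logits_fn, prompt_tokens, max_len, eos_id):
    # At each step, append the single highest-probability token (argmax).
    tokens = list(prompt_tokens)
    for _ in range(max_len):
        probs = softmax(logits_fn(tokens))
        next_id = max(range(len(probs)), key=probs.__getitem__)
        tokens.append(next_id)
        if next_id == eos_id:
            break
    return tokens
```

Because the argmax is deterministic, greedy decoding always yields the same single output for a given prompt, which is what self-consistency replaces with sampling.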
Self-consistency: A decoding strategy that samples multiple reasoning paths and selects the answer that appears most frequently (majority vote)
Marginalization: In this context, summing the probabilities of different reasoning paths that lead to the same final answer to find the most likely answer
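The two entries above combine into one simple procedure. A minimal sketch, assuming a hypothetical `sample_path` function that returns a `(reasoning, answer)` pair for each sampled path:

```python
import random
from collections import Counter

def self_consistency(sample_path, n_samples=10, seed=0):
    # Sample several reasoning paths and keep only each final answer.
    rng = random.Random(seed)
    answers = [sample_path(rng)[1] for _ in range(n_samples)]
    # Marginalize over reasoning paths: paths that reach the same answer
    # pool their votes, and the most frequent answer wins.
    counts = Counter(answers)
    return counts.most_common(1)[0][0]
```

Note that this unweighted majority vote treats every sampled path equally; weighting each vote by the path's probability is a variant of the same marginalization idea.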
Temperature sampling: A sampling method where the logits are scaled by a temperature parameter T; higher T increases diversity
Top-k sampling: A sampling method that restricts the model to sample only from the k most likely next tokens
Nucleus sampling: A sampling method that restricts sampling to the smallest set of tokens whose cumulative probability exceeds a threshold p
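The three sampling strategies above differ only in how they reshape the next-token distribution before drawing from it. A sketch over a toy distribution (all numbers illustrative; a real model would supply the logits):

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def temperature_probs(logits, T):
    # Scale logits by 1/T: higher T flattens the distribution (more diversity).
    return softmax([x / T for x in logits])

def top_k_filter(probs, k):
    # Keep only the k most likely tokens and renormalize; zero the rest.
    top = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:k]
    total = sum(probs[i] for i in top)
    return [probs[i] / total if i in top else 0.0 for i in range(len(probs))]

def nucleus_filter(probs, p):
    # Keep the smallest set of tokens whose cumulative probability exceeds p.
    order = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    kept, cum = set(), 0.0
    for i in order:
        kept.add(i)
        cum += probs[i]
        if cum > p:
            break
    total = sum(probs[i] for i in kept)
    return [probs[i] / total if i in kept else 0.0 for i in range(len(probs))]
```

After filtering, the next token is drawn from the renormalized distribution; the methods can also be combined (e.g., temperature scaling followed by top-k truncation).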
GSM8K: Grade School Math 8K—a dataset of grade school math word problems
SVAMP: A challenge dataset for math word problems with varying linguistic structures
AQuA: Algebra Question Answering—a dataset of algebraic word problems with multiple-choice answers
StrategyQA: A question-answering benchmark whose questions require implicit, multi-step reasoning strategies
ARC: AI2 Reasoning Challenge—a dataset of grade-school science questions
Majority vote: Selecting the answer that occurs most frequently among the generated outputs
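The vote itself is a one-liner over the collected answers. The answer strings below are toy data for illustration:

```python
from collections import Counter

# Five sampled final answers to the same question (illustrative values).
answers = ["18", "18", "26", "18", "26"]

# Majority vote: the most frequent answer becomes the prediction.
winner = Counter(answers).most_common(1)[0][0]
```

`Counter.most_common(1)` returns the single highest-count entry, so `winner` is the answer appearing most often among the generated outputs.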