self-consistency: A technique where an LLM generates multiple responses to a prompt, and the most frequent answer (majority vote) is selected as the final output
semantic entropy: A measure of uncertainty calculated by grouping semantically equivalent responses (e.g., 'Paris' and 'It is Paris') and computing entropy over these clusters rather than raw text
consortium voting: A multi-model extension of self-consistency where responses are sampled from a set of different LLMs and aggregated via majority vote
consortium entropy: A multi-model extension of semantic entropy where the uncertainty is calculated over the distribution of semantic clusters formed from responses of multiple LLMs
AUROC: Area Under the Receiver Operating Characteristic curve; a metric measuring how well a system's uncertainty score distinguishes between correct and incorrect answers (0.5 is chance level, 1.0 is perfect separation; higher is better)
AURAC: Area Under the Rejection Accuracy Curve; summarizes the accuracy on the questions the system still answers as it abstains from a growing share of the most uncertain ones (higher is better)
hallucination: A phenomenon where an LLM generates a plausible-sounding but factually incorrect response
nucleus sampling: A text generation strategy (top-p sampling) where the next token is sampled from the smallest set of highest-probability vocabulary tokens whose cumulative probability reaches at least p, with probabilities renormalized over that set
chain-of-thought: A prompting technique where the model is encouraged to generate intermediate reasoning steps before the final answer
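
The voting and entropy entries above can be illustrated with a minimal sketch. It makes several assumptions: `normalize` is a hypothetical string-matching stand-in for a real semantic-equivalence check, cluster probabilities are estimated from simple counts, and the model names and sample answers are invented for illustration. It is not the implementation of any particular paper.

```python
import math
from collections import Counter


def normalize(answer: str) -> str:
    """Hypothetical stand-in for semantic clustering.

    A real system would decide equivalence with a learned model; this
    sketch only lowercases, trims a final period, and strips two
    common lead-in phrases.
    """
    text = answer.lower().strip().rstrip(".")
    for prefix in ("it is ", "the answer is "):
        if text.startswith(prefix):
            text = text[len(prefix):]
    return text.strip()


def majority_vote(responses):
    """Self-consistency: return the most frequent normalized answer."""
    counts = Counter(normalize(r) for r in responses)
    return counts.most_common(1)[0][0]


def cluster_entropy(responses):
    """Entropy (in nats) over clusters of equivalent answers.

    Cluster probabilities are estimated as count / total, which is an
    assumption of this sketch.
    """
    counts = Counter(normalize(r) for r in responses)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())


# Single-model case: several samples from one LLM (invented data).
samples = ["Paris", "It is Paris.", "paris", "Lyon"]
print(majority_vote(samples))
print(round(cluster_entropy(samples), 4))

# Consortium case: pool samples from several LLMs, then reuse the
# same functions. Model names here are hypothetical.
by_model = {
    "model_a": ["Paris", "It is Paris."],
    "model_b": ["Lyon", "Paris"],
}
pooled = [r for answers in by_model.values() for r in answers]
print(majority_vote(pooled))
print(round(cluster_entropy(pooled), 4))
```

In this sketch the consortium variants differ from the single-model ones only in where the responses come from: the aggregation step is unchanged once the samples are pooled.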