prompt multiplicity: The phenomenon where competing prompt structures yield similar aggregate accuracy but generate conflicting individual predictions for the same question
ambiguity: The proportion of questions in a benchmark where the model outputs different choices depending on the prompt structure
self-consistency: For a specific question, the probability of getting the same output choice from two randomly chosen prompt structures
prompt-sensitive: A generation is prompt-sensitive if its self-consistency score falls below a chosen threshold, indicating that the output changes with the prompt structure and behaves close to randomly
prompt-agnostic: A generation is prompt-agnostic if its self-consistency score is above that threshold, indicating behavior that persists across prompt structures
RAG: Retrieval-Augmented Generation, an approach in which a system retrieves external documents to ground its answers
Med-HALT: A hallucination benchmark for the medical domain
TruthfulQA: A benchmark designed to measure whether language models generate falsehoods mimicking human misconceptions
predictive multiplicity: A concept from ML fairness in which models with near-equal accuracy make conflicting predictions for the same individual inputs; adapted here from competing models to competing prompt structures
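
The definitions of self-consistency, ambiguity, and the prompt-sensitive / prompt-agnostic split can be illustrated with a minimal Python sketch. Several details are assumptions made for illustration and are not stated in the definitions above: the two prompt structures are drawn without replacement, a score exactly equal to the threshold is treated as prompt-agnostic, the threshold value of 0.5 is arbitrary, and the function names and toy data are hypothetical.

```python
from collections import Counter
from typing import Dict, Sequence


def self_consistency(choices: Sequence[str]) -> float:
    """Probability that two distinct, randomly chosen prompt structures
    give the same output choice for one question.

    `choices` holds one output per prompt structure. Sampling without
    replacement is an assumption of this sketch.
    """
    k = len(choices)
    if k < 2:
        raise ValueError("need outputs from at least two prompt structures")
    counts = Counter(choices)
    # Ordered pairs of distinct prompt structures that agree on the choice.
    agreeing_pairs = sum(n * (n - 1) for n in counts.values())
    return agreeing_pairs / (k * (k - 1))


def ambiguity(outputs: Dict[str, Sequence[str]]) -> float:
    """Proportion of questions whose output choice differs across prompt structures."""
    if not outputs:
        raise ValueError("need at least one question")
    ambiguous = sum(1 for choices in outputs.values() if len(set(choices)) > 1)
    return ambiguous / len(outputs)


def label(choices: Sequence[str], threshold: float = 0.5) -> str:
    """Classify a question's generations; the threshold value is hypothetical."""
    if self_consistency(choices) < threshold:
        return "prompt-sensitive"
    return "prompt-agnostic"


# Toy data: three questions, each answered under four prompt structures.
outputs = {
    "q1": ["A", "A", "A", "A"],  # same choice under every prompt structure
    "q2": ["A", "B", "A", "C"],  # choice changes with the prompt structure
    "q3": ["B", "B", "B", "D"],  # mostly stable
}

for question, choices in outputs.items():
    print(question, round(self_consistency(choices), 3), label(choices))
print("ambiguity:", round(ambiguity(outputs), 3))
```

In this toy data, q2 and q3 both count toward ambiguity because at least one prompt structure changes the answer, yet only q2 falls below the illustrative threshold. This is the distinction the per-question self-consistency score is meant to capture, while ambiguity only summarizes the benchmark as a whole.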