Inconsistency Detection (ID): The task of determining whether a summary contains facts that are unsupported by, or that contradict, the source document
QAFactEval: A specialized non-LLM metric that checks consistency by generating questions from the summary, answering them against the document, and verifying that those answers match the summary
SummEdits: The new benchmark proposed in this paper, consisting of document-summary pairs with atomic edits labeled for consistency
Inter-Annotator Agreement (IAA): A statistical measure (like Cohen's Kappa) of how much multiple human annotators agree on labels, used here to validate benchmark quality
Atomic Edits: Small, localized changes to a text (like swapping a date or entity) rather than rewriting the whole text, used to create controlled test cases
Chain-of-Thought (CoT): A prompting strategy asking the model to generate step-by-step reasoning before the final answer
Balanced Accuracy: The arithmetic mean of sensitivity (true positive rate) and specificity (true negative rate), used to evaluate performance on imbalanced datasets
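As an illustration of the Balanced Accuracy definition above, here is a minimal Python sketch. The function name `balanced_accuracy` and the label convention (1 = consistent, 0 = inconsistent) are assumptions made for this example, not taken from the paper.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity for binary labels.

    Assumed convention (hypothetical): 1 = consistent, 0 = inconsistent.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must appear in y_true")

    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    return (sensitivity + specificity) / 2


# An imbalanced toy set: 8 consistent summaries, 2 inconsistent ones.
y_true = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
# A degenerate detector that always predicts "consistent".
y_pred = [1] * 10

plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(plain_accuracy)                     # 0.8
print(balanced_accuracy(y_true, y_pred))  # 0.5
```

The always-"consistent" detector scores 0.8 on plain accuracy but only 0.5 on balanced accuracy, which is why the latter is preferred when the two classes are unevenly represented.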