RealHall: A benchmark suite proposed in this paper, comprising four difficult datasets (COVID-QA, DROP, Open Assistant, TriviaQA) for evaluating hallucination detection methods
ChainPoll: The proposed metric, which prompts an LLM several times to reason about whether a response contains a hallucination and aggregates the resulting boolean 'yes/no' votes
Open-domain hallucination: False claims made by the LLM about the real world without reference documents (e.g., making up facts about a celebrity)
Closed-domain hallucination: Inconsistency between the LLM's generated text and a specific provided reference text (e.g., a summary contradicting the source article)
CoT: Chain-of-Thought, a prompting technique in which the model is asked to generate intermediate reasoning steps before the final answer
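AUROC: Area Under the Receiver Operating Characteristic curve, a classification metric that summarizes performance across all decision thresholds; it equals the probability that a randomly chosen positive example is scored higher than a randomly chosen negative one (0.5 is chance, 1.0 is perfect)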
RAG: Retrieval-Augmented Generation, providing external documents to an LLM to ground its answers
Pseudo-entropy: An approximation of Shannon entropy used as a baseline metric, adapted for APIs that only provide a subset of token probabilities
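
To make the ChainPoll and AUROC entries concrete, here is a minimal Python sketch. It rests on assumptions not stated in the glossary: the votes are aggregated by taking the fraction of 'yes' answers, the LLM calls are omitted and replaced by hard-coded hypothetical votes, and the helper names `chainpoll_score` and `auroc` are illustrative only.

```python
from typing import Sequence


def chainpoll_score(votes: Sequence[bool]) -> float:
    """Aggregate boolean votes (True = 'yes, hallucinated') into one score.

    Assumption: the aggregate is the fraction of 'yes' votes.
    """
    if not votes:
        raise ValueError("need at least one vote")
    return sum(votes) / len(votes)


def auroc(scores: Sequence[float], labels: Sequence[bool]) -> float:
    """Probability that a random positive outscores a random negative.

    Ties count as half a win. This pairwise form is O(n^2) and is meant
    for illustration, not for large evaluation sets.
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical votes: five polls for each of four responses.
all_votes = [
    [True, True, False, True, True],
    [False, False, False, True, False],
    [True, False, True, False, False],
    [False, False, False, False, False],
]
# Hypothetical ground truth: True means the response really is hallucinated.
labels = [True, True, False, False]

scores = [chainpoll_score(v) for v in all_votes]
result = auroc(scores, labels)
```

With these made-up inputs the scores are 0.8, 0.2, 0.4 and 0.0. Three of the four positive/negative pairs are ranked correctly, so `result` is 0.75.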