medical hallucination: Any model-generated output that is factually incorrect, logically inconsistent, or unsupported by authoritative clinical evidence in ways that could alter clinical decisions
Chain-of-Thought (CoT): A prompting strategy that encourages the model to generate intermediate reasoning steps before producing a final answer
autoregressive training: Training models to predict the next token in a sequence based on previous tokens, optimizing for likelihood rather than factual correctness
retrieval-augmented generation (RAG): A technique in which a model retrieves relevant external documents and conditions its response on them, grounding the output in that evidence
foundation models: Large-scale AI models trained on vast amounts of data that can be adapted to various downstream tasks
FDR correction: False Discovery Rate correction, a statistical method that adjusts p-values across multiple comparisons to control the expected proportion of false positives among the results declared significant
MedQA: A medical question answering benchmark built from medical licensing exam questions, often used to evaluate clinical knowledge
MedMCQA: A large-scale multiple-choice question answering dataset derived from medical entrance exams
PubMedQA: A biomedical question answering dataset collected from PubMed abstracts
Mann–Whitney U test: A non-parametric statistical test used to compare two independent groups when the dependent variable is ordinal, or continuous but not normally distributed
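
The two statistical entries above (Mann–Whitney U test and FDR correction) can be illustrated with a minimal pure-Python sketch. It assumes the Benjamini-Hochberg procedure as the FDR method, which this glossary does not specify. The function names `mann_whitney_u` and `benjamini_hochberg` are hypothetical, and all numbers are made up for illustration. In practice, library routines such as `scipy.stats.mannwhitneyu` and `statsmodels.stats.multitest.multipletests` would likely be used instead.

```python
def mann_whitney_u(x, y):
    """Return the U statistic for sample x against sample y.

    U counts the pairs (a, b) with a from x and b from y where a > b;
    each tie contributes 0.5. Only the ordering of values matters,
    which is why the test suits ordinal or non-normal data.
    """
    u = 0.0
    for a in x:
        for b in y:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order.

    Assumption: BH is one common FDR procedure; the source does not
    say which procedure it refers to.
    """
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest p.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, scaling each by m / rank and
    # keeping a running minimum so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    # Hypothetical scores for two independent groups.
    group_a = [4, 5, 6]
    group_b = [1, 2, 3]
    print("U statistic:", mann_whitney_u(group_a, group_b))

    # Hypothetical raw p-values from four separate comparisons.
    raw = [0.01, 0.04, 0.03, 0.20]
    print("BH-adjusted:", benjamini_hochberg(raw))
```

A comparison would then be called significant when its adjusted p-value falls below the chosen FDR level (for example 0.05), rather than when its raw p-value does.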