SimpleQA: A benchmark for evaluating the factual accuracy of language models, focusing on short, fact-seeking questions where the answer is a specific entity or date
SimpleWikiQA: A subset of SimpleQA created by the authors where questions are grounded in specific Wikipedia documents to test expert domain adaptation
FinanceBench: A question answering benchmark grounded on financial disclosure documents, used to test expert domain knowledge
Self-BLEU: A metric measuring diversity in generated text by calculating the BLEU score of a generated sentence against other generated sentences from the same source; lower scores indicate higher diversity
Guardrail metrics: Standard benchmarks (like NaturalQuestions or TriviaQA) used to ensure a model hasn't lost general capabilities or previous knowledge while learning new specific information
Catastrophic forgetting: The tendency of neural networks to lose previously learned information upon learning new information
Active Reading: The proposed framework where an LLM generates its own strategies (e.g., timelines, analogies) to process a document and create synthetic training data
Mid-training: A training phase between pre-training and fine-tuning, often used to inject domain-specific knowledge or align the model before the final task adaptation
RAG: Retrieval-Augmented Generation, an approach in which a system answers a question by first retrieving relevant documents and then generating a response conditioned on them
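
The Self-BLEU entry above can be made concrete with a small sketch. It is not the evaluation code behind any reported result. It is a simplified, dependency-free illustration that assumes whitespace tokenization, n-grams up to length 4, and add-one smoothing of the n-gram precisions; the function names `ngrams`, `bleu`, and `self_bleu` are hypothetical. Each generated sentence is scored against all the others as references, and the scores are averaged.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(hypothesis, references, max_n=4):
    """Simplified sentence-level BLEU (add-one smoothing is an assumption)."""
    if not hypothesis or not references:
        return 0.0
    log_precision = 0.0
    for n in range(1, max_n + 1):
        hyp_counts = ngrams(hypothesis, n)
        # Clip each n-gram count by its maximum count in any single reference.
        max_ref = Counter()
        for ref in references:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(c, max_ref[g]) for g, c in hyp_counts.items())
        total = sum(hyp_counts.values())
        # Uniform weights over n-gram orders, smoothed to avoid log(0).
        log_precision += math.log((clipped + 1) / (total + 1)) / max_n
    # Brevity penalty against the reference closest in length.
    closest = min((len(r) for r in references),
                  key=lambda length: (abs(length - len(hypothesis)), length))
    if len(hypothesis) >= closest:
        penalty = 1.0
    else:
        penalty = math.exp(1 - closest / len(hypothesis))
    return penalty * math.exp(log_precision)


def self_bleu(sentences, max_n=4):
    """Average BLEU of each sentence against all the other sentences."""
    if len(sentences) < 2:
        raise ValueError("Self-BLEU needs at least two sentences")
    tokenized = [s.lower().split() for s in sentences]
    scores = [
        bleu(hyp, tokenized[:i] + tokenized[i + 1:], max_n)
        for i, hyp in enumerate(tokenized)
    ]
    return sum(scores) / len(scores)
```

Under these assumptions, a set of identical sentences scores 1.0 (no diversity), while sentences that share no n-grams score well below that, matching the "lower is more diverse" reading in the definition.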