Continual pre-training: Taking a model already trained on general text and training it further on domain-specific data (here, medical text) to specialize it.
Minerva: A foundation Large Language Model (LLM) based on the Mistral architecture, trained from scratch on Italian and English data.
MedMCQA: A large-scale multiple-choice question answering dataset designed to simulate medical entrance exams.
MedMCQA-ITA: An Italian translation of the MedMCQA dataset, produced by the authors via neural machine translation and used for evaluation.
MMLU: Massive Multitask Language Understanding—a benchmark covering 57 subjects (STEM, humanities, etc.) to test general knowledge.
HellaSwag: A benchmark for commonsense reasoning that asks the model to pick the most plausible continuation of a sentence describing an everyday situation.
ARC: AI2 Reasoning Challenge—a dataset of grade-school science questions.
Chinchilla-optimal: Refers to a specific ratio of model size to training data size that theoretically maximizes performance for a given compute budget.
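As a rough sketch of what "Chinchilla-optimal" means in practice: the Chinchilla paper's widely cited rule of thumb is approximately 20 training tokens per model parameter. The helper name and the example model size below are illustrative, not from this document.

```python
# Hedged rule of thumb from the Chinchilla scaling-law results:
# compute-optimal training uses roughly 20 tokens per parameter.
def chinchilla_optimal_tokens(n_params: float) -> float:
    """Approximate compute-optimal token budget for a model with n_params parameters."""
    return 20 * n_params

# e.g. a hypothetical 7B-parameter model:
print(f"{chinchilla_optimal_tokens(7e9):.1e}")  # 1.4e+11, i.e. ~140B tokens
```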
bfloat16: Brain Floating Point 16—a number format that uses 16 bits but keeps the same dynamic range as 32-bit float, useful for stable ML training.
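A minimal sketch of why bfloat16 keeps float32's dynamic range: bfloat16 is essentially the top 16 bits of a float32 (same 8 exponent bits, but only 7 mantissa bits). The truncation below uses round-toward-zero for simplicity; real hardware typically rounds to nearest even.

```python
import struct

def to_bfloat16(x: float) -> float:
    """Simulate bfloat16 by zeroing the low 16 bits of the float32 encoding
    (truncation; illustrative only, hardware rounds to nearest even)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]

# Same 8 exponent bits as float32, so huge magnitudes survive
# (3e38 would overflow to infinity in float16, whose max is ~6.5e4):
print(to_bfloat16(3e38))   # still ~3e38

# But only 7 mantissa bits remain, so precision is coarse (~2-3 decimal digits):
print(to_bfloat16(1.001))  # 1.0
```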
Gradient accumulation: A technique to simulate a larger batch size by accumulating gradients over multiple steps before updating model weights.
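A toy illustration of gradient accumulation, assuming a one-parameter linear model fit by gradient descent (everything here is a made-up example, not the paper's setup): gradients from several micro-batches are summed, and the weight is updated only once per accumulation cycle, as if the full batch had been processed at once.

```python
# Toy example: fit y = 2x with a single weight w, accumulating gradients
# over micro-batches before each weight update.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
micro_batches = [data[:2], data[2:]]  # batch of 4 split into 2 micro-batches
w, lr = 0.0, 0.05

for epoch in range(50):
    grad = 0.0
    for batch in micro_batches:       # accumulate; no update yet
        for x, y in batch:
            # d/dw of mean squared error, scaled so the accumulated sum
            # equals the full-batch mean gradient
            grad += 2 * (w * x - y) * x / len(data)
    w -= lr * grad                    # one update with the full-batch gradient

print(round(w, 3))  # → 2.0
```

The memory saving comes from only ever holding one micro-batch's activations at a time, while the optimizer sees the statistics of the larger effective batch.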
Adam: Adaptive Moment Estimation—a standard optimization algorithm for training deep learning models.
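A self-contained sketch of the Adam update rule on a toy one-dimensional problem (the function and hyperparameters are illustrative): it keeps exponential moving averages of the gradient (first moment) and the squared gradient (second moment), corrects their zero-initialization bias, and scales each step by the ratio of the two.

```python
import math

def adam_minimize(grad, x0, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    """Minimal scalar Adam: moving averages of the gradient (m) and its
    square (v), with bias correction for the zero-initialized averages."""
    x, m, v = x0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(x)
        m = beta1 * m + (1 - beta1) * g        # first-moment estimate
        v = beta2 * v + (1 - beta2) * g * g    # second-moment estimate
        m_hat = m / (1 - beta1 ** t)           # bias correction
        v_hat = v / (1 - beta2 ** t)
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return x

# Minimize f(x) = (x - 3)^2, whose gradient is 2(x - 3):
x_min = adam_minimize(lambda x: 2 * (x - 3), x0=0.0)
print(x_min)  # converges near 3.0
```

The per-coordinate scaling by the second moment is what makes Adam relatively insensitive to the raw gradient magnitude, compared with plain SGD.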