Continual pre-training: Taking a model already trained on general text and training it further on domain-specific data (here, medical text) to specialize it.
Minerva: A foundation Large Language Model (LLM) based on the Mistral architecture, trained from scratch on Italian and English data.
MedMCQA: A large-scale multiple-choice question answering dataset designed to simulate medical entrance exams.
MedMCQA-ITA: An Italian translation of the MedMCQA dataset, produced by the authors via neural machine translation and used for evaluation.
MMLU: Massive Multitask Language Understanding—a benchmark covering 57 subjects (STEM, humanities, etc.) to test general knowledge.
HellaSwag: A benchmark for commonsense reasoning that asks the model to pick the most plausible continuation of a sentence describing an everyday situation.
ARC: AI2 Reasoning Challenge—a dataset of grade-school science questions.
Chinchilla-optimal: Refers to a specific ratio of model size to training data size that theoretically maximizes performance for a given compute budget.
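As a rough sketch of what "Chinchilla-optimal" means in practice: the Chinchilla paper's widely cited rule of thumb is approximately 20 training tokens per model parameter. The helper name and the example model size below are illustrative, not from this document.

```python
# Hedged rule of thumb from the Chinchilla scaling-law results:
# compute-optimal training uses roughly 20 tokens per parameter.
def chinchilla_optimal_tokens(n_params: float) -> float:
    """Approximate compute-optimal token budget for a model with n_params parameters."""
    return 20 * n_params

# e.g. a hypothetical 7B-parameter model:
print(f"{chinchilla_optimal_tokens(7e9):.1e}")  # 1.4e+11, i.e. ~140B tokens
```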
bfloat16: Brain Floating Point 16—a number format that uses 16 bits but keeps the same dynamic range as 32-bit float, useful for stable ML training.
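A minimal sketch of why bfloat16 keeps float32's dynamic range: bfloat16 is essentially the top 16 bits of a float32 (same 8 exponent bits, but only 7 mantissa bits). The truncation below uses round-toward-zero for simplicity; real hardware typically rounds to nearest even.

```python
import struct

def to_bfloat16(x: float) -> float:
    """Simulate bfloat16 by zeroing the low 16 bits of the float32 encoding
    (truncation; illustrative only, hardware rounds to nearest even)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]

# Same 8 exponent bits as float32, so huge magnitudes survive
# (3e38 would overflow to infinity in float16, whose max is ~6.5e4):
print(to_bfloat16(3e38))   # still ~3e38

# But only 7 mantissa bits remain, so precision is coarse (~2-3 decimal digits):
print(to_bfloat16(1.001))  # 1.0
```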
Gradient accumulation: A technique to simulate a larger batch size by accumulating gradients over multiple steps before updating model weights.
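A toy illustration of gradient accumulation, assuming a one-parameter linear model fit by gradient descent (everything here is a made-up example, not the paper's setup): gradients from several micro-batches are summed, and the weight is updated only once per accumulation cycle, as if the full batch had been processed at once.

```python
# Toy example: fit y = 2x with a single weight w, accumulating gradients
# over micro-batches before each weight update.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
micro_batches = [data[:2], data[2:]]  # batch of 4 split into 2 micro-batches
w, lr = 0.0, 0.05

for epoch in range(50):
    grad = 0.0
    for batch in micro_batches:       # accumulate; no update yet
        for x, y in batch:
            # d/dw of mean squared error, scaled so the accumulated sum
            # equals the full-batch mean gradient
            grad += 2 * (w * x - y) * x / len(data)
    w -= lr * grad                    # one update with the full-batch gradient

print(round(w, 3))  # → 2.0
```

The memory saving comes from only ever holding one micro-batch's activations at a time, while the optimizer sees the statistics of the larger effective batch.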
Adam: Adaptive Moment Estimation—a standard optimization algorithm for training deep learning models.
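A self-contained sketch of the Adam update rule on a toy one-dimensional problem (the function and hyperparameters are illustrative): it keeps exponential moving averages of the gradient (first moment) and the squared gradient (second moment), corrects their zero-initialization bias, and scales each step by the ratio of the two.

```python
import math

def adam_minimize(grad, x0, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    """Minimal scalar Adam: moving averages of the gradient (m) and its
    square (v), with bias correction for the zero-initialized averages."""
    x, m, v = x0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(x)
        m = beta1 * m + (1 - beta1) * g        # first-moment estimate
        v = beta2 * v + (1 - beta2) * g * g    # second-moment estimate
        m_hat = m / (1 - beta1 ** t)           # bias correction
        v_hat = v / (1 - beta2 ** t)
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return x

# Minimize f(x) = (x - 3)^2, whose gradient is 2(x - 3):
x_min = adam_minimize(lambda x: 2 * (x - 3), x0=0.0)
print(x_min)  # converges near 3.0
```

The per-coordinate scaling by the second moment is what makes Adam relatively insensitive to the raw gradient magnitude, compared with plain SGD.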