Clinical information extraction for lower-resource languages and domains with few-shot learning using pretrained language models and prompting

📝 Paper Summary

Clinical Information Extraction, Low-resource NLP (German Medical Text), Few-shot Learning

For German clinical section classification with minimal data, a general-domain PLM adapted to the medical domain via further-pretraining and prompted with PET significantly outperforms traditional sequence classifiers.

Core Problem

Extracting information from unstructured clinical text (doctor's letters) is difficult due to high annotation costs, strict privacy regulations preventing external model usage, and limited compute resources in hospitals.

Why it matters:

Clinical data is highly sensitive and must be processed on-premise, often barring the use of powerful API-based LLMs like GPT-4
German clinical NLP is a lower-resource domain compared to English, lacking the massive annotated datasets required for standard supervised learning
Medical professionals require transparent, interpretable model decisions, which black-box deep learning models often fail to provide

Concrete Example: A standard sequence classifier trained on only 20 examples fails to distinguish 'Anamnese' (history) from 'Zusammenfassung' (summary) because they share tokens like 'patient' and 'admission'. The proposed prompt-based approach with domain adaptation correctly classifies these by leveraging contextual patterns and structural knowledge.

Key Novelty

Domain-Adapted Prompting with PET (Pattern-Exploiting Training)

Combine lightweight Prompt-Based Learning (PET) with Further-Pretraining on domain-specific clinical text to maximize performance from minimal labeled examples (few-shot)
Demonstrate that starting with a general-language model and adapting it to the clinical domain works better than starting with a specialized medical model pretrained from scratch on limited data
Use Shapley values not just for explanation, but to identify and correct training data biases (e.g., specific tokens acting as false shortcuts)

Architecture

The PET (Pattern-Exploiting Training) workflow applied to clinical text

Evaluation Highlights

With only 20 training shots per class, the proposed PET approach achieves 79.1% accuracy, outperforming a traditional sequence classifier (48.6%) by +30.5 percentage points
Further-pretraining a general German model (gbert) on clinical data yields better few-shot performance than using a model pretrained on medical data from scratch (medbertde)
Using Shapley values for model selection and optimization further boosts accuracy to 84.3% in the 20-shot setting

Breakthrough Assessment

7/10

Strong pragmatic contribution for low-resource clinical NLP. Demonstrates that adapting general models is superior to specialized small models for German, and effectively integrates interpretability for model improvement.

⚙️ Technical Details

Problem Definition

Setting: Multi-class classification of paragraphs from doctor's letters into predefined sections (e.g., Diagnosis, Medication) in a few-shot setting

Inputs: A paragraph text x (potentially with surrounding context paragraphs)

Outputs: A section label y from a set of 9 classes (e.g., Anamnese, Diagnosen, Medication)

Pipeline Flow

Further-Pretraining (Domain Adaptation)
Template & Verbalizer Definition (PETAL)
Few-Shot Fine-tuning (Pattern-Exploiting Training)
Ensemble Prediction & Distillation
Final Classification
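
The "Ensemble Prediction & Distillation" step can be pictured as averaging the class distributions produced by several pattern-specific models and using the result as a soft label for the final classifier. The sketch below is a minimal illustration under assumptions: the function name, the two-class example, and the unweighted average are hypothetical and not taken from the paper.

```python
from typing import Dict, List


def ensemble_soft_labels(per_model_probs: List[Dict[str, float]]) -> Dict[str, float]:
    """Average the class distributions of several pattern-specific models.

    Assumption: every model is weighted equally.
    """
    labels = per_model_probs[0].keys()
    n = len(per_model_probs)
    return {label: sum(p[label] for p in per_model_probs) / n for label in labels}


# Hypothetical outputs of three models, each fine-tuned with a different pattern.
probs = [
    {"Anamnese": 0.6, "Zusammenfassung": 0.4},
    {"Anamnese": 0.5, "Zusammenfassung": 0.5},
    {"Anamnese": 0.7, "Zusammenfassung": 0.3},
]
soft = ensemble_soft_labels(probs)        # soft label for the distilled classifier
prediction = max(soft, key=soft.get)      # hard label, if one is needed
```

The soft distribution, rather than the hard label, is what a distilled classifier would typically be trained on.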

System Modules

Domain Adapter

Adapt general PLMs to clinical domain via Masked Language Modeling

Model or implementation: gbert-base / gbert-large
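
To make the Masked Language Modeling step concrete, here is a small sketch of how training inputs are corrupted. The 15% masking rate and the 80/10/10 split (mask, random token, unchanged) follow the original BERT recipe; the summary does not state which settings the paper used, so treat them, the function name, and the toy vocabulary as assumptions.

```python
import random
from typing import List, Optional, Tuple

MASK = "[MASK]"


def mask_tokens(tokens: List[str], vocab: List[str], rng: random.Random,
                mask_prob: float = 0.15) -> Tuple[List[str], List[Optional[str]]]:
    """Return (corrupted inputs, targets); targets are None where no loss is computed."""
    inputs: List[str] = []
    targets: List[Optional[str]] = []
    for tok in tokens:
        if rng.random() < mask_prob:
            targets.append(tok)                   # the model must recover this token
            roll = rng.random()
            if roll < 0.8:
                inputs.append(MASK)               # 80%: replace with [MASK]
            elif roll < 0.9:
                inputs.append(rng.choice(vocab))  # 10%: replace with a random token
            else:
                inputs.append(tok)                # 10%: keep the original token
        else:
            inputs.append(tok)
            targets.append(None)
    return inputs, targets


sentence = ["Der", "Patient", "wurde", "stationär", "aufgenommen"]
inputs, targets = mask_tokens(sentence, vocab=["Herz", "Befund"], rng=random.Random(0))
```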

Prompt Generator

Wrap input text in cloze-style templates

Model or implementation: N/A (Template logic)
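
A cloze-style template can be as simple as a format string with a slot for the paragraph and a slot for the mask token. The pattern text below is a hypothetical example; the summary does not list the templates actually used in the paper.

```python
MASK = "[MASK]"


def apply_pattern(paragraph: str, pattern: str = "{text} Abschnitt: {mask}") -> str:
    """Wrap a paragraph in a cloze template (the default pattern is illustrative only)."""
    return pattern.format(text=paragraph.strip(), mask=MASK)


prompt = apply_pattern(" Der Patient wurde stationär aufgenommen. ")
```

The language model is then asked to fill the mask position, and the predicted token is mapped back to a section label by the verbalizer.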

PET Learner

Fine-tune PLM to predict masked tokens representing labels

Model or implementation: gbert-base-comb / gbert-large-comb

Final Classifier

Predict final class label

Model or implementation: BERT sequence classifier head

Novel Architectural Elements

Integration of context paragraphs (previous/subsequent) directly into the PET prompting structure to resolve ambiguity in short clinical segments
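
One way to picture the contextualization is a helper that collects the neighbouring paragraphs of a letter before the prompt is built. This is a sketch under assumptions: the window size of one paragraph on each side and the function name are hypothetical, and the summary does not describe how the paper arranges the context inside the pattern.

```python
from typing import List, Tuple


def context_window(paragraphs: List[str], index: int,
                   window: int = 1) -> Tuple[List[str], str, List[str]]:
    """Return (previous paragraphs, target paragraph, following paragraphs)."""
    previous = paragraphs[max(0, index - window):index]
    following = paragraphs[index + 1:index + 1 + window]
    return previous, paragraphs[index], following


letter = ["Diagnosen:", "Koronare Herzkrankheit", "Anamnese:", "Der Patient berichtet ..."]
previous, target, following = context_window(letter, 1)
```

At the first or last paragraph of a letter the corresponding context list is simply empty.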

Modeling

Base Model: deepset/gbert-base (110M params) and deepset/gbert-large (340M params); Smanjil/German-MedBERT (medbertde)

Training Method: Pattern-Exploiting Training (PET) with Masked Language Modeling objective

Objective Functions:

Purpose: Adapt model to domain.
Formally: Masked Language Modeling (MLM) loss on clinical corpus

Purpose: Fine-tune for classification via prompting.
Formally: Cross-entropy loss on the probability of the verbalizer token at the [MASK] position
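
The two objectives can be written out as follows. The notation is an assumption introduced here for illustration, following the usual formulation of MLM and PET: $\tilde{x}$ is the corrupted input, $\mathcal{M}$ the set of masked positions, $P(x)$ the pattern applied to input $x$, $v(y)$ the verbalizer token for label $y$, and $s_\theta$ the model's logit for a token at the [MASK] position.

```latex
% Further-pretraining: masked language modeling on the clinical corpus
\mathcal{L}_{\mathrm{MLM}}(\theta) = -\sum_{i \in \mathcal{M}} \log p_\theta\left(x_i \mid \tilde{x}\right)

% Prompt-based fine-tuning: softmax restricted to the verbalizer tokens
q_\theta(y \mid x) = \frac{\exp s_\theta\left(v(y) \mid P(x)\right)}
                          {\sum_{y' \in \mathcal{Y}} \exp s_\theta\left(v(y') \mid P(x)\right)}

% Cross-entropy over the few-shot training set \mathcal{T}
\mathcal{L}_{\mathrm{CE}}(\theta) = -\sum_{(x, y) \in \mathcal{T}} \log q_\theta(y \mid x)
```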

Adaptation: Full fine-tuning of the PLM

Training Data:

CARDIO:DE corpus: 500 doctor's letters, 49,258 paragraphs
Few-shot splits: 10, 20, 50, 100, 200, 400 samples per class
Pretraining data: 179k internal cardiology letters + GGPONC oncology guidelines
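
A few-shot split with k samples per class can be drawn as in the sketch below. The function name, the fixed seed, and the fallback for classes with fewer than k examples are assumptions for illustration; the summary does not describe the paper's sampling procedure.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple

Example = Tuple[str, str]  # (paragraph text, section label)


def sample_few_shot(examples: List[Example], shots: int, seed: int = 0) -> List[Example]:
    """Draw up to `shots` examples per label, without replacement."""
    rng = random.Random(seed)
    by_label: Dict[str, List[Example]] = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))
    subset: List[Example] = []
    for label in sorted(by_label):                 # sorted for reproducibility
        pool = by_label[label]
        subset.extend(rng.sample(pool, min(shots, len(pool))))
    return subset


# Toy data: 30 paragraphs for each of two labels.
data = [(f"text {i}", label) for label in ("Anamnese", "Diagnosen") for i in range(30)]
train_20 = sample_few_shot(data, shots=20)
```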

Key Hyperparameters:

max_seq_length: 512
batch_size: Not explicitly reported in the paper
learning_rate: Not explicitly reported in the paper
training_time: Reasonable timeframe on 2 NVIDIA RTX6000 GPUs

Compute: Run on on-premise infrastructure with max 2 NVIDIA RTX6000 GPUs

Comparison to Prior Work

vs. Sequence Classifier (SC): PET uses cloze-style objectives, which bridge the gap between pretraining and fine-tuning and allow effective learning with much less data (20 vs. 400+ shots)
vs. medbertde (public): Shows that adapting a general German model (gbert) is superior to using a smaller or narrowly-pretrained medical model for this specific task
vs. GPT-3/LLMs [not cited in paper]: Focuses on small, privacy-compliant local models (BERT-base/large) rather than API-based LLMs which are restricted in clinical settings

Limitations

The 'medbertde' model did not benefit from further pretraining, possibly due to domain mismatch (oncology vs. cardiology) or limited initial pretraining data size
Analysis is limited to German language and cardiovascular domain; generalizability to other medical sub-domains is not fully tested
Requires access to unlabelled domain data for the further-pretraining step to be effective

Reproducibility

Code: https://github.com/smartschat/art

The annotated corpus (CARDIO:DE) is publicly available via heiData. The pretraining clinical data is private due to patient privacy. Code is built on the PET library. Specific hyperparameters for the final runs are not detailed in the text (referenced broadly as standard PET settings).

📊 Experiments & Results

Evaluation Setup

Multi-class section classification on the CARDIO:DE100 test set

Benchmarks:

CARDIO:DE (section classification, 9 classes)

Metrics:

Accuracy (per model)
F1-score (per class)
Statistical methodology: Approximate randomization tests (p < 0.05)
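
An approximate randomization test for the accuracy difference between two systems can be sketched as below. This is a generic paired version that randomly swaps the two systems' per-example outcomes; the function name, trial count, and toy data are assumptions, and the paper's exact procedure is not described in this summary.

```python
import random
from typing import Sequence


def approximate_randomization(correct_a: Sequence[int], correct_b: Sequence[int],
                              trials: int = 10000, seed: int = 0) -> float:
    """Estimate a two-sided p-value for the difference in number of correct predictions.

    correct_a[i] and correct_b[i] are 1 if the system classified example i correctly, else 0.
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("both systems must be evaluated on the same examples")
    rng = random.Random(seed)
    observed = abs(sum(correct_a) - sum(correct_b))
    hits = 0
    for _ in range(trials):
        diff = 0
        for a, b in zip(correct_a, correct_b):
            if rng.random() < 0.5:      # swap the two systems' outcomes for this example
                a, b = b, a
            diff += a - b
        if abs(diff) >= observed:
            hits += 1
    return (hits + 1) / (trials + 1)    # add-one smoothing keeps p strictly above zero


# Toy example: system A is right on all 40 examples, system B on half of them.
p_value = approximate_randomization([1] * 40, [1, 0] * 20, trials=2000)
```

A p-value below the chosen threshold (0.05 in the evaluation setup above) indicates that the observed difference is unlikely under random reassignment of outputs.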

Key Results

Benchmark	Metric	Baseline (%)	This Paper (%)	Δ (pp)
Core comparison of PET vs. standard Sequence Classification (SC) in few-shot settings using base models.
CARDIO:DE	Accuracy	48.6	79.1	+30.5
CARDIO:DE	Accuracy	67.6	79.1	+11.5
Impact of model size and contextualization on performance.
CARDIO:DE	Accuracy	79.1	84.3	+5.2
CARDIO:DE	Accuracy	79.1	80.5	+1.4
CARDIO:DE	Accuracy	93.4	98.6	+5.2

Main Takeaways

PET is highly effective for few-shot clinical classification, reaching acceptable performance with far fewer labeled examples than standard sequence classifiers (20 vs. 400+ shots)
Further-pretraining general domain models (gbert) on clinical text is more effective than using models pretrained solely on medical text (medbertde) if the latter are smaller or from a different medical sub-domain
Contextualization (adding surrounding paragraphs) and using larger model variants (Large vs Base) provide additive gains, especially for unstructured sections like 'Anamnese'
Shapley values revealed that models sometimes relied on spurious tokens (e.g., 'Patient') for classification; contextualization helped shift focus to meaningful section headers

📚 Prerequisite Knowledge

Prerequisites

Understanding of Masked Language Models (BERT architecture)
Familiarity with Prompt-based Learning (PET)
Basic knowledge of Shapley values for interpretability

Key Terms

PET: Pattern-Exploiting Training, a semi-supervised approach that reformulates classification as a cloze-style (fill-in-the-blank) language modeling task

PETAL: PET with Automatic Labels, a variant of PET that automatically finds the best verbalizer (token mapping) for labels, reducing manual engineering

Further-pretraining: Taking an existing pretrained language model and training it further on domain-specific unlabeled text (e.g., clinical letters) before fine-tuning

Verbalizer: A mapping function in prompting that converts a class label (e.g., 'Positive') into a token in the model's vocabulary (e.g., 'good')
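
The sketch below shows how a verbalizer turns mask-position logits into label probabilities: only the logits of the verbalizer tokens are kept and normalized with a softmax. The German verbalizer tokens and the logit values are hypothetical, chosen only to illustrate the mechanism.

```python
import math
from typing import Dict


def label_probabilities(mask_logits: Dict[str, float],
                        verbalizer: Dict[str, str]) -> Dict[str, float]:
    """Softmax over the logits of the verbalizer tokens at the [MASK] position."""
    scores = {label: mask_logits[token] for label, token in verbalizer.items()}
    top = max(scores.values())                       # subtract max for numerical stability
    exps = {label: math.exp(s - top) for label, s in scores.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}


# Hypothetical verbalizer and logits; "und" is ignored because no label maps to it.
verbalizer = {"Anamnese": "Vorgeschichte", "Diagnosen": "Diagnose"}
mask_logits = {"Vorgeschichte": 2.0, "Diagnose": 0.0, "und": 5.0}
probs = label_probabilities(mask_logits, verbalizer)
```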

Shapley values: A game-theoretic method to attribute the contribution of each input feature (token) to the final model prediction
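
For a handful of tokens, Shapley values can be computed exactly by enumerating all subsets, as in the sketch below. The additive toy value function stands in for a real model score and is purely hypothetical; exact enumeration is exponential in the number of tokens, so practical tools rely on approximations, and the summary does not say which implementation the paper used.

```python
from itertools import combinations
from math import factorial
from typing import Callable, Dict, FrozenSet


def shapley_values(n: int, value: Callable[[FrozenSet[int]], float]) -> Dict[int, float]:
    """Exact Shapley value of each of n token positions under the set function `value`."""
    phi: Dict[int, float] = {}
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Weight of a coalition of this size in the Shapley formula.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                total += weight * (value(s | {i}) - value(s))   # marginal contribution
        phi[i] = total
    return phi


# Toy "model": the class score is the sum of per-token contributions.
tokens = ["Anamnese", ":", "Patient"]
contribution = {0: 0.6, 1: 0.0, 2: 0.1}


def toy_score(present: FrozenSet[int]) -> float:
    return sum(contribution[i] for i in present)


phi = shapley_values(len(tokens), toy_score)
```

For an additive score like this one, each token's Shapley value equals its own contribution, which makes the toy case easy to check by hand.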

Cloze question: A test where a participant is asked to supply a missing word, used here as the prompt format (e.g., '... This is [MASK].')

Sequence Classifier (SC): A standard BERT-based classification approach adding a linear layer on top of the [CLS] token, used here as the baseline

gbert: A German-language BERT model pretrained on general domain text

medbertde: A German-language BERT model pretrained on medical and clinical text