Reproducible Synthetic Clinical Letters for Seizure Frequency Information Extraction

📝 Paper Summary

Clinical Information Extraction Synthetic Data Generation Privacy-Preserving NLP

A framework using GPT-5 to generate task-faithful synthetic clinical letters with structured reasoning enables open-weight models to robustly extract seizure frequency from real private text.

Core Problem

Seizure frequency is critical for epilepsy care but is documented in highly variable, unstructured private text containing complex temporal patterns (ranges, clusters) that are difficult to extract and share.

Why it matters:

Manual extraction is labor-intensive and error-prone due to implicit time anchors and diverse linguistic forms
Privacy regulations prevent sharing real clinical narratives, creating a bottleneck for training and benchmarking high-performance NLP models
Existing de-identification methods leave residual privacy risks, limiting the reproducibility of clinical extraction systems

Concrete Example: A letter might state 'clusters twice monthly, 3 per cluster' or 'seizure free for 6 months'. Standard extractors struggle to normalize these ranges and cluster patterns into a single frequency, and real examples cannot be shared for training.

Key Novelty

Synthetic-Letter Framework with Structured Reasoning Supervision

Generates full synthetic clinic letters using GPT-5, conditioned on structured label templates (rates, ranges, clusters) to ensure diverse, task-faithful linguistic patterns
Embeds 'chain-of-thought' supervision (rationales and evidence spans) into the synthetic data, allowing student models to learn the reasoning behind complex temporal normalizations

Architecture

The synthetic data generation framework: from privacy-checked base letters and structured descriptions to full synthetic letters via GPT-5

Evaluation Highlights

Models trained purely on 15,000 synthetic letters achieved 0.858 micro-F1 (Pragmatic grouping) on a held-out set of real clinical letters
Structured label targets consistently outperformed direct numeric regression for frequency extraction
A medically oriented 4B parameter model (MedGemma-4B-it) matched the performance of larger general-purpose models when trained on the synthetic corpus

Breakthrough Assessment

8/10

Demonstrates that fully synthetic data from a high-capacity teacher (GPT-5) can effectively substitute for real private data in a complex clinical extraction task, solving a major privacy bottleneck.

⚙️ Technical Details

Problem Definition

Setting: Information extraction from unstructured clinical narratives

Inputs: Neurology clinic letter text containing free-text descriptions of seizure history

Outputs: Normalized seizure frequency (seizures per month) and structured categorical labels (e.g., seizure-free, clusters)

Pipeline Flow

Input Letter → Student LLM → [Rationale + Structured Label] → Output

System Modules

Seizure Frequency Extractor

Extract seizure frequency and provide evidence

Model or implementation: Open-weight LLMs (e.g., Qwen2.5-14B, MedGemma-4B-it)

Modeling

Base Model: Qwen2.5-7B/14B, Gemma-3-4B, MedGemma-4B, Lingshu-7B, Llama-3.1-8B, Ministral-8B

Training Method: Supervised Fine-Tuning (SFT) on synthetic data

Training Data:

15,099 synthetic letters generated by GPT-5
1,481 real KCH letters (used for baseline comparison, not main synthetic approach)

Compute: Not reported in the paper

Reproducibility

The paper emphasizes the release of the synthetic dataset framework to enable reproduction without sharing real patient data. Code availability is not explicitly provided in the text. Base models are open-weight.

📊 Experiments & Results

Evaluation Setup

Extraction of seizure frequency from held-out real-world clinical letters

Benchmarks:

KCH Real Clinic Letters (Information Extraction) [New]

Metrics:

Micro-F1 (Fine-grained)
Micro-F1 (Pragmatic)
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Models trained purely on synthetic data generalize well to real world data, with medically-oriented models performing particularly well.
KCH Real Clinic Letters	Micro-F1 (Fine-grained)	0.788	0.787	-0.001
KCH Real Clinic Letters	Micro-F1 (Pragmatic)	0.847	0.858	+0.011

Main Takeaways

Models trained entirely on synthetic letters generalize effectively to real-world clinical letters, validating the task-faithful synthetic generation approach.
Structured label targets (categorical bins) consistently outperform direct numeric regression for extracting seizure frequency.
Evidence-grounded outputs (rationales + spans) facilitate clinical verification and error analysis.

📚 Prerequisite Knowledge

Prerequisites

Clinical Information Extraction
Large Language Models (LLMs)
Synthetic Data Generation
Knowledge Distillation

Key Terms

GPT-5: A high-capacity frontier language model used in this paper as a teacher to generate synthetic data and reasoning traces

chain-of-thought: A prompting technique where the model generates intermediate reasoning steps before the final answer, used here as supervision for student models

micro-F1: A performance metric aggregating precision and recall globally across all classes, giving equal weight to each instance

Pragmatic grouping: A classification scheme grouping seizure frequencies into clinically meaningful bins (e.g., seizure-free, monthly, weekly)

Fine-grained grouping: A detailed classification scheme distinguishing specific frequency counts and ranges

teacher model: A large, powerful model (here GPT-5) used to generate training data or supervision for smaller 'student' models

NHS: National Health Service (UK public healthcare system)