LLM-RG4: Flexible and Factual Radiology Report Generation across Diverse Input Contexts

📝 Paper Summary

Radiology Report Generation (RRG)
Medical Multi-modal LLMs

LLM-RG4 adapts to diverse clinical inputs (single/multi-view, with/without history) via adaptive token fusion while reducing input-agnostic hallucinations through a new dataset and token-level loss weighting.

Core Problem

Current radiology report generation models assume fixed inputs (usually single-image) and hallucinate uninferable details (like prior comparisons) when that information is missing.

Why it matters:

Clinicians adapt reports based on available data (e.g., comparing to prior exams if available), but models trained on fixed paradigms cannot flexibly adapt.
Generating uninferable information (hallucination) like 'no change from prior' when no prior exists reduces clinical trust and safety.
Existing methods to clean reports reduce information too drastically, while multi-view models fail when auxiliary inputs are missing.

Concrete Example: A model trained on standard datasets might generate 'Heart size is stable compared to prior' even when provided with only a single frontal image and no historical record, because such phrases are common in training data but factually uninferable from the single input.

Key Novelty

LLM-RG4: Adaptive Fusion and Dataset Construction

Creates a new dataset (MIMIC-RG4) using an LLM-based cyclic generation pipeline to ensure ground-truth reports perfectly match four specific input scenarios (e.g., removing comparisons if no prior history exists).
Uses an Adaptive Token Fusion module to compress visual and textual history features into a fixed token count, allowing the LLM to handle varying inputs without structural changes.
Employs a Token-Level Loss Weighting strategy that uses attribution maps to assign higher loss weights to positive or uncertain disease mentions, prioritizing clinical accuracy.
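The cyclic generation idea in the first point above can be pictured as a rewrite-then-judge loop. The following is only a minimal sketch under stated assumptions: `cyclic_rewrite`, `toy_rewrite`, `toy_check`, and `max_rounds` are hypothetical names, and the toy functions stand in for the LLM rewriter and the judging step, whose actual prompts and stopping rules are not given in this summary.

```python
from typing import Callable, Tuple


def cyclic_rewrite(
    report: str,
    rewrite: Callable[[str], str],
    passes_check: Callable[[str], bool],
    max_rounds: int = 3,  # hypothetical cap on rewrite rounds
) -> Tuple[str, bool]:
    """Rewrite a report until the checker accepts it or the rounds run out."""
    candidate = report
    for _ in range(max_rounds):
        if passes_check(candidate):
            return candidate, True
        candidate = rewrite(candidate)
    return candidate, passes_check(candidate)


# Toy stand-ins (hypothetical): a real pipeline would call an LLM to rewrite
# and a trained classifier to judge, not keyword matching.
def toy_rewrite(text: str) -> str:
    # Crude placeholder: drop sentences that mention a prior exam.
    kept = [s for s in text.split(". ") if "prior" not in s.lower()]
    return ". ".join(kept)


def toy_check(text: str) -> bool:
    # Accept only reports with no reference to a prior exam.
    return "prior" not in text.lower()


original = "Heart size is stable compared to prior. Lungs are clear."
cleaned, ok = cyclic_rewrite(original, toy_rewrite, toy_check)
```

The loop structure is the point here: a candidate is only accepted once the judge passes it, so a single weak rewrite does not end up in the ground truth.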

Architecture

The overall architecture of LLM-RG4, illustrating the processing of multi-modal inputs (frontal, lateral, history) through the modality encoder, adaptive token fusion, and the LLM decoder with token-level loss weighting.

Evaluation Highlights

Achieves state-of-the-art performance on MIMIC-RG4, outperforming baselines like X-rayGPT and Llava-Med in natural language generation metrics (e.g., +4.6% CIDEr on Single-View-No-Longitudinal).
Significantly reduces input-agnostic hallucinations; DiscBERT evaluation shows hallucination rates drop to near zero on MIMIC-RG4 compared to high rates in standard MIMIC-CXR.

Breakthrough Assessment

8/10

Strong contribution in defining a pragmatic clinical problem (diverse inputs) and solving it with both a rigorously constructed dataset and a flexible architecture. Bridges the gap between rigid model paradigms and flexible clinical practice.

⚙️ Technical Details

Problem Definition

Setting: Generate a radiology report T_c given a flexible set of inputs including frontal image I_f, optional lateral image I_l, optional prior report T_p, indication T_i, and history T_h.

Inputs: Frontal image I_f, optional lateral image I_l, optional prior report T_p, indication/history

Outputs: Current radiology report T_c (findings and impression)

Pipeline Flow

Modality Encoding (Vision/Text Encoders)
Adaptive Token Fusion (Compression & Concatenation)
LLM Decoding (Llama-3-8B)
Loss Calculation (Token-Level Weighting)

System Modules

Modality Encoder

Extract features from images and text history

Model or implementation: Visual: Frozen encoder (likely ViT based); Text: Frozen text encoder

Adaptive Token Fusion (ATF)

Compress and fuse multi-modal features into fixed-length tokens

Model or implementation: Perceiver Resampler + Linear Layers

LLM Decoder

Generate report from fused tokens and instructions

Model or implementation: Llama-3-8B-Instruct

Token-Level Loss Weighting

Adjust loss weights based on diagnostic importance

Model or implementation: CheXbert + Integrated Gradients
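To make the attribution step more concrete, here is a small sketch of Integrated Gradients on a toy logistic model. It is an illustration only: in the described system the attributions are computed on CheXbert outputs, whereas the `model`, weights `w`, bias `b`, and the all-zero baseline below are assumptions chosen so the example runs without external libraries.

```python
import math
from typing import List


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def model(x: List[float], w: List[float], b: float) -> float:
    """Toy classifier (hypothetical stand-in for a disease label head)."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def model_grad(x: List[float], w: List[float], b: float) -> List[float]:
    """Analytic gradient of the toy model with respect to its input."""
    s = model(x, w, b)
    return [s * (1.0 - s) * wi for wi in w]


def integrated_gradients(
    x: List[float], baseline: List[float], w: List[float], b: float, steps: int = 200
) -> List[float]:
    """Approximate the path integral of gradients with a midpoint Riemann sum."""
    total = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [b0 + alpha * (xi - b0) for xi, b0 in zip(x, baseline)]
        grad = model_grad(point, w, b)
        total = [t + g for t, g in zip(total, grad)]
    return [(xi - b0) * t / steps for xi, b0, t in zip(x, baseline, total)]


w, b = [2.0, -1.0, 0.0], 0.1
x, baseline = [1.0, 1.0, 1.0], [0.0, 0.0, 0.0]
attr = integrated_gradients(x, baseline, w, b)
```

A useful sanity check is the completeness property: the attributions should sum to roughly `model(x) - model(baseline)`, and a feature with zero weight should receive zero attribution.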

Novel Architectural Elements

Adaptive Token Fusion module that concatenates compressed features along the feature dimension (width) rather than token dimension (length), maintaining constant token count regardless of input modality availability.
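The width-wise concatenation described above can be sketched in a few lines. This is a simplified illustration, not the module itself: it omits the compression and linear projection steps, and the zero-filled block used for a missing modality is an assumption, since this summary does not say how absent inputs are represented. The function name `fuse_along_width` and the toy dimensions are hypothetical.

```python
from typing import List, Optional

Matrix = List[List[float]]  # shape: [num_tokens][feature_dim]


def fuse_along_width(
    streams: List[Optional[Matrix]],
    num_tokens: int,
    dims: List[int],
) -> Matrix:
    """Concatenate per-modality token matrices along the feature axis.

    A missing modality (None) is replaced by a zero block (an assumption),
    so the output always has shape [num_tokens][sum(dims)].
    """
    fused: Matrix = [[] for _ in range(num_tokens)]
    for stream, dim in zip(streams, dims):
        block = stream if stream is not None else [[0.0] * dim for _ in range(num_tokens)]
        if len(block) != num_tokens or any(len(row) != dim for row in block):
            raise ValueError("each stream must have shape [num_tokens][dim]")
        for row, extra in zip(fused, block):
            row.extend(extra)
    return fused


# Toy features: two tokens per modality after compression.
frontal = [[1.0, 2.0], [3.0, 4.0]]
lateral = [[5.0, 6.0], [7.0, 8.0]]
prior = [[9.0], [10.0]]

full = fuse_along_width([frontal, lateral, prior], num_tokens=2, dims=[2, 2, 1])
single = fuse_along_width([frontal, None, None], num_tokens=2, dims=[2, 2, 1])
```

Both `full` and `single` contain two tokens of width five, which is the property that lets the decoder see the same sequence length whether or not the lateral view and prior report are present.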

Modeling

Base Model: Llama-3-8B-Instruct for the report generation model, as listed under System Modules; the source excerpt does not state the RRG model size explicitly, so treat this as unconfirmed. Llama3-70B is used separately, as the generator for the MIMIC-RG4 dataset.

Training Method: Supervised Fine-Tuning with Token-Level Loss Weighting

Objective Functions:

Purpose: Optimize next-token prediction with emphasis on clinically relevant tokens.

Formally: Loss = - Sum (c_j * log P(t_j | T_<j, Inputs)), where c_j is the weight derived from CheXbert attributions.
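The weighted objective above can be checked numerically with a short sketch. The function names, the flag-based weighting scheme, and the value `lam=2.0` are assumptions for illustration; the summary only states that c_j comes from CheXbert attributions and does not report the weight value.

```python
import math
from typing import List, Sequence


def weighted_nll(token_probs: Sequence[float], weights: Sequence[float]) -> float:
    """Return -sum_j c_j * log p_j over the target-token probabilities."""
    if len(token_probs) != len(weights):
        raise ValueError("token_probs and weights must have equal length")
    return -sum(c * math.log(p) for p, c in zip(token_probs, weights))


def weights_from_flags(is_key_token: Sequence[bool], lam: float) -> List[float]:
    """Assign lam to flagged tokens and 1.0 elsewhere (illustrative scheme)."""
    return [lam if flag else 1.0 for flag in is_key_token]


probs = [0.9, 0.5, 0.8]        # model probability of each target token
flags = [False, True, False]   # e.g. the middle token names a positive finding
uniform = weighted_nll(probs, [1.0, 1.0, 1.0])
weighted = weighted_nll(probs, weights_from_flags(flags, lam=2.0))
```

With these toy numbers, doubling the weight on the flagged token adds exactly one extra copy of its negative log-probability to the loss, so errors on that token are penalized more than errors on the surrounding text.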

Training Data:

MIMIC-RG4 dataset generated from MIMIC-CXR
Includes 4 scenarios: Single-View No-Longitudinal, Multi-View No-Longitudinal, Single-View Longitudinal, Multi-View Longitudinal
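The four scenarios above follow from two yes/no questions about the available inputs. A minimal sketch, assuming that "multi-view" means a lateral image is present and "longitudinal" means a prior report is present; the function name `scenario_name` is hypothetical.

```python
def scenario_name(has_lateral: bool, has_prior_report: bool) -> str:
    """Map input availability to one of the four scenario labels."""
    views = "Multi-View" if has_lateral else "Single-View"
    history = "Longitudinal" if has_prior_report else "No-Longitudinal"
    return f"{views} {history}"


# Enumerate every combination of the two flags.
all_scenarios = {
    scenario_name(lateral, prior)
    for lateral in (False, True)
    for prior in (False, True)
}
```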

Key Hyperparameters:

loss_weight_lambda: λ, the loss weight applied to tokens for positive or uncertain findings (value not reported in the source excerpt)

Compute: Not reported in the paper

Comparison to Prior Work

vs. GILBERT/Nguyen et al.: These methods simply remove information, reducing report utility. LLM-RG4 adapts content based on input presence (rewriting vs removing).
vs. X-rayGPT/Llava-Med: LLM-RG4 specifically handles dynamic input combinations (longitudinal/multi-view) via adaptive fusion, whereas others typically expect fixed inputs.
vs. R2GenGPT [not cited in paper]: R2GenGPT optimizes for single-image inputs; LLM-RG4 extends this to multi-view and longitudinal integration.

Limitations

Dependency on the quality of the Llama3-70B generator for ground truth construction.
Requires pre-extracted text (prior reports) for longitudinal inputs, not raw prior images.
Complexity of the data generation pipeline (cyclic judgment/rewriting).

Reproducibility

Code: https://github.com/zh-Wang-Med/LLM-RG4

MIMIC-RG4 dataset is derived from MIMIC-CXR. DiscBERT and CheXbert are used as auxiliary tools.

📊 Experiments & Results

Evaluation Setup

Report generation on MIMIC-CXR and MIMIC-RG4 under different input scenarios.

Benchmarks:

MIMIC-CXR (Radiology Report Generation)
MIMIC-RG4 (Radiology Report Generation with Flexible Inputs) [New]

Metrics:

NLG metrics (BLEU, METEOR, ROUGE-L, CIDEr)
Clinical Efficacy (CE) metrics (Precision, Recall, F1 via CheXbert)
Input-agnostic hallucination rate (via DiscBERT)
Statistical methodology: Not explicitly reported in the paper
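As a rough illustration of how the clinical efficacy metrics listed above are typically computed, the sketch below scores binary disease labels from generated reports against labels from reference reports. The label matrices are toy data, and micro-averaging is an assumption: this summary does not say which averaging the paper uses. In practice the labels would come from running CheXbert on both sets of reports.

```python
from typing import Dict, List


def micro_prf(reference: List[List[int]], predicted: List[List[int]]) -> Dict[str, float]:
    """Micro-averaged precision, recall, and F1 over binary label matrices."""
    tp = fp = fn = 0
    for ref_row, pred_row in zip(reference, predicted):
        for r, p in zip(ref_row, pred_row):
            if p == 1 and r == 1:
                tp += 1
            elif p == 1 and r == 0:
                fp += 1
            elif p == 0 and r == 1:
                fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Rows are reports, columns are disease labels (1 = positive mention).
ref = [[1, 0, 1], [0, 0, 1]]
pred = [[1, 1, 0], [0, 0, 1]]
scores = micro_prf(ref, pred)
```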

Main Takeaways

LLM-RG4 achieves state-of-the-art performance in both clinical efficacy (CE) and natural language generation (NLG) on MIMIC-RG4 and MIMIC-CXR.
The adaptive token fusion module handles varying input combinations without increasing token count, so the LLM's input length stays constant across scenarios.
Token-level loss weighting improves clinical accuracy by prioritizing positive and uncertain findings.
The MIMIC-RG4 dataset successfully mitigates input-agnostic hallucinations compared to raw MIMIC-CXR, as verified by DiscBERT.

📚 Prerequisite Knowledge

Prerequisites

Radiology Report Generation (RRG)
Multi-modal Large Language Models (MLLM)
Attention mechanisms and Tokenization

Key Terms

MIMIC-RG4: A newly constructed dataset derived from MIMIC-CXR, specifically curated to align report content with four distinct input scenarios (e.g., removing prior comparisons if no prior report is input).

Input-agnostic hallucination: Generated text that describes information not present in the input sources, such as mentioning a prior procedure or comparison when no history is provided.

Longitudinal information: Data from a patient's previous medical examinations, specifically previous radiology reports in this context.

DiscBERT: A BERT-based discriminator trained by the authors to classify whether a report contains specific types of information (e.g., comparisons, views), used for dataset cleaning and evaluation.

CheXbert: A standardized tool for extracting disease labels from radiology reports, used here to ensure diagnostic consistency during data generation and for loss weighting.

Adaptive Token Fusion (ATF): A module that compresses features from various available modalities (images, text) into a fixed sequence length to maintain consistent input to the LLM.

Token-level Loss Weighting (TLW): A training strategy that increases the loss penalty for tokens corresponding to positive or uncertain findings, identified via Integrated Gradients on CheXbert outputs.

Integrated Gradients: An interpretability method used here to calculate attribution scores for tokens contributing to disease classifications.

Perceiver: A neural network architecture used to map high-dimensional inputs to a lower-dimensional latent space using cross-attention.
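The Perceiver idea of reading a variable-length input through a fixed set of latent queries can be sketched with a single cross-attention step. This is a stripped-down illustration under stated assumptions: it uses no learned key, value, or query projections, a single head, and hand-picked latent vectors, none of which reflect the actual resampler configuration.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(scores: List[float]) -> List[float]:
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(latents: Matrix, inputs: Matrix) -> Matrix:
    """Each latent query attends over all input tokens (scaled dot-product).

    Keys and values are the raw inputs, with no learned projections, for brevity.
    The output has one row per latent, regardless of how many inputs there are.
    """
    dim = len(latents[0])
    out: Matrix = []
    for q in latents:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in inputs]
        attn = softmax(scores)
        out.append([sum(a * v[d] for a, v in zip(attn, inputs)) for d in range(dim)])
    return out


latents = [[1.0, 0.0], [0.0, 1.0]]                   # two latent queries
short_input = [[0.5, 0.1], [0.2, 0.9], [0.3, 0.3]]   # 3 input tokens
long_input = short_input * 4                         # 12 input tokens

short_out = cross_attend(latents, short_input)
long_out = cross_attend(latents, long_input)
```

Both outputs have two rows, one per latent query, which is how a resampler of this kind turns inputs of different lengths into a fixed token count.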