Paper Title: LoV3D: Grounding Cognitive Prognosis Reasoning in Longitudinal 3D Brain MRI via Regional Volume Assessments

📝 Paper Summary

Medical Vision-Language Models (Med-VLM), 3D Brain MRI Analysis, Longitudinal Disease Progression Modeling

LoV3D trains a 3D vision-language model to produce verifiable, structured clinical reports for longitudinal brain MRI by using a clinically-weighted Verifier to drive Direct Preference Optimization without human annotations.

Core Problem

Current tools fragment diagnosis: classifiers produce labels without reasoning, volumetric tools (like FreeSurfer) give numbers without interpretation, and VLMs generate fluent but hallucinated text that is hard to verify.

Why it matters:

Clinical reports require layered reasoning (anatomical observations, longitudinal comparison, clinical context), not just a final label, to be trusted by neuroradiologists.
Existing VLMs can describe atrophy in healthy patients because they lack anatomical grounding, and no algorithm can detect these hallucinations from free text alone.
Deep learning classifiers discard anatomical specificity, while volumetric pipelines lack the reasoning capabilities to synthesize findings into a diagnosis.

Concrete Example: A generalist VLM might describe hippocampal atrophy in a patient whose hippocampus is actually normal. Because the output is free text, no algorithm can flag this error automatically. In contrast, LoV3D outputs structured JSON where the label 'normal' is cross-checked against the reasoning text and longitudinal history.

Key Novelty

Closed-loop verifiable training via structured outputs and automated DPO

Design the model output as structured JSON where fields (anatomy, diagnosis, reasoning) have explicit logical constraints checkable by code, rather than just free text.
Use a 'Clinically-Weighted Verifier' that scores generated outputs against ground-truth volumetric data (derived from FreeSurfer but never given to the model as input) to create preference pairs.
Train the reasoning process using Direct Preference Optimization (DPO) based on these automated scores, eliminating the need for human preference labeling.

Architecture

End-to-end training pipeline and inference workflow, showing the progression from Stage 0 to Stage 2.

Evaluation Highlights

93.7% three-class diagnostic accuracy (CN/MCI/Dementia) on ADNI test set, with zero non-adjacent errors (e.g., no CN identified as Dementia).
97.2% two-class accuracy (CN vs. Dementia), outperforming SOTA binary classifiers by 4.2 percentage points on the same split.
82.6% region-level anatomical classification accuracy, an improvement of 33.1 percentage points over generalist VLM baselines (RadFM, M3D-LaMed).

Breakthrough Assessment

9/10

Strong contribution that tackles VLM hallucination in longitudinal brain MRI reporting through structured verification and automated DPO. Achieves SOTA on ADNI and transfers zero-shot to external datasets (MIRIAD, AIBL) without human preference annotation.

⚙️ Technical Details

Problem Definition

Setting: Longitudinal 3D brain MRI analysis and diagnostic reasoning

Inputs: Current T1-weighted MRI scan, prior scan history (if available), demographics, APOE status, and cognitive scores (MMSE, CDR-SB)

Outputs: Structured JSON containing anatomical assessment (region-level), longitudinal comparison, clinical reasoning, diagnosis (CN/MCI/Dementia), and a natural language summary
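The output contract above could be checked with a small validator like the one below. This is a minimal sketch: the field names are assumptions chosen to mirror the five listed components, and the paper's actual JSON schema is not reproduced in this summary.

```python
import json

# Hypothetical field names; only the five components come from the summary.
REQUIRED_FIELDS = {
    "anatomical_assessment": dict,   # region name -> assessment label
    "longitudinal_comparison": str,
    "clinical_reasoning": str,
    "diagnosis": str,
    "summary": str,
}
DIAGNOSES = {"CN", "MCI", "Dementia"}


def is_valid_report(raw: str) -> bool:
    """Return True if raw parses as JSON and matches the assumed schema."""
    try:
        report = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(report, dict):
        return False
    for name, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(report.get(name), expected_type):
            return False
    return report["diagnosis"] in DIAGNOSES


# Illustrative report, not taken from the paper.
example = json.dumps({
    "anatomical_assessment": {"hippocampus": "normal", "lateral_ventricles": "normal"},
    "longitudinal_comparison": "No prior scan available.",
    "clinical_reasoning": "Regional volumes are within the normative range.",
    "diagnosis": "CN",
    "summary": "No evidence of neurodegeneration.",
})
print(is_valid_report(example))
```

A check of this kind is also what a JSON validity rate would count: the fraction of generated outputs for which the validator returns True.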

Pipeline Flow

Visual Encoder (3D CNN) → Projector → LLM (with LoRA) → Structured Output
Verifier (Training only): Output → Logic Checks & Ground Truth Comparison → DPO Signal

System Modules

Visual Encoder (Input Processing)

Extract 3D features from the MRI scan

Model or implementation: MONAI ResNet-50 (truncated after layer 3)

Projector (Input Processing)

Map visual features to LLM embedding space

Model or implementation: Two-layer MLP with GELU activation

LLM Backbone

Process multimodal inputs and generate structured clinical report

Model or implementation: Qwen-2.5-14B-Instruct with LoRA adapters

Clinically-Weighted Verifier

Score generated outputs against constraints and hidden ground truth to create DPO pairs

Model or implementation: Rule-based system with clinical weights

Novel Architectural Elements

Verifiable Output Interface: A structured JSON schema designed specifically to enable algorithmic checking of reasoning consistency and biological plausibility (e.g., neurodegeneration irreversibility constraints)
Automated Preference Loop: A pipeline topology where a rule-based Verifier generates preference pairs from model samples using hidden ground-truth data, removing human annotators from the DPO loop
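Two of the logic checks described above can be sketched as plain functions. This is an illustration under stated assumptions, not the paper's Verifier: the severity labels ('normal', 'mild', 'severe'), the region keys, and the keyword match used for the text check are all hypothetical simplifications.

```python
# Assumed ordinal scale for region-level assessments.
SEVERITY = {"normal": 0, "mild": 1, "severe": 2}


def check_irreversibility(prior: dict, current: dict) -> list:
    """Return regions whose assessed atrophy improves between visits.

    Neurodegeneration is treated as irreversible, so a region going from
    'mild' back to 'normal' is flagged as implausible.
    """
    violations = []
    for region, prior_label in prior.items():
        current_label = current.get(region)
        if current_label is None:
            continue  # region not assessed at the current visit
        if SEVERITY[current_label] < SEVERITY[prior_label]:
            violations.append(region)
    return violations


def check_label_text_consistency(assessment: dict, reasoning: str) -> list:
    """Return regions labelled 'normal' that the reasoning text calls atrophic.

    Uses a naive substring match ("<region> atrophy"); a real checker would
    need more robust text matching.
    """
    text = reasoning.lower()
    conflicts = []
    for region, label in assessment.items():
        name = region.replace("_", " ")
        if label == "normal" and f"{name} atrophy" in text:
            conflicts.append(region)
    return conflicts


print(check_irreversibility({"hippocampus": "mild"}, {"hippocampus": "normal"}))
print(check_label_text_consistency(
    {"hippocampus": "normal"}, "Marked hippocampus atrophy is seen."))
```

Because both checks operate on structured fields rather than free text alone, they can run without a human in the loop.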

Modeling

Base Model: Qwen-2.5-14B-Instruct

Training Method: Direct Preference Optimization (DPO) guided by a Clinically-Weighted Verifier

Objective Functions:

Purpose: Train the projector to map visual tokens to LLM space.
Formally: Causal LM loss (Stage 1a)

Purpose: Teach the model to produce structured JSON and clinical reasoning.
Formally: Supervised Fine-Tuning (SFT) loss on ground truth reports (Stage 1b)

Purpose: Optimize the model to prefer clinically accurate and consistent outputs.
Formally: DPO loss L_DPO = -E[log σ(β * log(π_θ(y_w|x)/π_ref(y_w|x)) - β * log(π_θ(y_l|x)/π_ref(y_l|x)))]
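The DPO objective can be evaluated for a single preference pair as follows. This is a minimal sketch in plain Python, assuming the summed token log-probabilities of the chosen (y_w) and rejected (y_l) reports under the policy and the frozen reference model are already available. The function and argument names are illustrative and not taken from the paper's code.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Per-pair DPO loss: -log sigmoid(beta * (chosen ratio - rejected ratio))."""
    # log(pi_theta(y_w|x) / pi_ref(y_w|x))
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    # log(pi_theta(y_l|x) / pi_ref(y_l|x))
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    return -log_sigmoid(margin)


# When the policy equals the reference, the margin is 0 and the loss is log(2).
print(dpo_loss(-10.0, -10.0, -10.0, -10.0))
# Raising the chosen log-probability and lowering the rejected one reduces the loss.
print(dpo_loss(-5.0, -15.0, -10.0, -10.0))
```

The expectation in the formula corresponds to averaging this per-pair value over a batch of preference pairs.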

Adaptation: LoRA (rank=16, α=32)

Trainable Parameters: Projector + LoRA adapters
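To give a sense of scale for the LoRA settings, the sketch below computes the standard LoRA scaling factor (α / rank) and the number of trainable parameters added to one weight matrix, rank × (d_in + d_out). The 5120 × 5120 matrix shape is an assumption for illustration; the summary does not state which LLM matrices receive adapters.

```python
def lora_scaling(rank: int, alpha: float) -> float:
    """Scaling applied to the low-rank update B @ A in standard LoRA."""
    return alpha / rank


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters for one adapted matrix: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)


rank, alpha = 16, 32
d = 5120  # assumed square projection size, for illustration only

added = lora_params(d, d, rank)
full = d * d
print(lora_scaling(rank, alpha))   # 2.0
print(added)                       # 163840
print(added / full)                # 0.00625
```

Under this assumption each adapted matrix trains well under 1% of the parameters of the frozen matrix it modifies.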

Training Data:

ADNI dataset: 3,993 train / 525 val / 479 test scans
Ground truth labels derived from FreeSurfer volumes using a normative Z-score model fitted on CN subjects
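The Z-score labeling step can be illustrated with a simplified normative model. This sketch assumes a plain mean and standard deviation over CN volumes for one region; the paper's normative model may adjust for covariates such as age or ICV, and the cutoffs of -1.5 and -2.5 as well as the volume values are placeholders, not the paper's thresholds or data.

```python
from statistics import mean, stdev


def fit_normative(cn_volumes):
    """Fit a per-region normative model (mean, sample std) on CN subjects."""
    return mean(cn_volumes), stdev(cn_volumes)


def z_score(volume, mu, sigma):
    """Deviation of a volume from the CN reference, in standard deviations."""
    return (volume - mu) / sigma


def atrophy_label(z, mild=-1.5, severe=-2.5):
    """Map a Z-score to a label; thresholds are hypothetical placeholders."""
    if z <= severe:
        return "severe"
    if z <= mild:
        return "mild"
    return "normal"


# Illustrative hippocampal volumes in mm^3 for five CN subjects.
cn_hippocampus = [3900, 4000, 4100, 4000, 4000]
mu, sigma = fit_normative(cn_hippocampus)

for volume in (4000, 3880, 3800):
    z = z_score(volume, mu, sigma)
    print(volume, round(z, 2), atrophy_label(z))
```

For structures that enlarge with disease, such as the ventricles, the sign of the comparison would be reversed.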

Key Hyperparameters:

beta: 0.1
temperature: 0.7 (for sampling candidates)
K: 4 (candidates per sample)
visual_token_dim: 5120
image_size: 128^3
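With K = 4 candidates sampled per input at temperature 0.7, the Verifier's scores have to be turned into chosen/rejected pairs for DPO. The sketch below shows one plausible way to do this. The check names, the weights, the best-versus-worst pairing rule, and the minimum score gap are all assumptions; the summary states only that the Verifier is clinically weighted and rule-based.

```python
# Hypothetical weights over hypothetical checks, summing to 1.
WEIGHTS = {"diagnosis_correct": 0.5, "regions_correct": 0.3, "logic_consistent": 0.2}


def weighted_score(checks: dict) -> float:
    """Combine per-check results (bools or fractions in [0, 1]) into one score."""
    return sum(WEIGHTS[name] * float(checks[name]) for name in WEIGHTS)


def build_preference_pair(prompt, candidates, scores, min_gap=0.1):
    """Pair the best- and worst-scoring candidates, or return None if too close."""
    if len(candidates) != len(scores):
        raise ValueError("candidates and scores must have the same length")
    best = max(range(len(scores)), key=scores.__getitem__)
    worst = min(range(len(scores)), key=scores.__getitem__)
    if scores[best] - scores[worst] < min_gap:
        return None  # no informative preference signal for this input
    return {"prompt": prompt, "chosen": candidates[best], "rejected": candidates[worst]}


# Four sampled candidates (K = 4) with illustrative check results.
candidates = ["report_a", "report_b", "report_c", "report_d"]
checks = [
    {"diagnosis_correct": True, "regions_correct": 0.5, "logic_consistent": True},
    {"diagnosis_correct": False, "regions_correct": 0.0, "logic_consistent": True},
    {"diagnosis_correct": True, "regions_correct": 1.0, "logic_consistent": True},
    {"diagnosis_correct": True, "regions_correct": 0.5, "logic_consistent": False},
]
scores = [weighted_score(c) for c in checks]
pair = build_preference_pair("scan_001", candidates, scores)
print(pair)
```

Skipping inputs whose candidates all score alike avoids training on pairs that carry no preference signal.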

Compute: Single A100-80GB GPU

Comparison to Prior Work

vs. RadFM/M3D-LaMed: LoV3D produces valid structured JSON and grounded reasoning (vs. 0% valid JSON and high hallucinations)
vs. ResNet-50: LoV3D integrates clinical metadata and reasoning, achieving 93.7% 3-class accuracy vs 58.9%
vs. Binary Classifiers: LoV3D performs 3-class classification (CN/MCI/Dem) and provides interpretable reasoning, not just a label
vs. LLaVA-Med [not cited in paper]: LLaVA-Med is 2D-only; LoV3D handles longitudinal 3D volumetric data natively
vs. CheXagent [not cited in paper]: LoV3D uses automated verification logic for DPO instead of human or GPT-4 feedback

Limitations

Relies on FreeSurfer for ground truth training signals, inheriting its segmentation errors/limitations.
Z-score thresholds for 'mild' vs 'severe' atrophy are heuristic, though shown to be robust.
Requires high-quality T1-weighted MRI inputs; performance on lower quality scans not explicitly analyzed.

Reproducibility

Code: https://github.com/Anonymous-TEVC/LoV-3D

Code is publicly available at the repository above. Experiments use the standard ADNI, MIRIAD, and AIBL datasets. Ground truth generation requires FreeSurfer. No proprietary models are required (the LLM is the open-weights Qwen-2.5).

📊 Experiments & Results

Evaluation Setup

Classification and report generation on longitudinal brain MRI

Benchmarks:

ADNI: 3-class diagnosis (CN/MCI/Dementia) and anatomical assessment
MIRIAD: zero-shot diagnosis transfer
AIBL: zero-shot diagnosis transfer

Metrics:

Diagnostic Accuracy (3-class and 2-class)
Macro F1
Cohen's weighted Kappa
Region-level anatomical accuracy
JSON validity rate
ROUGE-L / BLEU-4 (for summaries)
Statistical methodology: Not explicitly reported in the paper
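The classification metrics listed above can all be derived from a confusion matrix over the ordered classes CN < MCI < Dementia. The sketch below is a plain-Python illustration; it assumes linear weights for Cohen's weighted kappa, since the summary does not say whether linear or quadratic weights were used, and the labels in the example are made up.

```python
LABELS = ["CN", "MCI", "Dementia"]  # ordered by severity
INDEX = {label: i for i, label in enumerate(LABELS)}


def confusion(y_true, y_pred):
    """Rows are true classes, columns are predicted classes."""
    k = len(LABELS)
    m = [[0] * k for _ in range(k)]
    for t, p in zip(y_true, y_pred):
        m[INDEX[t]][INDEX[p]] += 1
    return m


def accuracy(m):
    total = sum(map(sum, m))
    return sum(m[i][i] for i in range(len(m))) / total


def macro_f1(m):
    """Unweighted mean of per-class F1 scores."""
    k = len(m)
    scores = []
    for i in range(k):
        tp = m[i][i]
        fp = sum(m[r][i] for r in range(k)) - tp
        fn = sum(m[i]) - tp
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / k


def weighted_kappa(m):
    """Cohen's kappa with linear disagreement weights |i - j| (assumed)."""
    k = len(m)
    n = sum(map(sum, m))
    rows = [sum(m[i]) for i in range(k)]
    cols = [sum(m[r][j] for r in range(k)) for j in range(k)]
    observed = sum(abs(i - j) * m[i][j] for i in range(k) for j in range(k))
    expected = sum(abs(i - j) * rows[i] * cols[j] / n
                   for i in range(k) for j in range(k))
    if expected == 0:
        raise ValueError("kappa is undefined when only one class occurs")
    return 1.0 - observed / expected


def non_adjacent_errors(m):
    """Count errors that skip a class, e.g. CN predicted as Dementia."""
    k = len(m)
    return sum(m[i][j] for i in range(k) for j in range(k) if abs(i - j) > 1)


y_true = ["CN", "CN", "MCI", "MCI", "Dementia", "Dementia"]
y_pred = ["CN", "MCI", "MCI", "MCI", "Dementia", "MCI"]
m = confusion(y_true, y_pred)
print(accuracy(m), macro_f1(m), weighted_kappa(m), non_adjacent_errors(m))
```

Weighted kappa penalizes a CN-to-Dementia confusion more than a CN-to-MCI one, which is why it suits an ordinal diagnosis scale.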

Key Results

LoV3D substantially outperforms baselines on ADNI diagnostic and anatomical accuracy.
Benchmark	Metric	Baseline	This Paper	Δ
ADNI	3-class Diagnostic Accuracy	58.9	93.7	+34.8
ADNI	2-class Diagnostic Accuracy (CN vs Dem)	93.0	97.2	+4.2
ADNI	Region Accuracy	49.5	82.6	+33.1
MIRIAD	Accuracy	93.6	95.4	+1.8
AIBL	Accuracy	76.4	82.9	+6.5
ADNI	False Severe Rate	4.1	2.2	-1.9

Main Takeaways

Generalist 3D VLMs (RadFM, M3D-LaMed) fail completely at structured clinical reporting (0% valid JSON), while LoV3D achieves 100% validity.
Anatomical grounding (Stage 0 pre-training) is critical: removing it drops accuracy and introduces critical CN-to-Dementia errors.
Automated DPO improves report quality (BLEU-4 up 65% in relative terms) and reduces hallucinations (False Severe Rate down 46% in relative terms) compared to SFT alone.
Zero-shot transfer to MIRIAD and AIBL confirms robustness across different scanners and populations without fine-tuning.
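The results mix absolute differences (the Δ column) with relative changes (the DPO takeaway), so it helps to compute both explicitly. The short sketch below does this for two table rows. It assumes the 4.1 baseline for False Severe Rate is the SFT-only model, which the matching 46% figure suggests but the summary does not state outright.

```python
def absolute_change(baseline: float, value: float) -> float:
    """Difference in the metric's own units (percentage points here)."""
    return value - baseline


def relative_change_pct(baseline: float, value: float) -> float:
    """Change expressed as a percentage of the baseline."""
    return 100.0 * (value - baseline) / baseline


# ADNI 3-class accuracy: 58.9 -> 93.7
print(round(absolute_change(58.9, 93.7), 1))       # 34.8
# ADNI False Severe Rate: 4.1 -> 2.2
print(round(absolute_change(4.1, 2.2), 1))         # -1.9
print(round(relative_change_pct(4.1, 2.2)))        # -46
```

A drop of 1.9 points and a drop of 46% therefore describe the same result, measured on different scales.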

📚 Prerequisite Knowledge

Prerequisites

Basics of Vision-Language Models (VLMs) and LoRA fine-tuning
Reinforcement Learning from Human Feedback (RLHF) / Direct Preference Optimization (DPO)
Alzheimer's Disease pathology (hippocampal atrophy, ventricular enlargement)
Medical imaging basics (T1-weighted MRI, FreeSurfer segmentation)

Key Terms

DPO: Direct Preference Optimization—a method to align language models to preferences by optimizing on chosen/rejected pairs without a separate reward model

FreeSurfer: A software suite for processing and analyzing human brain MRI images, used here to generate ground-truth volumetric data

LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning technique that freezes pre-trained weights and injects trainable rank decomposition matrices

ICV: Intracranial Volume—a normalization factor used to correct head size differences in brain volume measurements

CN: Cognitively Normal—a diagnostic category for individuals with no signs of cognitive impairment

MCI: Mild Cognitive Impairment—a stage between expected cognitive decline of normal aging and the more serious decline of dementia

ADNI: Alzheimer's Disease Neuroimaging Initiative—a large, multicenter longitudinal study dataset

Z-score: A statistical measurement describing a value's relationship to the mean of a group, used here to quantify volumetric deviation from normative references

Hallucination: A phenomenon where an AI model generates plausible-sounding but factually incorrect information

SFT: Supervised Fine-Tuning—training a model on a labeled dataset before applying preference optimization