QoQ-Med: Building Multimodal Clinical Foundation Models with Domain-Aware GRPO Training

📝 Paper Summary

Clinical Foundation Models, Multimodal Large Language Models (MLLMs)

QoQ-Med is a generalist clinical model that integrates 1D, 2D, and 3D medical data using a novel reinforcement learning objective (DRPO) to balance performance across rare and difficult clinical domains.

Core Problem

Existing clinical multimodal models often fail to generalize across diverse specialties because abundant data domains (like chest X-rays) dominate training while rare domains (like ECGs) are neglected. Most models also lack interpretable reasoning traces.

Why it matters:

Clinical decision-making requires integrating heterogeneous data (ECG, CT, text), but current models struggle to combine these disparate modalities
Black-box diagnostic models impede clinical adoption because healthcare professionals cannot verify the reasoning behind a diagnosis
Standard training methods like GRPO overfit to easy, abundant samples, leading to poor performance on hard, minority clinical tasks

Concrete Example: A patient record might include a 1D ECG, a 3D CT scan, and text notes. A standard MLLM might ignore the ECG due to its rarity in training data or provide a diagnosis without pointing to the specific image region (bounding box) that justifies the conclusion.

Key Novelty

Domain-aware Relative Policy Optimization (DRPO)

Applies a hierarchical scaling mechanism to reinforcement learning rewards: first clustering questions by difficulty within domains, then upweighting updates for rare domains and harder clusters
Introduces a multimodal architecture that natively integrates 1D time-series (via ECG-JEPA) with 2D/3D vision encoders and text, allowing simultaneous reasoning across all three data types
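
The hierarchical scaling in the first point above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the inverse-square-root rarity weight, and the `1 - mean reward` difficulty proxy are all assumptions made for clarity. The summary states only that rare domains and harder clusters are upweighted, and the cluster assignment (K-means in the paper) is taken as given here.

```python
from collections import Counter
from math import sqrt

def domain_weights(domains):
    """Assumed rarity weight: inverse square root of domain frequency, rescaled to mean 1."""
    counts = Counter(domains)
    raw = {d: 1.0 / sqrt(c) for d, c in counts.items()}
    mean_raw = sum(raw.values()) / len(raw)
    return {d: w / mean_raw for d, w in raw.items()}

def cluster_weight(cluster_mean_reward, floor=0.1):
    """Assumed difficulty weight: clusters with lower mean reward count as harder."""
    return max(floor, 1.0 - cluster_mean_reward)

def scale_advantage(advantage, domain, cluster_mean_reward, weights):
    """Combine the domain-level factor with the cluster-level factor."""
    return advantage * weights[domain] * cluster_weight(cluster_mean_reward)

# Toy batch: nine chest X-ray questions and one ECG question.
domains = ["xray"] * 9 + ["ecg"]
weights = domain_weights(domains)
easy_common = scale_advantage(1.0, "xray", cluster_mean_reward=0.9, weights=weights)
hard_rare = scale_advantage(1.0, "ecg", cluster_mean_reward=0.2, weights=weights)
print(easy_common < hard_rare)  # the rare, hard sample receives the larger update
```

With equal raw advantages, the ECG sample from a low-reward cluster ends up with a much larger scaled advantage than the X-ray sample from a high-reward cluster, which is the qualitative behavior the summary attributes to DRPO.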

Architecture

[Figure: The architecture of QoQ-Med and its inference flow involving multimodal inputs.]

Evaluation Highlights

DRPO training boosts diagnostic performance by 43% in macro-F1 on average across 8 clinical vision modalities compared to standard GRPO
Achieves an Intersection-over-Union (IoU) score 10x higher than open models for highlighting salient regions, matching the performance of OpenAI o4-mini
Releases a dataset of 2.61 million instruction tuning pairs with reasoning traces across 9 clinical domains

Breakthrough Assessment

9/10

First open generalist model to integrate 1D time-series with 2D/3D imaging via a novel, theoretically grounded RL method (DRPO) that effectively solves the multi-domain imbalance problem.

⚙️ Technical Details

Problem Definition

Setting: Multimodal clinical diagnosis question answering with reasoning trace generation

Inputs: Clinical sample x_i = (patchified image, multichannel time-series, text input, domain indicator)

Outputs: Unsupervised reasoning trace, bounding boxes b_i highlighting evidence, and concise diagnosis y_hat
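
One way to picture the input and output described above is as a pair of plain record types. The class names, field names, and the `(x_min, y_min, x_max, y_max)` box convention below are hypothetical and chosen only for illustration; the summary does not specify a concrete schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Assumed box convention: (x_min, y_min, x_max, y_max).
BoundingBox = Tuple[float, float, float, float]

@dataclass
class ClinicalSample:
    """One input x_i: text plus optional image patches and time-series channels."""
    text: str
    domain: str                                        # domain indicator, e.g. "ecg" or "ct"
    image_patches: Optional[List[List[float]]] = None  # patchified 2D/3D image
    time_series: Optional[List[List[float]]] = None    # one list per channel

@dataclass
class ModelOutput:
    """Reasoning trace, evidence boxes b_i, and the concise diagnosis y_hat."""
    reasoning_trace: str
    diagnosis: str
    bounding_boxes: List[BoundingBox] = field(default_factory=list)

# Made-up example values, purely illustrative.
sample = ClinicalSample(
    text="Is there evidence of arrhythmia?",
    domain="ecg",
    time_series=[[0.0, 0.1, 0.3], [0.2, 0.1, 0.0]],
)
output = ModelOutput(reasoning_trace="Irregular R-R intervals.", diagnosis="arrhythmia")
```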

Pipeline Flow

Input Processing (Encoders)
Projection & Integration
Reasoning & Generation (LLM)

System Modules

Vision Encoder (Input Processing)

Encodes 2D and 3D visual data (Chest X-ray, CT, MRI, etc.) into patch embeddings

Model or implementation: Pretrained vision-language model encoder (SigLIP-So400m)

Time-Series Encoder (Input Processing)

Encodes 1D sensor data (ECG) into token representations

Model or implementation: ECG-JEPA

Projection Layers

Maps encoder outputs to the LLM's token space

Model or implementation: Linear projections

Reasoning Core

Interleaves multimodal tokens with text to generate reasoning traces, bounding boxes, and diagnoses

Model or implementation: Qwen-2.5-7B/32B (as base LLM)

Novel Architectural Elements

Integration of a dedicated time-series encoder (ECG-JEPA) alongside a standard vision encoder within a single MLLM architecture
Unified input sequence interleaving 1D (time-series), 2D/3D (vision), and text tokens for simultaneous processing
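
The projection and interleaving steps can be sketched with toy dimensions as below. This is an assumption-laden illustration: the helper names are hypothetical, the weights are hand-picked rather than learned, and placing ECG tokens before image tokens before text is an arbitrary ordering, since the summary does not state how tokens are arranged.

```python
def linear_project(tokens, weight):
    """Map each encoder token (length d_in) into LLM space using a d_in x d_out matrix."""
    return [
        [sum(x * w for x, w in zip(tok, col)) for col in zip(*weight)]
        for tok in tokens
    ]

def build_sequence(text_emb, image_tokens, ecg_tokens, w_img, w_ecg):
    """Concatenate projected ECG and image tokens with text embeddings (ordering assumed)."""
    return linear_project(ecg_tokens, w_ecg) + linear_project(image_tokens, w_img) + text_emb

# Toy sizes: image tokens have d_in=3, ECG tokens have d_in=2, the LLM has d_out=2.
w_img = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_ecg = [[2.0, 0.0], [0.0, 2.0]]
seq = build_sequence(
    text_emb=[[0.5, 0.5]],
    image_tokens=[[1.0, 2.0, 3.0]],
    ecg_tokens=[[1.0, 1.0], [0.0, 1.0]],
    w_img=w_img,
    w_ecg=w_ecg,
)
print(len(seq))  # 4 tokens, all in the same 2-dimensional space
```

The point of the separate projection matrices is that each encoder can keep its own output width while the LLM sees a single sequence of same-sized vectors.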

Modeling

Base Model: Qwen-2.5-7B and Qwen-2.5-32B

Training Method: Domain-aware Relative Policy Optimization (DRPO)

Objective Functions:

Purpose: Maximize expected reward while keeping the policy close to the reference model.
Formally: Maximize sum of min(r_t * A_hat, clip(...) * A_hat) - beta * KL_divergence.

Purpose: Reward diagnostic accuracy.
Formally: F1 score between predicted diagnosis and ground truth labels.

Purpose: Reward semantic alignment (grounding).
Formally: Intersection-over-Union (IoU) between predicted bounding boxes and ground truth segmentation masks.

Purpose: Scale advantages to balance domain difficulty.
Formally: Scale normalized advantages by inverse temperature factors derived from domain and cluster-level difficulty.
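
The first objective follows the familiar PPO-style clipped surrogate. The sketch below assumes the standard clipping range `[1 - epsilon, 1 + epsilon]` and uses placeholder values for `epsilon` and `beta`; neither value is given in this summary, and the function names are hypothetical.

```python
def clipped_surrogate(ratio, advantage, epsilon=0.2):
    """min(r * A, clip(r, 1 - eps, 1 + eps) * A) for a single token."""
    clipped_ratio = max(1.0 - epsilon, min(1.0 + epsilon, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)

def objective(ratios, advantages, kl, beta=0.01, epsilon=0.2):
    """Sum of clipped terms minus the KL penalty; epsilon and beta are placeholders."""
    surrogate = sum(clipped_surrogate(r, a, epsilon) for r, a in zip(ratios, advantages))
    return surrogate - beta * kl

# A ratio above 1 + epsilon with a positive advantage is capped at (1 + epsilon) * A.
capped = clipped_surrogate(ratio=1.5, advantage=1.0)
# A ratio below 1 - epsilon with a negative advantage keeps the more pessimistic term.
pessimistic = clipped_surrogate(ratio=0.5, advantage=-1.0)
total = objective([1.5, 0.5], [1.0, -1.0], kl=2.0)
```

Taking the minimum means the policy gains nothing from pushing the probability ratio outside the clipping range, which limits the size of each update.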

Training Data:

CLIMB dataset: 2.61 million samples across 33 datasets
Spans 9 domains: 1D (ECG), 2D (X-ray, Mammography, Dermoscopy, Pathology, Fundus), 3D (Ultrasound, MRI, CT)

Key Hyperparameters:

reward_weights: (accuracy=0.6, IoU=0.2, auxiliary=0.2)
clustering_method: K-means on reward vectors (k determined by elbow method)
beta: KL penalty coefficient (scalar)
epsilon: Clipping constant for surrogate objective
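
As a small illustration of the `reward_weights` entry above, the composite reward can be read as a weighted sum of three components. The function name is hypothetical, each component is assumed to lie in [0, 1], and the content of the auxiliary term is not specified in this summary.

```python
REWARD_WEIGHTS = {"accuracy": 0.6, "iou": 0.2, "auxiliary": 0.2}

def composite_reward(accuracy, iou, auxiliary, weights=REWARD_WEIGHTS):
    """Weighted sum of the three reward components (each assumed to be in [0, 1])."""
    return (
        weights["accuracy"] * accuracy
        + weights["iou"] * iou
        + weights["auxiliary"] * auxiliary
    )

# Correct diagnosis, half-overlapping box, full auxiliary credit.
reward = composite_reward(accuracy=1.0, iou=0.5, auxiliary=1.0)
```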

Compute: Not explicitly reported in the paper

Comparison to Prior Work

vs. LLaVA-Med/RadLM: QoQ-Med integrates 1D time-series (ECG) and covers 9 domains, whereas others are largely vision-centric (X-ray/Pathology)
vs. GEM: QoQ-Med aggregates multiple sources (1D+2D+3D) for comprehensive diagnosis, whereas GEM focuses only on ECG
vs. OpenAI o3/DeepSeek-R1: QoQ-Med is an open-weights model specifically fine-tuned for clinical reasoning with grounding (bounding boxes), unlike these general-purpose reasoning models
vs. PPO [not cited in paper]: DRPO is critic-free and computationally more efficient, avoiding the memory overhead of a separate value network

Limitations

Dependency on the quality of the CLIMB dataset and its ground truth labels
K-means clustering adds a small O(n) computational overhead compared to vanilla GRPO
Requires pre-trained encoders (like ECG-JEPA) which must be aligned in Stage 1
Performance gains vary by modality, though average improvement is high

Reproducibility

Code: https://github.com/QoQ-Med/QoQ-Med

Highly reproducible: Model weights (7B/32B), training pipeline code, and reasoning traces for all 2.61M training pairs are released. The dataset (CLIMB) is cited as a source. Specific compute hours or GPU counts are not detailed.

📊 Experiments & Results

Evaluation Setup

Multimodal diagnostic question answering and reasoning evaluation across diverse clinical domains.

Benchmarks:

8 Clinical Vision Modalities (Diagnostic QA)
Reasoning Trace Evaluation (salient region grounding, measured by IoU)

Metrics:

Macro-F1 score (diagnosis)
Intersection over Union (IoU) (grounding/interpretability)
Statistical methodology: Not explicitly reported in the paper

Key Results

DRPO training substantially outperforms GRPO in diagnostic accuracy across diverse modalities.

Benchmark	Metric	Baseline	This Paper	Δ
Average across 8 clinical vision modalities	Macro-F1	Standard GRPO (value not reported in the paper)	Not reported in the paper	+43% (relative improvement)
Salient Region Highlighting	IoU	Open models (value not reported in the paper)	Not reported in the paper	10x higher

Main Takeaways

DRPO effectively mitigates the performance imbalance caused by skewed clinical data distributions, preventing the model from overfitting to easy/abundant domains.
QoQ-Med successfully integrates 1D ECG data with standard 2D/3D imaging, a capability missing in prior models like LLaVA-Med or Med-Flamingo.
The model achieves high interpretability by accurately bounding salient regions (high IoU), matching proprietary models like OpenAI o4-mini in this specific capability.
Hierarchical scaling in DRPO allows the model to prioritize learning from scarce and hard domains without the computational cost of a critic network.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning from Human Feedback (RLHF)
Multimodal Large Language Model architectures
Clinical data modalities (ECG, CT, MRI, X-ray)

Key Terms

GRPO: Group Relative Policy Optimization, a critic-free RL algorithm that estimates advantages by normalizing rewards within a group of outputs sampled for the same prompt
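
A minimal sketch of the group-relative normalization, assuming the population standard deviation and a small stabilizing constant (implementations differ on both details; the function name is hypothetical):

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards of outputs sampled for one prompt to zero mean and unit scale."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; a sample std is an equally plausible choice
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled answers to the same question: two correct (reward 1), two wrong (reward 0).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Because the baseline is the group mean, no separate value network is needed to estimate advantages.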

DRPO: Domain-aware Relative Policy Optimization, the authors' proposed method that extends GRPO by scaling normalized advantages according to domain rarity and question-difficulty clusters

ECG-JEPA: A pre-trained joint embedding predictive architecture specifically designed for encoding electrocardiogram (ECG) time-series data

IoU: Intersection over Union, a metric measuring the overlap between a predicted bounding box and the ground truth region, computed as the area of their intersection divided by the area of their union
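
For two axis-aligned boxes, IoU can be computed as in the sketch below. The `(x_min, y_min, x_max, y_max)` convention and the function name are assumptions; the paper's reward compares boxes against segmentation masks, which this box-to-box simplification does not cover.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    intersection = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0

# Two 2x2 boxes overlapping in a 1x1 square: intersection 1, union 7.
overlap = box_iou((0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0))
```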

Macro-F1: An F1 score (harmonic mean of precision and recall) averaged equally across classes, giving equal weight to rare and common classes
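
A short sketch of why macro-F1 penalizes neglect of rare classes, using made-up labels and a hypothetical helper (per-class F1 is computed as 2TP / (2TP + FP + FN)):

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# A classifier that always predicts the common class.
y_true = ["normal"] * 4 + ["afib"]
y_pred = ["normal"] * 5
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred)
```

Here accuracy is 0.8, but macro-F1 is only 4/9 because the missed rare class contributes an F1 of zero with the same weight as the common class.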

Chain of Thought: A reasoning process where the model generates intermediate steps or logic before producing the final answer

KL divergence: Kullback–Leibler divergence, a measure of how much one probability distribution differs from another (it is asymmetric, so not a true distance), used here to penalize the policy for deviating too far from its initial reference policy during RL training
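
For discrete distributions the definition is straightforward to compute. The sketch below uses a hypothetical helper and assumes `q[i] > 0` wherever `p[i] > 0`; RL implementations typically estimate the penalty from sampled token log-probabilities rather than full distributions.

```python
from math import log

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

reference = [0.5, 0.5]
policy = [0.9, 0.1]
forward = kl_divergence(policy, reference)   # penalty grows as the policy drifts
backward = kl_divergence(reference, policy)  # differs from forward: KL is asymmetric
```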