Evaluation Setup
Zero-shot binary classification on MIMIC-IV
Benchmarks:
- Randomly Sampled Tasks [New]: binary prediction of specific medical codes within a 30-day window
- 30-Day Readmission: complex disjunctive reasoning (any readmission event counts)
Metrics:
- AUC (Area Under the Receiver Operating Characteristic Curve)
- AUPRC (Area Under the Precision-Recall Curve)
- Statistical methodology: Wilcoxon signed-rank test for win-rate significance; 95% confidence intervals reported
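Both ranking metrics can be computed from first principles. A minimal sketch with illustrative toy data (not from the paper), using AUC's rank interpretation and the average-precision approximation of AUPRC:

```python
def auc(labels, scores):
    """AUC-ROC via its rank interpretation: the probability that a random
    positive example is scored above a random negative one (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(labels, scores):
    """AUPRC approximated as average precision: the mean of precision@k
    taken at each rank k where a positive example appears."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp, ap = 0, 0.0
    for k, i in enumerate(order, 1):
        if labels[i] == 1:
            tp += 1
            ap += tp / k
    return ap / sum(labels)

labels = [0, 0, 1, 1]           # toy ground truth
scores = [0.1, 0.4, 0.35, 0.8]  # toy model scores
print(auc(labels, scores))                # 0.75
print(average_precision(labels, scores))  # ~0.833
```

In practice these would come from a library such as scikit-learn (`roc_auc_score`, `average_precision_score`); the point here is what the numbers measure.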
Key Results
Key findings at a glance (Δ = This Paper − Baseline):

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| 39 Random Tasks (MIMIC-IV) | Win Rate | 18% | 82% | +64 pp |
| 39 Random Tasks (MIMIC-IV) | Mean AUC Improvement | Not reported | Not reported | +0.16 |
| 30-Day Readmission | AUC | 0.748 | 0.686 | −0.062 |

EveryQuery outperforms the autoregressive baseline on randomly sampled specific prediction tasks, but performs worse on the complex task requiring logical disjunction (ANY readmission event) than on specific code prediction.
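The win-rate comparison reduces to counting, per task, which model attains the higher AUC; significance is then assessed with a paired test such as Wilcoxon signed-rank. A minimal sketch with hypothetical per-task AUC pairs (invented for illustration; the paper uses 39 tasks):

```python
# Hypothetical (baseline_auc, model_auc) pairs for a handful of tasks.
pairs = [(0.62, 0.81), (0.70, 0.68), (0.55, 0.74), (0.66, 0.79), (0.59, 0.72)]

deltas = [m - b for b, m in pairs]
win_rate = sum(d > 0 for d in deltas) / len(deltas)
mean_delta = sum(deltas) / len(deltas)

print(f"win rate: {win_rate:.0%}, mean AUC delta: {mean_delta:+.3f}")
# Significance would then be assessed with scipy.stats.wilcoxon(deltas),
# which tests whether the paired differences are symmetric around zero.
```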
Main Takeaways
- Task-conditioned pretraining works: the model predicts outcomes for unseen (OOD) codes as well as for seen ones, showing that it learns to use the query embedding.
- Efficiency is transformative: a ~3000x speedup over generating 20 trajectories per query enables real-time interaction.
- Rare events are handled much better: The discriminative approach avoids the 'zero probability' issue inherent in sampling-based estimation for low-prevalence outcomes.
- Embedding analysis confirms prompt specificity: Representations cluster by query type (task) rather than by patient, indicating the model successfully reconfigures its attention based on the prompt.
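The rare-event point can be made concrete: with n sampled trajectories, a Monte Carlo risk estimate for an outcome with true prevalence p is exactly zero whenever no trajectory contains the event. A back-of-envelope check (p here is illustrative; n = 20 matches the trajectory count above):

```python
# Probability that all n sampled trajectories miss an event of prevalence p,
# i.e. the sampling-based risk estimate collapses to exactly 0.
p, n = 0.01, 20          # a 1%-prevalence outcome, 20 generated trajectories
prob_zero_estimate = (1 - p) ** n
print(f"{prob_zero_estimate:.3f}")  # ~0.818: most patients get a hard-zero risk
# A discriminative head instead outputs a calibrated probability in a single
# forward pass, so low-prevalence outcomes never collapse to exactly zero.
```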