Evaluation Setup
Zero-shot binary classification on MIMIC-IV
Benchmarks:
- Randomly Sampled Tasks [New]: binary prediction of specific medical codes within a 30-day window
- 30-Day Readmission: complex disjunctive reasoning (any readmission event counts)
Metrics:
- AUC (Area Under the Receiver Operating Characteristic Curve)
- AUPRC (Area Under the Precision-Recall Curve)
- Statistical methodology: Wilcoxon signed-rank test for win-rate significance; 95% confidence intervals reported
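Both ranking metrics can be computed from first principles. A minimal sketch with illustrative toy data (not from the paper), using AUC's rank interpretation and the average-precision approximation of AUPRC:

```python
def auc(labels, scores):
    """AUC-ROC via its rank interpretation: the probability that a random
    positive example is scored above a random negative one (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(labels, scores):
    """AUPRC approximated as average precision: the mean of precision@k
    taken at each rank k where a positive example appears."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp, ap = 0, 0.0
    for k, i in enumerate(order, 1):
        if labels[i] == 1:
            tp += 1
            ap += tp / k
    return ap / sum(labels)

labels = [0, 0, 1, 1]           # toy ground truth
scores = [0.1, 0.4, 0.35, 0.8]  # toy model scores
print(auc(labels, scores))                # 0.75
print(average_precision(labels, scores))  # ~0.833
```

In practice these would come from a library such as scikit-learn (`roc_auc_score`, `average_precision_score`); the point here is what the numbers measure.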
Key Results
Key findings at a glance (Δ = This Paper − Baseline):

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| 39 Random Tasks (MIMIC-IV) | Win Rate | 18% | 82% | +64 pp |
| 39 Random Tasks (MIMIC-IV) | Mean AUC Improvement | Not reported | Not reported | +0.16 |
| 30-Day Readmission | AUC | 0.748 | 0.686 | −0.062 |

EveryQuery outperforms the autoregressive baseline on randomly sampled specific prediction tasks, but performs worse on the complex task requiring logical disjunction (ANY readmission event) than on specific code prediction.
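The win-rate comparison reduces to counting, per task, which model attains the higher AUC; significance is then assessed with a paired test such as Wilcoxon signed-rank. A minimal sketch with hypothetical per-task AUC pairs (invented for illustration; the paper uses 39 tasks):

```python
# Hypothetical (baseline_auc, model_auc) pairs for a handful of tasks.
pairs = [(0.62, 0.81), (0.70, 0.68), (0.55, 0.74), (0.66, 0.79), (0.59, 0.72)]

deltas = [m - b for b, m in pairs]
win_rate = sum(d > 0 for d in deltas) / len(deltas)
mean_delta = sum(deltas) / len(deltas)

print(f"win rate: {win_rate:.0%}, mean AUC delta: {mean_delta:+.3f}")
# Significance would then be assessed with scipy.stats.wilcoxon(deltas),
# which tests whether the paired differences are symmetric around zero.
```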
Main Takeaways
- Task-conditioned pretraining works: the model predicts outcomes for unseen (OOD) codes as well as for seen ones, showing that it learns to use the query embedding.
- Efficiency is transformative: a ~3000x speedup over generating 20 trajectories per query enables real-time interaction.
- Rare events are handled much better: The discriminative approach avoids the 'zero probability' issue inherent in sampling-based estimation for low-prevalence outcomes.
- Embedding analysis confirms prompt specificity: Representations cluster by query type (task) rather than by patient, indicating the model successfully reconfigures its attention based on the prompt.
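The rare-event point can be made concrete: with n sampled trajectories, a Monte Carlo risk estimate for an outcome with true prevalence p is exactly zero whenever no trajectory contains the event. A back-of-envelope check (p here is illustrative; n = 20 matches the trajectory count above):

```python
# Probability that all n sampled trajectories miss an event of prevalence p,
# i.e. the sampling-based risk estimate collapses to exactly 0.
p, n = 0.01, 20          # a 1%-prevalence outcome, 20 generated trajectories
prob_zero_estimate = (1 - p) ** n
print(f"{prob_zero_estimate:.3f}")  # ~0.818: most patients get a hard-zero risk
# A discriminative head instead outputs a calibrated probability in a single
# forward pass, so low-prevalence outcomes never collapse to exactly zero.
```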