TrialMatchAI: An End-to-End AI-powered Clinical Trial Recommendation System to Streamline Patient-to-Trial Matching

📝 Paper Summary

Clinical Trial Matching, Healthcare Information Retrieval, Privacy-Preserving NLP

TrialMatchAI automates clinical trial recruitment by combining hybrid retrieval with locally deployed, fine-tuned open-source LLMs to match patients accurately while preserving data privacy.

Core Problem

Manual patient-to-trial matching is labor-intensive and error-prone, while existing AI solutions often rely on proprietary, cloud-based APIs that compromise patient privacy and lack transparency.

Why it matters:

Patient recruitment is a major bottleneck in drug development, often delaying life-saving treatments
Proprietary API-based models (like GPT-4) create barriers regarding cost, reproducibility, and regulatory compliance (GDPR/HIPAA)
Oncology matching requires complex reasoning over unstructured criteria (e.g., biomarkers, prior treatments) that keyword search misses

Concrete Example: A patient with 'metastatic non-small cell lung cancer' and an 'EGFR mutation' might be eligible for a trial recruiting 'advanced solid tumors' with specific genetic profiles. Standard keyword search misses the 'solid tumor' semantic connection, while proprietary LLMs raise privacy concerns. TrialMatchAI normalizes these entities and uses chain-of-thought reasoning to verify the genetic match locally.

Key Novelty

Privacy-First Modular RAG for Clinical Matching

Deconstructs the matching process into a local pipeline: Entity Normalization → Hybrid Retrieval → LLM Re-ranking → Chain-of-Thought Reasoning
Replaces massive proprietary models with smaller, fine-tuned open-source models (Gemma-2-2B, Phi-4) optimized for biomedical reasoning
Uses Phenopackets standardization to ingest heterogeneous patient data (structured records and unstructured notes) into a unified format

Architecture

The end-to-end workflow of TrialMatchAI, detailing the four processing levels from data ingestion to final ranking.

Evaluation Highlights

92.3% of real-world cancer patients (WIDE cohort) had at least one relevant trial retrieved within the top 20 recommendations
Achieved >90% recall on synthetic benchmarks (TREC 2021/2022) while retrieving only 3% of the total trial pool (approx. 500 documents)
Expert evaluation validated >90% accuracy in criterion-level eligibility classification using the fine-tuned reasoning model

Breakthrough Assessment

8/10

Strong contribution to privacy-preserving healthcare AI. It matches the performance of proprietary systems (like TrialGPT) using open-source models manageable in clinical settings, addressing a critical deployment barrier.

⚙️ Technical Details

Problem Definition

Setting: Rank a corpus of clinical trials T for a specific patient profile P based on eligibility criteria

Inputs: Patient data (unstructured notes + structured demographics) and Clinical Trial metadata (inclusion/exclusion criteria)

Outputs: Ranked list of eligible trials with criterion-level explanations (Met/Not Met)
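To make the output format concrete, here is a minimal sketch of how a ranked result with criterion-level verdicts might be represented. The class names, field names, and trial IDs are hypothetical and chosen for illustration; the summary does not describe the system's actual data schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CriterionVerdict:
    """One eligibility criterion with its Met / Not Met decision (hypothetical schema)."""
    criterion: str         # text of a single inclusion or exclusion criterion
    kind: str              # "inclusion" or "exclusion"
    met: bool              # True means "Met", False means "Not Met"
    explanation: str = ""  # short rationale produced by the reasoning step


@dataclass
class TrialRecommendation:
    """A candidate trial with an overall score and per-criterion verdicts."""
    trial_id: str
    score: float
    verdicts: List[CriterionVerdict] = field(default_factory=list)


def rank_trials(recommendations: List[TrialRecommendation]) -> List[TrialRecommendation]:
    """Return recommendations ordered from highest to lowest score."""
    return sorted(recommendations, key=lambda r: r.score, reverse=True)


# Illustrative data only; the IDs and scores are made up.
example = [
    TrialRecommendation("TRIAL-1", 0.42),
    TrialRecommendation(
        "TRIAL-2",
        0.91,
        [CriterionVerdict("EGFR mutation present", "inclusion", True, "Reported in patient notes")],
    ),
]
ranked = rank_trials(example)
```

Keeping the verdicts attached to each recommendation is what would let a reviewer see why a trial was ranked where it was, not just its position in the list.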

Pipeline Flow

Group: Ingestion (NER + Normalization)
Group: Retrieval (Hybrid Search)
Group: Reasoning (Re-ranking + CoT Classification)

System Modules

Entity Extractor

Extract clinical entities (diseases, genes, treatments) from patient notes and trial criteria

Model or implementation: BioBERT, RoBERTa-large, GLiNER (fine-tuned)

Hybrid Retriever

Retrieve a broad set of candidate trials using lexical and semantic matching

Model or implementation: Elasticsearch (BM25) + BGE-M3 (Vector Embeddings)
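The summary does not say how the lexical (BM25) and dense (BGE-M3) result lists are merged. One common option is reciprocal rank fusion (RRF), sketched below as an assumption rather than as the authors' method. The function name, the constant k=60, and the trial IDs are illustrative.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of document IDs into one list.

    Each document receives 1 / (k + rank) from every list it appears in,
    where rank starts at 1. Ties are broken alphabetically by ID.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical outputs of a lexical and a dense retriever for one patient.
bm25_ranking = ["NCT-A", "NCT-B", "NCT-C"]
dense_ranking = ["NCT-C", "NCT-A", "NCT-D"]

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
```

RRF works on ranks rather than raw scores, so it avoids having to calibrate BM25 scores against cosine similarities. A weighted sum of normalized scores would be an equally plausible design.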

Re-Ranker (Reasoning)

Prioritize trials based on criterion-level relevance to patient

Model or implementation: Gemma-2-2B (fine-tuned)

Eligibility Classifier (Reasoning)

Perform final inclusion/exclusion decision with explanations

Model or implementation: Phi-4 (fine-tuned with Medical Chain-of-Thought)

Novel Architectural Elements

Hierarchical filtering pipeline: Entity Normalization → Hybrid Retrieval → Lightweight Re-ranking → Heavy CoT Reasoning
Integration of Phenopackets standard for interoperable patient data ingestion within a RAG workflow
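The idea behind a hierarchical filter is that a cheap scorer trims the candidate pool before an expensive scorer runs. The sketch below shows that pattern with toy scoring functions standing in for the real retriever and LLM stages; all names and numbers are hypothetical.

```python
from typing import Callable, Dict, List

Trial = Dict[str, object]


def cascade_rank(
    candidates: List[Trial],
    cheap_score: Callable[[Trial], float],
    expensive_score: Callable[[Trial], float],
    keep: int,
) -> List[Trial]:
    """Shortlist the top `keep` candidates by the cheap score,
    then order only that shortlist by the expensive score."""
    shortlist = sorted(candidates, key=cheap_score, reverse=True)[:keep]
    return sorted(shortlist, key=expensive_score, reverse=True)


# Counter to show how often the expensive stage is actually invoked.
calls = {"expensive": 0}


def toy_cheap(trial: Trial) -> float:
    # Stand-in for a retrieval score (e.g. term overlap).
    return float(trial["overlap"])


def toy_expensive(trial: Trial) -> float:
    # Stand-in for an LLM-based criterion-level assessment.
    calls["expensive"] += 1
    return float(trial["criteria_met"])


trials = [
    {"id": "T1", "overlap": 5, "criteria_met": 2},
    {"id": "T2", "overlap": 9, "criteria_met": 7},
    {"id": "T3", "overlap": 1, "criteria_met": 9},
    {"id": "T4", "overlap": 7, "criteria_met": 8},
]

ranked = cascade_rank(trials, toy_cheap, toy_expensive, keep=2)
ranked_ids = [t["id"] for t in ranked]
```

In this toy data, T3 would have scored best under the expensive stage but never reaches it. This is why recall at the retrieval stage matters so much in a cascade: anything dropped early cannot be recovered later.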

Modeling

Base Model: Gemma-2-2B (Re-ranking) and Phi-4 (Eligibility Reasoning)

Training Method: Supervised Fine-Tuning (SFT) on biomedical datasets

Adaptation: Fine-tuning (full-parameter or LoRA; the exact method is not detailed in the snippet)

Training Data:

Synthetic datasets (TREC 2021/2022 summaries)
Custom 'Ideal Candidates' dataset generated via GPT-4o-mini

Compute: Designed for 'lightweight deployment footprint' suitable for local clinical environments (exact GPU specs not in snippet)

Comparison to Prior Work

vs. TrialGPT: TrialMatchAI runs locally on open-source models, which helps with privacy and cost, whereas TrialGPT depends on a cloud API
vs. TDMINER/h2oloo: Outperforms previous TREC winners in ranking metrics (nDCG@10) using RAG + CoT reasoning

Limitations

Susceptibility to LLM hallucinations (observed in <1% of expert-reviewed cases)
Inference speed of open-source models may lag behind optimized proprietary APIs
Performance depends on completeness of patient records (missing data handling)

Reproducibility

The system is described as fully open-source and modular. It uses standard datasets (clinicaltrials.gov, TREC) and open models (Gemma, Phi, BGE). The authors mention a code availability section, though the specific URL was not in the provided text snippet. Synthetic patient generation prompts are provided in Supplementary Materials.

📊 Experiments & Results

Evaluation Setup

Retrieval and Ranking of clinical trials for patient profiles

Benchmarks:

TREC 2021 Clinical Trials (Synthetic patient-trial matching)
TREC 2022 Clinical Trials (Synthetic patient-trial matching)
WIDE Cohort (NKI) (Real-world metastatic cancer patient matching) [New]

Metrics:

Recall (at varying retrieval sizes)
nDCG@k (Normalized Discounted Cumulative Gain)
Precision@k (P@5, P@10, P@20)
Accuracy (Expert review of criteria classification)
Statistical methodology: Not explicitly reported in the paper
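The ranking metrics listed above can be computed as in the following sketch. It assumes linear gains with a log2 discount for nDCG, which is one common formulation; the summary does not state which variant the paper uses. The "hit rate" function reflects the WIDE result as described earlier (share of patients with at least one relevant trial in the top k). All function names and the toy data are illustrative.

```python
import math
from typing import Dict, List, Set


def precision_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k results that are relevant."""
    return sum(1 for d in ranked[:k] if d in relevant) / k


def recall_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant items that appear in the top k."""
    if not relevant:
        return 0.0
    return sum(1 for d in ranked[:k] if d in relevant) / len(relevant)


def reciprocal_rank(ranked: List[str], relevant: Set[str], k: int) -> float:
    """1 / rank of the first relevant item within the top k, else 0."""
    for rank, d in enumerate(ranked[:k], start=1):
        if d in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked: List[str], gains: Dict[str, float], k: int) -> float:
    """nDCG@k with linear gain and log2(rank + 1) discount."""
    dcg = sum(gains.get(d, 0.0) / math.log2(i + 1)
              for i, d in enumerate(ranked[:k], start=1))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 1) for i, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0


def hit_rate_at_k(all_ranked: List[List[str]], all_relevant: List[Set[str]], k: int) -> float:
    """Share of patients with at least one relevant trial in their top k."""
    hits = sum(1 for r, rel in zip(all_ranked, all_relevant)
               if any(d in rel for d in r[:k]))
    return hits / len(all_ranked)


# Toy example: graded relevance labels (0 = not relevant, higher = better).
ranked = ["a", "b", "c", "d"]
gains = {"a": 2.0, "c": 1.0, "e": 2.0}
relevant = {d for d, g in gains.items() if g > 0}

p2 = precision_at_k(ranked, relevant, 2)
r4 = recall_at_k(ranked, relevant, 4)
rr = reciprocal_rank(ranked, relevant, 4)
ndcg4 = ndcg_at_k(ranked, gains, 4)
```

Mean reciprocal rank (MRR) is the average of `reciprocal_rank` over all patients.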

Key Results

Performance on TREC benchmarks demonstrates state-of-the-art ranking capabilities compared to previous challenge winners.
Benchmark	Metric	Baseline	This Paper	Δ
TREC 2021	nDCG@10	0.715	0.75	+0.035
TREC 2021	P@10	0.5760	0.77	+0.194
TREC 2022	nDCG@10	0.6125	0.75	+0.1375

Real-world evaluation on the WIDE cancer cohort confirms high recall for biomarker-driven trials.
Benchmark	Metric	Baseline	This Paper	Δ
WIDE Cohort (NKI)	Overall Recall (Top-20)	Not reported in the paper	0.9231	Not reported in the paper
WIDE Cohort (NKI)	Mean Reciprocal Rank (MRR) @ Top 20	Not reported in the paper	0.4904	Not reported in the paper

Experiment Figures

Performance metrics on synthetic TREC datasets.

Main Takeaways

Hybrid retrieval is highly effective, achieving >90% recall while searching just 3% of the trial pool (approx. 500 trials), which drastically reduces the load on the computationally expensive reasoning module.
The system excels in biomarker-driven matching, with expert review confirming 91% accuracy for molecular inclusion criteria in real cancer patients.
Performance is robust across datasets (TREC 2021 vs 2022), with consistent median nDCG scores (~0.75) indicating reliable ranking regardless of patient cohort variability.
Open-source models (Gemma/Phi), when fine-tuned, can match or exceed the performance of proprietary predecessors on clinical retrieval tasks.

📚 Prerequisite Knowledge

Prerequisites

Retrieval-Augmented Generation (RAG)
Natural Language Processing (NER, Embeddings)
Biomedical Ontologies
Clinical Trial Protocols

Key Terms

Phenopackets: A standardized open format for sharing disease and phenotype information, used here to unify diverse patient data sources

RAG: Retrieval-Augmented Generation—combining information retrieval with LLM generation to ground answers in specific data

CoT: Chain-of-Thought—a prompting technique where the model generates intermediate reasoning steps before the final answer

NER: Named Entity Recognition—identifying specific terms like diseases, genes, or drugs within unstructured text

BM25: A ranking function used in information retrieval to estimate the relevance of documents to a given search query based on term frequency

BGE-M3: A dense embedding model used to convert text into vector representations for semantic search

nDCG: Normalized Discounted Cumulative Gain—a measure of ranking quality that prioritizes relevant items appearing earlier in the list

BioSyn: A method for biomedical entity normalization, mapping text mentions to standard ontology concepts
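To make the BM25 entry above concrete, here is a minimal sketch of a textbook Okapi BM25 scorer over pre-tokenized documents. It is not a reproduction of Elasticsearch's internal scoring; the parameter values k1=1.2 and b=0.75 are commonly used defaults, and the function name and toy documents are illustrative.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query_terms: List[str], docs: List[List[str]],
                k1: float = 1.2, b: float = 0.75) -> List[float]:
    """Score each tokenized document against the query with Okapi BM25.

    Uses the non-negative IDF form log(1 + (N - df + 0.5) / (df + 0.5)).
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for d in docs for term in set(d))

    scores = []
    for doc in docs:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_terms:
            tf = term_freq[term]
            if tf == 0:
                continue  # absent terms contribute nothing
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            length_norm = 1 - b + b * len(doc) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


# Toy corpus of tokenized trial descriptions.
docs = [
    ["egfr", "mutation", "lung", "cancer"],
    ["breast", "cancer", "her2"],
    ["solid", "tumor", "egfr"],
]
scores = bm25_scores(["egfr", "lung"], docs)
```

The third document only matches on "egfr" and gets no credit for "solid tumor" being related to lung cancer, which illustrates why a purely lexical scorer is usually paired with dense embeddings for semantic matching.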