FactPICO: Factuality Evaluation for Plain Language Summarization of Medical Evidence

📝 Paper Summary

Medical Text Simplification, Factuality Evaluation

FACTPICO is an expert-annotated benchmark for evaluating the factuality of LLM-generated plain language summaries of medical evidence, specifically focusing on critical trial elements (Populations, Interventions, Comparators, Outcomes) and analyzing automated metric performance.

Core Problem

LLM-generated plain language summaries of medical evidence lack a standard evaluation benchmark, so it is unclear whether current models introduce critical factual errors in this high-stakes domain.

Why it matters:

RCT abstracts are inaccessible to laypeople, preventing patients from accessing up-to-date healthcare evidence
LLM-generated simplifications risk introducing health-critical errors if not rigorously fact-checked
Existing factuality metrics correlate poorly with expert judgments on granular medical details, rendering automated evaluation unreliable

Concrete Example: In a study about genetic risk and folate intake, GPT-4 correctly identified the intervention (personalized advice) but completely omitted the comparator (general healthy eating advice), making it impossible for a reader to understand what the intervention was tested against. This omission distorts the trial's actual findings.

Key Novelty

Fine-Grained Expert Evaluation of RCT Elements (FACTPICO)

Decomposes evaluation into specific PICO elements (Population, Intervention, Comparator, Outcome) and Evidence Inference rather than a single generic quality score
Includes expert-written natural language rationales for every rating, documenting the reasoning behind factuality judgments and addressing the correctness of added explanations
Proposes a PICO-R extraction pipeline that first isolates key medical elements before evaluating them, significantly improving LLM-based evaluator performance

Architecture

An example of the FACTPICO annotation interface and process, showing a GPT-4 generated summary and the corresponding expert evaluation of PICO elements

Evaluation Highlights

The PICO-R extraction pipeline (GPT-4) achieves the highest correlation with experts (Kendall's τb = 0.475), outperforming standard GPT-4 evaluation (0.055)
LLM simplifications frequently contain non-factual additions (hallucinations): 31.3% of GPT-4 summaries and 38.3% of Llama-2 summaries contain at least one non-factual added span
Existing automatic metrics (e.g., QuestEval, DAE) show weak to moderate correlation with human judgment (Spearman's ρ ranging from 0.219 to 0.412), failing to capture instance-level nuance

Breakthrough Assessment

8/10

Creates a critical, high-quality expert benchmark for a high-stakes domain. Reveals significant flaws in both LLM summarizers and current evaluation metrics, establishing a new standard for medical text simplification evaluation.

⚙️ Technical Details

Problem Definition

Setting: Zero-shot plain language summarization of randomized controlled trial (RCT) abstracts

Inputs: Technical medical abstract describing an RCT

Outputs: Plain language summary accessible to lay readers

Pipeline Flow

Input Abstract and Generated Summary
PICO-R Extraction (via GPT-4)
Element-wise Verification (via GPT-4)
Factuality Score Output

System Modules

PICO-R Extractor (Evaluation Pipeline)

Identify and extract text spans corresponding to Population, Intervention, Comparator, Outcome, and Results from both the source abstract and generated summary

Model or implementation: GPT-4

Factuality Scorer (Evaluation Pipeline)

Compare extracted elements to determine if the summary accurately reflects the source evidence

Model or implementation: GPT-4

Novel Architectural Elements

Two-stage evaluation architecture: Explicit extraction of clinical evidence elements (PICO-R) prior to verification, rather than end-to-end scoring
Integration of granular evidence inference assessment (results directionality) alongside entity-level PICO verification
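
The two-stage design described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `LLM` callable, the prompt wording, and the final aggregation (fraction of elements judged consistent) are all assumptions made for the example. Any function that maps a prompt string to a reply string can be plugged in.

```python
from typing import Callable, Dict

PICO_R_ELEMENTS = ("population", "intervention", "comparator", "outcome", "results")

# Hypothetical interface: any function that maps a prompt to a model reply.
LLM = Callable[[str], str]


def extract_elements(text: str, llm: LLM) -> Dict[str, str]:
    """Stage 1: ask the model for each PICO-R element of one text."""
    return {
        element: llm(
            f"Extract the {element} described in this text, or reply 'none'.\n\n{text}"
        ).strip()
        for element in PICO_R_ELEMENTS
    }


def verify_elements(
    source: Dict[str, str], summary: Dict[str, str], llm: LLM
) -> Dict[str, bool]:
    """Stage 2: ask the model, element by element, whether the summary matches the source."""
    verdicts = {}
    for element in PICO_R_ELEMENTS:
        prompt = (
            f"Source {element}: {source[element]}\n"
            f"Summary {element}: {summary[element]}\n"
            "Is the summary consistent with the source? Answer yes or no."
        )
        verdicts[element] = llm(prompt).strip().lower().startswith("yes")
    return verdicts


def factuality_score(abstract: str, summary: str, llm: LLM) -> float:
    """Assumed aggregation: share of PICO-R elements judged consistent."""
    source_elements = extract_elements(abstract, llm)
    summary_elements = extract_elements(summary, llm)
    verdicts = verify_elements(source_elements, summary_elements, llm)
    return sum(verdicts.values()) / len(verdicts)
```

Because extraction happens before verification, each stage-2 prompt compares two short spans rather than two full documents, which is the separation the section credits for the improved agreement with experts.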

Modeling

Base Model: GPT-4, Llama-2-Chat (7B), Alpaca (7B), Mistral (7B-Instruct-v0.1) used for meta-evaluation

Reproducibility

Code: https://github.com/lilywchen/FactPICO

📊 Experiments & Results

Evaluation Setup

Meta-evaluation of automatic metrics against expert human judgments on plain language summaries

Benchmarks:

FACTPICO (Factuality evaluation of medical summarization) [New]

Metrics:

Kendall's τb (correlation)
Spearman's ρ (correlation)
Pairwise Accuracy
ROUGE-L
Flesch-Kincaid Grade Level

Statistical methodology: Randolph's kappa for inter-annotator agreement
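
For readers unfamiliar with Randolph's kappa (the free-marginal multirater kappa), the sketch below shows the standard formula: observed agreement is computed as in Fleiss' kappa, but chance agreement is fixed at 1/k for k categories. The input layout (one list of labels per item, same number of raters for every item) is an assumption of this example, not something the summary specifies.

```python
from collections import Counter
from typing import Hashable, Sequence


def randolph_kappa(ratings: Sequence[Sequence[Hashable]], num_categories: int) -> float:
    """Free-marginal multirater kappa.

    ratings[i] holds the labels that every rater assigned to item i.
    """
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")

    total_agreement = 0.0
    for item in ratings:
        counts = Counter(item)
        # Share of rater pairs that agree on this item.
        agreeing = sum(c * c for c in counts.values()) - n_raters
        total_agreement += agreeing / (n_raters * (n_raters - 1))

    observed = total_agreement / len(ratings)
    expected = 1.0 / num_categories  # chance agreement under free marginals
    return (observed - expected) / (1.0 - expected)
```

A value of 1 means perfect agreement and 0 means agreement no better than uniform chance.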

Key Results

Correlation analysis shows the PICO-R extraction pipeline outperforms standard metrics and direct LLM prompting in aligning with human judgments.
Benchmark	Metric	Baseline	This Paper	Δ
FACTPICO	Kendall's τb	0.290	0.474	+0.184
FACTPICO	Spearman's ρ	0.080	0.633	+0.553
FACTPICO	Kendall's τb	0.075	0.372	+0.297

Analysis of generated summaries reveals a trade-off between simplicity and factuality, with significant hallucination rates.
Benchmark	Metric	Baseline	This Paper	Δ
FACTPICO	% summaries with non-factual additions	7.0	31.3	+24.3
FACTPICO	ROUGE-L	0.479	0.146	-0.333

Experiment Figures

Scatter plots correlating automated metric scores (QuestEval, GPT-4 direct, Extraction Pipeline) against human ratings

Main Takeaways

Plain language summarization involves a trade-off: models that simplify more (GPT-4, Llama-2) introduce significantly more factual errors than conservative models (Alpaca)
Standard factuality metrics (QuestEval, AlignScore) correlate poorly with expert judgments on fine-grained medical elements
Decomposing evaluation into PICO extraction followed by verification (PICO-R pipeline) dramatically improves automated metric performance compared to direct LLM scoring
LLMs frequently overgeneralize or omit critical details (e.g., specific comparators) which can fundamentally alter the interpretation of medical evidence

📚 Prerequisite Knowledge

Prerequisites

Understanding of Randomized Controlled Trials (RCTs)
Familiarity with text summarization and simplification tasks
Basic knowledge of evaluation metrics (correlation coefficients, ROUGE)

Key Terms

RCT: Randomized Controlled Trial—a scientific study design that randomly assigns participants to an experimental group or a control group to measure the effectiveness of an intervention

PICO: Population, Intervention, Comparator, Outcome—the four key components used to structure clinical evidence questions and define trial characteristics

PICO-R: PICO elements plus Results/Evidence Inference—the combination of trial characteristics and the findings concerning them

plain language summarization: The task of generating summaries that simplify technical content for lay readers, often involving elaboration and explanation of difficult concepts

Evidence Inference: The task of determining whether an intervention yielded a significant difference compared to a control group with respect to a specific outcome

hallucination: In this context, non-factual information added by the model that is not supported by the source text or general medical knowledge

Kendall's τb: A rank correlation coefficient used to measure the ordinal association between two measured quantities (e.g., human ratings vs. metric scores)
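
As a concrete illustration of the definition, here is a small pure-Python sketch of Kendall's τb using the textbook pairwise formula. It is quadratic in the number of items and meant only for illustration; the example ratings in the usage note are made up.

```python
import math
from itertools import combinations
from typing import Sequence


def kendall_tau_b(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall's tau-b: (C - D) / sqrt((C + D + Tx) * (C + D + Ty))."""
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both variables: excluded from both factors
        if dx == 0:
            ties_x += 1  # tied only in x
        elif dy == 0:
            ties_y += 1  # tied only in y
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    denom = math.sqrt(
        (concordant + discordant + ties_x) * (concordant + discordant + ties_y)
    )
    return (concordant - discordant) / denom if denom else float("nan")
```

For example, `kendall_tau_b([1, 2, 2, 3], [0.1, 0.4, 0.3, 0.9])` compares hypothetical human ratings against hypothetical metric scores; the tie between the two middle human ratings lowers the value below 1 even though no pair is discordant.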

Spearman's ρ: A nonparametric measure of rank correlation assessing how well the relationship between two variables can be described using a monotonic function
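
A minimal sketch of the computation: Spearman's ρ is the Pearson correlation of the two variables' ranks, with tied values receiving the average of the ranks they span. The helper names below are chosen for this example only.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / math.sqrt(var_a * var_b)


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    return pearson(average_ranks(x), average_ranks(y))
```

Because only ranks matter, any strictly increasing relationship (linear or not) yields ρ = 1.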

ROUGE-L: Recall-Oriented Understudy for Gisting Evaluation (Longest Common Subsequence)—a metric measuring text overlap based on the longest matching sequence of words
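
The longest-common-subsequence idea can be sketched as below. This is a simplified illustration: lowercased whitespace tokenization and the balanced F1 variant are assumptions of this example, and published ROUGE toolkits differ in tokenization, stemming, and the weighting of precision against recall.

```python
from typing import Sequence


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence, via row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference: str, candidate: str) -> float:
    """LCS-based F1 between two texts (assumed tokenization: lowercase + split)."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    recall = lcs / len(ref)
    precision = lcs / len(cand)
    return 2 * precision * recall / (precision + recall)
```

Unlike n-gram overlap, the matched words need not be contiguous, only in the same order.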

Flesch-Kincaid Grade Level: A readability metric that indicates the US school grade level required to understand a text
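
The grade level follows the standard formula 0.39 × (words / sentences) + 11.8 × (syllables / words) − 15.59. The sketch below applies it with a crude syllable counter (runs of vowels), which is an assumption of this example; readability tools typically use more careful syllable rules or pronunciation dictionaries, so exact values will differ.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic: one syllable per run of vowels, at least one per word."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid Grade Level with heuristic sentence, word, and syllable counts."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59
```

Short sentences of one-syllable words can score below zero, while a dense technical sentence scores far higher, which is the contrast the metric is used to capture between abstracts and plain language summaries.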

BERTScore: An automatic evaluation metric that computes similarity between candidate and reference text using contextual embeddings

zero-shot prompting: Asking a model to perform a task without providing any examples of that task in the prompt