| Benchmark | Metric | Baseline (%) | This Paper (%) | Δ (pp) |
|---|---|---|---|---|
| *Comparison of prompting strategies on the open-ended MedQA-Open dataset, scored by human expert evaluation.* | | | | |
| MedQA-Open (500 samples) | Expert Agreement % (Llama-2-7B-chat) | 56 | 83 | +27 |
| MedQA-Open (500 samples) | Expert Agreement % (Llama-2-70B-chat) | 84 | 87 | +3 |
| *Results using the Forward-Backward approach with the Verifier.* | | | | |
| MedQA-Open (500 samples) | Expert Agreement % (Llama-2-7B-chat) | 56 | 87 | +31 |
| ClinicianCases (25 samples) | Expert Agreement % (Llama-2-7B-chat) | 90 | 90 | 0 |