Beyond Distillation: Pushing the Limits of Medical LLM Reasoning with Minimalist Rule-Based RL

Che Liu, Haozhe Wang, Jiazhen Pan, Zhongwei Wan, Yong Dai, Fangzhen Lin, Wenjia Bai, D. Rueckert, Rossella Arcucci
Imperial College London, Technical University of Munich, Ohio State University, Fudan University
arXiv.org (2025)
Tags: RL · Reasoning · QA Benchmark

📝 Paper Summary

Medical Reasoning Reinforcement Learning (RL)
AlphaMed demonstrates that medical LLMs can develop advanced reasoning capabilities purely through rule-based reinforcement learning on public multiple-choice datasets, without needing expensive supervised fine-tuning on distilled chain-of-thought data.
Core Problem
Current medical LLMs rely heavily on Supervised Fine-Tuning (SFT) using costly Chain-of-Thought (CoT) data distilled from proprietary models like GPT-4, which limits scalability and introduces dependency on closed-source teachers.
Why it matters:
  • Distilling CoT data from commercial models is expensive and legally/ethically complex due to licensing restrictions
  • SFT on distilled data often leads to memorization of rationales rather than genuine reasoning generalization
  • Existing RL methods (PPO/DPO) require either complex learned reward models or ambiguous preference pairs that are hard to define in medical contexts
Concrete Example: A standard medical model might answer a complex clinical case correctly but fail to explain why, or hallucinate a rationale it memorized during SFT. AlphaMed, trained only on final answers (A/B/C/D), spontaneously generates step-by-step reasoning (e.g., 'Step 1... Step 2...') to derive the correct conclusion, despite never seeing such traces during training.
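The "trained only on final answers" setup above reduces the reward signal to a single rule: does the extracted answer letter match the gold label? A minimal sketch of such a rule-based reward is below; the `Answer: X` extraction pattern and function name are illustrative assumptions, not the paper's actual implementation.

```python
import re

def rule_based_reward(completion: str, gold_choice: str) -> float:
    """Binary reward: 1.0 iff the completion's final answer letter
    matches the gold multiple-choice label, else 0.0.

    Assumes the model is prompted to end with a line like "Answer: B";
    the exact extraction regex is a hypothetical choice for this sketch.
    """
    match = re.search(r"Answer:\s*([A-D])", completion, flags=re.IGNORECASE)
    if match is None:
        return 0.0  # unparseable output earns no reward
    return 1.0 if match.group(1).upper() == gold_choice.upper() else 0.0
```

Note that the reward never inspects the intermediate "Step 1... Step 2..." text: any chain-of-thought the model produces is rewarded only indirectly, through its effect on final-answer accuracy.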
Key Novelty
AlphaMed (Minimalist Rule-Based RL for Medical Reasoning)
  • Train medical LLMs using Group Relative Policy Optimization (GRPO) with a simple binary reward based on final answer correctness, bypassing the need for a separate reward model
  • Replace the standard 'SFT on CoT' pipeline with direct RL on informative multiple-choice questions, showing that reasoning emerges as a byproduct of optimizing for accuracy
  • Use a data-centric selection strategy that prioritizes high-informativeness data (e.g., USMLE questions) over scale, discarding noisy datasets that hinder reasoning
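The first bullet's key simplification is that GRPO needs no learned value or reward model: for each question it samples a group of completions, scores each with the binary rule, and normalizes rewards within the group to get per-completion advantages. A minimal sketch of that group-relative normalization, under the standard GRPO formulation (function name is mine):

```python
from statistics import mean, pstdev

def grpo_advantages(rewards: list[float]) -> list[float]:
    """Group-relative advantages as in GRPO: z-score each sampled
    completion's reward against its own group's mean and std dev,
    replacing a learned critic/reward model entirely."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:
        # All-correct or all-wrong group: no relative signal to learn from.
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]
```

With binary rewards, correct completions in a mixed group get positive advantage and incorrect ones negative, so the policy gradient pushes probability mass toward whatever reasoning led to the right letter.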
Evaluation Highlights
  • AlphaMed-70B outperforms GPT-4o and Claude-3.5-Sonnet on the challenging MedXpert benchmark (84.2% vs ~82-83%)
  • AlphaMed-8B surpasses HuatuoGPT-o1-8B (trained with distilled CoT) across all six benchmarks, including a +3.2% gain on MedXpert
  • Data informativeness matters more than scale: training on MedQA-style questions helps, while adding noisy PubMedQA data actually degraded performance
Breakthrough Assessment
9/10
Strongly challenges the prevailing paradigm that CoT SFT is necessary for reasoning. Achieving SOTA results (beating GPT-4o) using only public multiple-choice data and simple RL is a significant efficiency and methodology breakthrough.