Evaluation Setup
Medical question answering (QA) on five standard benchmarks.
Benchmarks:
- USMLE (Medical Licensing Exam Questions)
- MMLU-Medical (Medical knowledge multiple choice)
- PubMedQA (Biomedical QA)
- BioASQ (Biomedical QA)
- MedMCQA (Medical entrance exam questions)
Metrics:
- Accuracy (Standard)
- Filtered Accuracy (Accuracy on responses that pass fact-check)
- Statistical methodology: Not explicitly reported in the paper
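
The two metrics above can be sketched as follows. This is a minimal illustration, not the paper's evaluation code: the `Response` record and its `passed_fact_check` flag are hypothetical names, and it assumes Filtered Accuracy uses only the responses that pass the fact-check as its denominator.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Response:
    predicted: str           # option chosen by the model, e.g. "B"
    gold: str                # reference answer
    passed_fact_check: bool  # hypothetical flag set by the fact-checker


def accuracy(responses: Sequence[Response]) -> Optional[float]:
    """Fraction of responses whose prediction matches the reference."""
    if not responses:
        return None  # undefined on an empty set
    return sum(r.predicted == r.gold for r in responses) / len(responses)


def filtered_accuracy(responses: Sequence[Response]) -> Optional[float]:
    """Accuracy restricted to responses that pass the fact-check.

    Assumption: the denominator is the number of passing responses,
    not the full benchmark size.
    """
    kept = [r for r in responses if r.passed_fact_check]
    return accuracy(kept)
```

Under this reading, Filtered Accuracy should be reported together with the share of responses that pass the filter, since a high value on a small passing subset is not comparable to standard accuracy on the full benchmark.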
Key Results
Evaluation of Mechanism I (Fact-Check-Then-RAG) on Llama-3-70B-Instruct shows improvements over both the base model and standard MedRAG.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| USMLE | Accuracy | 62.58 | 67.57 | +4.99 |
| PubMedQA | Accuracy | 73.20 | 86.20 | +13.00 |

Evaluation of Mechanism II (Self-Training with SimPO) on Llama-3-8B-Instruct shows that using fact-checks as preference signals improves base model performance.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| USMLE | Accuracy | 45.15 | 49.23 | +4.08 |
| PubMedQA | Accuracy | 71.00 | 77.80 | +6.80 |
| BioASQ | Accuracy | 74.04 | 81.49 | +7.45 |

Main Takeaways
- Fact-Check-Then-RAG avoids the performance degradation that standard MedRAG shows on some datasets (such as USMLE), likely because it retrieves only when necessary.
- Self-training works with fact-checking signals: both SFT (on verified responses) and SimPO (ranking by factuality) improve the 8B model, with SimPO gains ranging from +4.08 (USMLE) to +7.45 (BioASQ) accuracy points.
- SimPO with LEAF ranking generally outperforms SimPO with ArmoRM (a general reward model), suggesting domain-specific factuality is a better signal for medical QA than general preference.
- The gap between the best- and worst-ranked responses is larger under LEAF than under ArmoRM, indicating that LEAF discriminates correct from incorrect responses more sharply.
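
The ranking-based takeaways can be illustrated with a short sketch. It assumes that each question has several sampled responses, that a ranker (for example a factuality scorer) assigns each one a scalar score, and that the best-worst gap is measured as the accuracy of top-ranked responses minus the accuracy of bottom-ranked ones. The `Candidate` record, the function names, and the gap definition are illustrative assumptions, not the paper's implementation.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Candidate:
    text: str
    score: float      # hypothetical ranker score (higher is better)
    is_correct: bool  # whether the final answer matches the reference


def best_and_worst(cands: List[Candidate]) -> Tuple[Candidate, Candidate]:
    """Return the highest- and lowest-scored candidates for one question."""
    ranked = sorted(cands, key=lambda c: c.score)
    return ranked[-1], ranked[0]


def build_preference_pairs(samples: Dict[str, List[Candidate]]) -> List[dict]:
    """Turn ranked samples into prompt/chosen/rejected preference records."""
    pairs = []
    for prompt, cands in samples.items():
        if len(cands) < 2:
            continue  # a pair needs at least two responses
        best, worst = best_and_worst(cands)
        if best.score == worst.score:
            continue  # tied scores carry no preference signal
        pairs.append(
            {"prompt": prompt, "chosen": best.text, "rejected": worst.text}
        )
    return pairs


def best_worst_accuracy_gap(samples: Dict[str, List[Candidate]]) -> float:
    """Accuracy of top-ranked responses minus accuracy of bottom-ranked ones."""
    groups = [best_and_worst(c) for c in samples.values() if len(c) >= 2]
    if not groups:
        raise ValueError("need at least one question with two or more responses")
    best_acc = sum(b.is_correct for b, _ in groups) / len(groups)
    worst_acc = sum(w.is_correct for _, w in groups) / len(groups)
    return best_acc - worst_acc
```

Running `best_worst_accuracy_gap` once per ranker on the same sampled responses gives a direct comparison: a larger gap means the ranker's top choice is correct more often than its bottom choice, which is the sense of "more discriminative" used above.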