| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Mechanism I (Fact-Check-Then-RAG), evaluated on Llama-3-70B-Instruct, improves over both the base model and standard MedRAG. | ||||
| USMLE | Accuracy | 62.58 | 67.57 | +4.99 |
| PubMedQA | Accuracy | 73.20 | 86.20 | +13.00 |
| Mechanism II (Self-Training with SimPO), evaluated on Llama-3-8B-Instruct, shows that using fact-checks as preference signals improves over the base model on all three benchmarks. | ||||
| USMLE | Accuracy | 45.15 | 49.23 | +4.08 |
| PubMedQA | Accuracy | 71.00 | 77.80 | +6.80 |
| BioASQ | Accuracy | 74.04 | 81.49 | +7.45 |
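
The Δ column reads as the absolute difference in accuracy points (This Paper minus Baseline). As a minimal sketch of that arithmetic, the snippet below recomputes it from the figures in the table. The `ROWS` list and the `delta` helper are illustrative names introduced here, not part of the paper, and `Decimal` is used only so that two-decimal values subtract exactly.

```python
from decimal import Decimal

# (mechanism, benchmark, baseline, this paper), copied from the table above.
# Values are kept as strings so Decimal parses them exactly.
ROWS = [
    ("I", "USMLE", "62.58", "67.57"),
    ("I", "PubMedQA", "73.20", "86.20"),
    ("II", "USMLE", "45.15", "49.23"),
    ("II", "PubMedQA", "71.00", "77.80"),
    ("II", "BioASQ", "74.04", "81.49"),
]


def delta(baseline: str, ours: str) -> Decimal:
    """Absolute gain in accuracy points (hypothetical helper)."""
    return Decimal(ours) - Decimal(baseline)


if __name__ == "__main__":
    for mechanism, benchmark, baseline, ours in ROWS:
        gain = delta(baseline, ours)
        print(f"Mechanism {mechanism} / {benchmark}: {gain:+} points")
```

Because the differences are absolute points rather than relative gains, they are not directly comparable across rows with very different baselines.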