| Benchmark | Metric | Baseline (%) | This Paper (%) | Δ (pp) |
|---|---|---|---|---|
| *Comparison of prompting strategies on the open-ended MedQA-Open dataset, scored by human expert evaluation.* | | | | |
| MedQA-Open (500 samples) | Expert Agreement % (Llama-2-7B-chat) | 56 | 83 | +27 |
| MedQA-Open (500 samples) | Expert Agreement % (Llama-2-70B-chat) | 84 | 87 | +3 |
| *Results using the Forward-Backward approach with the Verifier.* | | | | |
| MedQA-Open (500 samples) | Expert Agreement % (Llama-2-7B-chat) | 56 | 87 | +31 |
| ClinicianCases (25 samples) | Expert Agreement % (Llama-2-7B-chat) | 90 | 90 | 0 |