| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Hallucination Detection Task: FG-PRM compares favorably with prompting strong LLMs (ChatGPT, Claude) to identify specific error types. | | | | |
| MATH (Synthetic Test Set) | F1 Score | Not reported as a single aggregate number (see notes) | Not reported as a single aggregate number (see notes) | Not reported as a single aggregate number |
| Verification Task: Ranking 64 candidate solutions generated by Llama-3-70B and selecting the best one. | | | | |
| GSM8K | Accuracy | See note | See note | See note |
| MATH | Accuracy | See note | See note | See note |
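
The verification task in the table amounts to best-of-N selection: score each candidate solution and keep the highest-scoring one. A minimal sketch of that selection step follows. The names `select_best` and `toy_score` are hypothetical, and `toy_score` is a stand-in scorer rather than FG-PRM; how the paper turns step-level judgments into a single solution score is not stated here, so the scorer is left as a pluggable function.

```python
from typing import Callable, Sequence


def select_best(candidates: Sequence[str], score: Callable[[str], float]) -> str:
    """Return the candidate with the highest score (the first one wins ties)."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=score)


def toy_score(solution: str) -> float:
    # Placeholder scorer (an assumption for illustration, NOT FG-PRM):
    # it simply prefers shorter solutions.
    return -float(len(solution))


if __name__ == "__main__":
    # In the setting above there would be 64 candidates per problem;
    # three are enough to show the selection step.
    pool = ["a long candidate solution", "short", "medium length"]
    print(select_best(pool, toy_score))
```

Accuracy on GSM8K or MATH would then be the fraction of problems for which the selected candidate's final answer is correct.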