| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Performance in the 'Wisdom of Crowds' setting on Verdict Accuracy and Justification Quality. | ||||
| Fact-Audit (Wisdom of Crowds) | Verdict Accuracy | 0.7636 | 0.7727 | +0.0091 |
| Fact-Audit (Wisdom of Crowds) | Justification Score | 7.73 | 8.55 | +0.82 |
| Impact of adaptive auditing: performance on the initial 'Prototype' data (Baseline column) vs. the harder 'Probed' data (This Paper column). | ||||
| Fact-Audit (Probing Effect) | Verdict Accuracy (GPT-4o) | 0.7600 | 0.6667 | -0.0933 |
| Fact-Audit (Probing Effect) | Verdict Accuracy (GPT-4o) | 0.9385 | 0.7727 | -0.1658 |
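
The Δ column appears to be the second value column minus the first, rounded to the precision shown. As a minimal sketch of that arithmetic, the snippet below recomputes each Δ from the numbers in the table. The names `ROWS`, `delta`, and `check_rows` are illustrative assumptions and are not taken from the paper.

```python
# Rows copied verbatim from the table:
# (benchmark, metric, first value column, second value column, reported delta)
ROWS = [
    ("Fact-Audit (Wisdom of Crowds)", "Verdict Accuracy", 0.7636, 0.7727, 0.0091),
    ("Fact-Audit (Wisdom of Crowds)", "Justification Score", 7.73, 8.55, 0.82),
    ("Fact-Audit (Probing Effect)", "Verdict Accuracy (GPT-4o)", 0.7600, 0.6667, -0.0933),
    ("Fact-Audit (Probing Effect)", "Verdict Accuracy (GPT-4o)", 0.9385, 0.7727, -0.1658),
]


def delta(first, second, ndigits=4):
    """Return second - first, rounded to the table's precision (assumed 4 decimals)."""
    return round(second - first, ndigits)


def check_rows(rows, tol=1e-9):
    """Return one boolean per row: does the recomputed delta match the reported one?"""
    return [abs(delta(first, second) - reported) <= tol
            for _, _, first, second, reported in rows]


if __name__ == "__main__":
    for row, ok in zip(ROWS, check_rows(ROWS)):
        print(row[0], "|", row[1], "|", delta(row[2], row[3]), "| matches:", ok)
```

Under this reading, a negative Δ in the Probing Effect rows means accuracy dropped on the harder data.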