| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Calibration on Llama-3-8B-Instruct (lower ECE is better): LoVeC, trained via GRPO/DPO, outperforms prompt-based methods. | ||||
| ASQA (Iterative Tagging) | ECE | 0.45 | 0.12 | -0.33 |
| Efficiency: inference is 20× faster than the sampling-based baseline. | ||||
| Inference Latency | Speedup Factor | 1.0 | 20.0 | +19.0 |
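
The ECE column reports expected calibration error, which measures the gap between stated confidence and observed accuracy. The sketch below shows the standard equal-width binned form of the metric. It is a minimal illustration, not the paper's evaluation code: the function name `expected_calibration_error`, the 10-bin default, and the use of binary correctness labels are all assumptions, and the paper may bin or score correctness differently.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted mean of |accuracy - mean confidence| per bin.

    confidences: floats in [0, 1], one per prediction.
    correct:     0/1 labels, one per prediction (assumed binary here).
    n_bins:      number of equal-width bins (assumed default of 10).
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("at least one prediction is required")

    # Assign each prediction to an equal-width confidence bin.
    bins = [[] for _ in range(n_bins)]
    for conf, label in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, label))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        # Weight each bin's gap by the fraction of predictions it holds.
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece


if __name__ == "__main__":
    # Four predictions at 0.9 confidence, three of them correct.
    print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0]))
```

In this toy case the model claims 90% confidence but is right 75% of the time, so the single occupied bin contributes a gap of about 0.15. A perfectly calibrated model would score 0.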