| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Correlation analysis showing how closely the proposed weighted metric agrees with human judgment. | | | | |
| Human Judgment Correlation | Pearson r | Not reported in the paper | 0.701 | Not reported in the paper |
| Dataset statistics highlighting the scale and complexity of the new benchmark. | | | | |
| LongHalluQA vs Original Datasets | Average Response Length Increase | 1.0 | 9.4 | 8.4 |
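
For readers unfamiliar with the Pearson r figure in the table, the following is a minimal sketch of how such a correlation between metric scores and human ratings could be computed. The function name `pearson_r` and the toy score lists are hypothetical and purely illustrative; they are not the paper's data or code.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (unnormalised covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalised variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical toy values, NOT taken from the paper.
    metric_scores = [0.2, 0.5, 0.7, 0.9]  # automatic metric, one score per response
    human_scores = [1, 3, 4, 5]           # human rating for the same responses
    print(f"Pearson r = {pearson_r(metric_scores, human_scores):.3f}")
```

A value near 1 indicates that the metric ranks and scales responses much as human annotators do; a value near 0 indicates little linear agreement.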