| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Comparison against baselines on LongQA (free-form generation) shows consistent answer-relevance improvements. | ||||
| LongQA | Answer Relevance | Not reported in the paper | Not reported in the paper | +2.69 |
| LongQA | Answer Relevance | Not reported in the paper | Not reported in the paper | +2.41 |
| Exact Match (EM) results on multiple-choice tasks show that simple dense ranking falls short of tree-based ordering. | ||||
| LongQA-MC | Exact Match (EM) | Not reported in the paper | Not reported in the paper | +4.06 |
| LongQA-MC | Exact Match (EM) | Not reported in the paper | Not reported in the paper | +2.9 |
| Ablation on NarrativeQA confirms gains on extremely long contexts. | ||||
| NarrativeQA | Answer Relevance | Not reported in the paper | Not reported in the paper | +2.97 |