Evaluation Setup
Long-context Question Answering on books and narratives
Benchmarks:
- HELMET (LongQA): long-document QA
- HELMET (LongQA-MC): multiple-choice QA
- NarrativeQA: QA over extremely long narratives (>256K tokens)
Metrics:
- Ragas Answer Relevance
- Exact Match (EM) Accuracy
- Statistical methodology: Not explicitly reported in the paper
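To make the Exact Match metric concrete, here is a minimal sketch of how an EM accuracy could be computed for multiple-choice answers. The normalization rules (lowercasing, stripping punctuation, collapsing whitespace) and the names `normalize` and `exact_match_accuracy` are assumptions made for illustration; the summary does not state which normalization the benchmark applies.

```python
import string


def normalize(text: str) -> str:
    """Lowercase, strip punctuation, collapse whitespace (assumed rules)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return " ".join(text.split())


def exact_match_accuracy(predictions, references):
    """Percentage of predictions equal to their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return 100.0 * hits / len(predictions)
```

Under this reading, a Δ of +4.06 on LongQA-MC would mean roughly four more exactly matching answers per hundred questions.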
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Comparison against baselines on LongQA (free-form generation) shows consistent relevance improvements. |
| LongQA | Answer Relevance | Not reported in the paper | Not reported in the paper | +2.69 |
| LongQA | Answer Relevance | Not reported in the paper | Not reported in the paper | +2.41 |
| Exact Match (EM) results on Multiple Choice tasks highlight the failure of simple dense ranking compared to tree-based ordering. |
| LongQA-MC | Exact Match (EM) | Not reported in the paper | Not reported in the paper | +4.06 |
| LongQA-MC | Exact Match (EM) | Not reported in the paper | Not reported in the paper | +2.9 |
| Ablation on NarrativeQA confirms gains on extremely long contexts. |
| NarrativeQA | Answer Relevance | Not reported in the paper | Not reported in the paper | +2.97 |
Main Takeaways
- Dependency-aware ordering (Chow-Liu) consistently outperforms both default document order and simple semantic ranking across all tested models.
- Greedy approaches (like localized DFS or simple dense ranking) are suboptimal because they may separate globally dependent chunks.
- The method is robust across model sizes (from GPT-4.1-mini to GPT-4.1) but sensitive to how chunk similarity is computed (BM25 underperforms dense embeddings).
- Structuring input based on global dependencies mitigates the 'lossy compression' effect inherent in sequential memory updates.
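The dependency-aware ordering idea in the takeaways can be sketched as follows. The Chow-Liu algorithm builds a maximum-weight spanning tree over pairwise mutual information; this sketch substitutes a generic symmetric `affinity` matrix between chunks (for example, embedding similarity) as a stand-in. The breadth-first traversal, the choice of root, and the function names are all assumptions for illustration, since the summary does not describe the paper's exact procedure.

```python
from collections import deque


def max_spanning_tree(affinity):
    """Prim's algorithm: adjacency lists of a maximum-weight spanning tree."""
    n = len(affinity)
    adj = {i: [] for i in range(n)}
    if n == 0:
        return adj
    # best[j] = (weight, parent) of the strongest edge linking j to the tree
    best = {j: (affinity[0][j], 0) for j in range(1, n)}
    while best:
        j = max(best, key=lambda k: best[k][0])
        _, parent = best.pop(j)
        adj[parent].append(j)
        adj[j].append(parent)
        for k in best:
            if affinity[j][k] > best[k][0]:
                best[k] = (affinity[j][k], j)
    return adj


def tree_order(affinity, root=0):
    """Breadth-first walk of the tree, following stronger edges first (assumed rule)."""
    adj = max_spanning_tree(affinity)
    if not adj:
        return []
    order, seen, queue = [], {root}, deque([root])
    while queue:
        node = queue.popleft()
        order.append(node)
        for nxt in sorted(adj[node], key=lambda k: -affinity[node][k]):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return order


# Hypothetical affinities between four chunks (not data from the paper).
toy = [
    [0.0, 0.1, 0.9, 0.8],
    [0.1, 0.0, 0.2, 0.7],
    [0.9, 0.2, 0.0, 0.3],
    [0.8, 0.7, 0.3, 0.0],
]
print(tree_order(toy))  # [0, 2, 3, 1]
```

In the toy matrix, chunk 1 is only weakly tied to chunk 0 directly but strongly tied to chunk 3, so the tree places it after chunk 3 rather than in its original position. A ranking by similarity to a single anchor would not capture that kind of chained dependency.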