Evaluation Setup
Evaluated on Ultra-long context (>100K), Long context (<32K), and Short context RAG (<4K) benchmarks.
Benchmarks:
- InfiniteBench (Ultra-long context (En.Sum, En.QA, En.MC, En.Dia))
- ChatRAG Bench (Conversational QA and RAG (10 datasets))
- LongBench / SCROLLS subset (Long context QA and Summarization (QMSum, Qasper, QuALITY, HotpotQA, etc.))
Metrics:
- F1 score
- ROUGE-L-Sum
- Exact Match (EM)
- Statistical methodology: Not explicitly reported in the paper
Key Results
| Benchmark |
Metric |
Baseline |
This Paper |
Δ |
| Ultra-long context performance (InfiniteBench) shows ChatQA-2 competitive with or superior to proprietary models. |
| InfiniteBench En.QA |
F1 |
48.8 |
56.6 |
+7.8
|
| InfiniteBench En.Sum |
ROUGE-L |
14.9 |
19.3 |
+4.4
|
| InfiniteBench En.MC |
Accuracy |
67.4 |
72.4 |
+5.0
|
| RAG capability on short context benchmarks (ChatRAG Bench). |
| ChatRAG Bench (Average) |
F1 |
51.3 |
52.9 |
+1.6
|
| Comparison of Long Context vs. RAG approaches on the same models (32K benchmarks). |
| 32K Benchmarks (Average) |
Average Score |
44.9 |
49.8 |
+4.9
|
Main Takeaways
- ChatQA-2-70B outperforms GPT-4-Turbo and Qwen2-72B on ultra-long context tasks (InfiniteBench) and standard RAG benchmarks.
- RAG (Retrieval-Augmented Generation) consistently outperforms direct processing of the full long context when a sufficient number of chunks (e.g., top-20) are retrieved, even for models with 128K context windows.
- The 'Needle in a Haystack' test is insufficient for evaluating real-world long-context performance; models that pass it may still fail on tasks like InfiniteBench.
- Separating documents with special tokens (<s>) during pretraining is more effective than using <BOS>/<EOS> tokens for context extension in Llama-3.