ChatQA 2: Bridging the Gap to Proprietary LLMs in Long Context and RAG Capabilities

P Xu, W Ping, X Wu, C Xu, Z Liu, M Shoeybi…
NVIDIA
arXiv, 7/2024 (2024)
RAG Pretraining QA Benchmark

📝 Paper Summary

Long Context LLM Modularized RAG pipeline
ChatQA 2 extends Llama-3 to a 128K context window via continued pretraining and stage-wise instruction tuning, and shows that retrieval-augmented generation (RAG) can outperform pure long-context processing when enough chunks are retrieved.
Core Problem
Open-access LLMs lag behind proprietary models (like GPT-4) in handling ultra-long contexts, and there is a lack of open recipes for effectively combining long-context capabilities with retrieval-augmented generation.
Why it matters:
  • Processing large volumes of information (e.g., hundreds of pages) is essential for real-world enterprise applications, but current open models often fail at ultra-long tasks.
  • The trade-off between feeding an entire document into a long context window versus using retrieval (RAG) is poorly understood for open models.
  • Existing open long-context models are often evaluated on synthetic tasks (like Needle-in-a-Haystack) rather than realistic downstream tasks.
Concrete Example: When answering a question based on a 100K+ token document, a standard Llama-3-70B (8K context) cannot fit the text into its context window. Meanwhile, standard RAG with a small top-k (e.g., k=5) might miss the chunk that contains the answer. ChatQA 2 addresses both problems by enabling 128K context processing and by showing that RAG with a large top-k (e.g., top-20 chunks) outperforms processing the full text directly.
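The effect of top-k in the example above can be illustrated with a minimal retrieval sketch. This is not the paper's pipeline: the bag-of-words similarity is a toy stand-in for a learned embedding model, and the names chunk_text, bow_vector, cosine, and retrieve_top_k are hypothetical helpers introduced only for illustration. Returning the selected chunks in document order is a design choice of this sketch, not something the summary specifies.

```python
import math
import re
from collections import Counter


def chunk_text(text, chunk_size):
    """Split text into consecutive chunks of at most `chunk_size` words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]


def bow_vector(text):
    """Toy bag-of-words vector (assumption: stands in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_top_k(question, chunks, k):
    """Return the k chunks most similar to the question, kept in document order."""
    q = bow_vector(question)
    ranked = sorted(
        range(len(chunks)),
        key=lambda i: cosine(q, bow_vector(chunks[i])),
        reverse=True,
    )
    return [chunks[i] for i in sorted(ranked[:k])]
```

With a small k, the chunk holding the answer is dropped whenever it ranks just below the cutoff; raising k trades a longer prompt for a higher chance that the answer is included.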
Key Novelty
Llama3-ChatQA-2-70B (128K Context)
  • Extends Llama-3's context from 8K to 128K by increasing RoPE base frequency and continuing pretraining on upsampled long documents.
  • Uses a three-stage instruction tuning recipe: (1) short instruction following, (2) short RAG/contextual QA, and (3) long-context instruction tuning using synthetic and aggregated long datasets.
  • Integrates a long-context retriever (E5-Mistral) to demonstrate that retrieving many chunks (RAG) is often superior to feeding the full long document directly.
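To see why raising the RoPE base frequency helps with longer contexts, the sketch below computes the rotary frequencies for a given base using the standard RoPE formulation, base^(-2i/d). The head dimension of 128 and the original base of 500,000 match the public Llama-3 configuration; the larger base of 150,000,000 is an assumed value used for illustration, since this summary does not state the exact figure. The function names are hypothetical.

```python
import math


def rope_inverse_frequencies(base, head_dim):
    """Per-pair rotation frequencies base^(-2i/d) for i = 0 .. d/2 - 1."""
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]


def longest_wavelength(base, head_dim):
    """Positions needed for the slowest-rotating pair to complete one full turn."""
    slowest = rope_inverse_frequencies(base, head_dim)[-1]
    return 2.0 * math.pi / slowest


HEAD_DIM = 128              # Llama-3 attention head dimension
ORIGINAL_BASE = 500_000     # Llama-3 default rope_theta
LARGER_BASE = 150_000_000   # assumed value for illustration only

for base in (ORIGINAL_BASE, LARGER_BASE):
    print(f"base={base:>11,}  longest wavelength ~ {longest_wavelength(base, HEAD_DIM):,.0f} positions")
```

A larger base stretches the wavelengths of the low-frequency dimensions, so distant positions remain distinguishable; continued pretraining on long documents then adapts the model to the new positional scale.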
Evaluation Highlights
  • ChatQA-2-70B achieves 56.6 F1 on the InfiniteBench En.QA task (128K context), outperforming GPT-4-Turbo-2024-04-09 (48.8 F1) and Qwen2-72B-Instruct (43.4 F1).
  • On the RAG benchmark (ChatRAG Bench) using a 4K context window, ChatQA-2-70B scores 52.9 average F1, surpassing GPT-4-Turbo (51.3 F1) and Llama-3-70B-Instruct (49.3 F1).
  • Using RAG with top-20 retrieved chunks yields better performance (49.8 F1 on average) than using the full long context (44.9 F1) across 32K context benchmarks.
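The F1 scores above are answer-overlap metrics. A minimal sketch of token-level F1 in the common SQuAD style follows; the exact normalization used by each benchmark (punctuation and article stripping, multiple references) may differ, and token_f1 is a hypothetical helper name. The function returns a value in [0, 1], whereas the summary reports scores on a 0 to 100 scale.

```python
from collections import Counter


def token_f1(prediction, reference):
    """Token-overlap F1 between a predicted answer and one reference answer."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred == ref)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Because precision penalizes extra tokens, a verbose answer that contains the correct span still scores below a concise one, which is worth keeping in mind when comparing models with different answer styles.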
Breakthrough Assessment
8/10
Provides a reproducible recipe for bringing open models to GPT-4 level on long-context tasks. The finding that RAG outperforms direct long-context processing (even with 128K windows) is practically significant.