ChatQA 2: Bridging the Gap to Proprietary LLMs in Long Context and RAG Capabilities

📝 Paper Summary

Long Context LLM, Modularized RAG pipeline

ChatQA 2 extends Llama3 to a 128K context window via continued pretraining and stage-wise instruction tuning, demonstrating that retrieval-augmented generation outperforms pure long-context processing when using sufficient retrieved chunks.

Core Problem

Open-access LLMs lag behind proprietary models (like GPT-4) in handling ultra-long contexts, and there is a lack of open recipes for effectively combining long-context capabilities with retrieval-augmented generation.

Why it matters:

Processing large volumes of information (e.g., hundreds of pages) is essential for real-world enterprise applications but current open models often fail at ultra-long tasks.
The trade-off between feeding an entire document into a long context window versus using retrieval (RAG) is poorly understood for open models.
Existing open long-context models are often evaluated on synthetic tasks (like Needle-in-a-Haystack) rather than realistic downstream tasks.

Concrete Example: When answering a question about a 100K+ token document, a standard Llama-3-70B (8K context) cannot fit the text in its context window, while standard RAG with a small top-k (e.g., k=5) might miss the passage containing the answer. ChatQA 2 addresses this by supporting a 128K context window and by showing that RAG with a larger top-k (e.g., top-20 chunks) can outperform feeding the full text directly.

Key Novelty

Llama3-ChatQA-2-70B (128K Context)

Extends Llama-3's context from 8K to 128K by increasing RoPE base frequency and continuing pretraining on upsampled long documents.
Uses a three-stage instruction tuning recipe: (1) short instruction following, (2) short RAG/contextual QA, and (3) long-context instruction tuning using synthetic and aggregated long datasets.
Integrates a long-context retriever (E5-Mistral) to demonstrate that retrieving many chunks (RAG) is often superior to feeding the full long document directly.
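
To make the effect of the larger RoPE base more concrete, the following minimal sketch computes per-dimension RoPE wavelengths for the two base values mentioned in this summary (500,000 and 150,000,000). The head dimension of 128 is an assumption that is not stated here, and the function names are illustrative rather than taken from the paper's code.

```python
import math

def rope_wavelengths(base: float, head_dim: int = 128) -> list[float]:
    """Wavelength (in tokens) of each RoPE frequency pair.

    Standard RoPE uses inverse frequencies base**(-2i/d) for i in [0, d/2);
    the wavelength of pair i is 2*pi divided by that frequency.
    head_dim=128 is an assumed value, not taken from this summary.
    """
    return [2 * math.pi * base ** (2 * i / head_dim) for i in range(head_dim // 2)]

def pairs_not_wrapping(base: float, context_len: int, head_dim: int = 128) -> int:
    """Count frequency pairs whose wavelength exceeds the context length,
    i.e. pairs that never complete a full rotation inside the window."""
    return sum(w > context_len for w in rope_wavelengths(base, head_dim))

CONTEXT = 128 * 1024  # 128K tokens

for base in (500_000, 150_000_000):
    longest = rope_wavelengths(base)[-1]
    print(f"base={base:>11,}  longest wavelength ~ {longest:,.0f} tokens  "
          f"pairs longer than 128K: {pairs_not_wrapping(base, CONTEXT)}")
```

One common intuition, offered here as an interpretation rather than a claim from the paper, is that a larger base stretches the slowly rotating dimensions so that more of them remain unambiguous across the full 128K window.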

Evaluation Highlights

ChatQA-2-70B achieves 56.6 F1 on the InfiniteBench En.QA task (128K context), outperforming GPT-4-Turbo-2024-04-09 (48.8 F1) and Qwen2-72B-Instruct (43.4 F1).
On the RAG benchmark (ChatRAG Bench) using a 4K context window, ChatQA-2-70B scores 52.9 average F1, surpassing GPT-4-Turbo (51.3 F1) and Llama-3-70B-Instruct (49.3 F1).
Using RAG with top-20 retrieved chunks yields better performance (49.8 F1 on average) than using the full long context (44.9 F1) across 32K context benchmarks.

Breakthrough Assessment

8/10

Provides a reproducible recipe for bringing open models to GPT-4 level on long-context tasks. The finding that RAG outperforms direct long-context processing (even with 128K windows) is practically significant.

⚙️ Technical Details

Problem Definition

Setting: Long-context language modeling and retrieval-augmented question answering

Inputs: Long text documents (up to 128K tokens) or a user query q requiring external knowledge

Outputs: Answer generated based on the long context or retrieved chunks

Pipeline Flow

Retriever (E5-Mistral) retrieves top-k chunks from corpus
Input Formatter concatenates query + retrieved chunks (or full long document)
Llama3-ChatQA-2 (128K) generates response
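
The three steps above can be sketched as follows. This is a toy illustration only: the bag-of-words `embed` function is a stand-in for a dense retriever such as E5-Mistral, the prompt layout in `build_prompt` is a hypothetical format (the actual ChatQA 2 template is not given here), and the generation step is omitted.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a stand-in for a dense retriever."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve_top_k(query: str, chunks: list[str], k: int) -> list[str]:
    """Step 1: rank chunks by similarity to the query and keep the best k."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]

def build_prompt(query: str, contexts: list[str]) -> str:
    """Step 2: concatenate the retrieved chunks (or a full document) with the query.
    The layout is hypothetical, not the model's real chat template."""
    context_block = "\n\n".join(contexts)
    return f"{context_block}\n\nQuestion: {query}\nAnswer:"

chunks = [
    "The RoPE base frequency was raised to extend the context window.",
    "SlimPajama is a deduplicated pretraining corpus.",
    "The retriever returns the top-k chunks for each query.",
]
query = "Which corpus is deduplicated?"
prompt = build_prompt(query, retrieve_top_k(query, chunks, k=2))
print(prompt)  # Step 3 would pass this prompt to the long-context model
```

Passing the full document as a single-element `contexts` list gives the long-context variant of the same flow, which is what makes the two approaches directly comparable.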

System Modules

Retriever

Retrieve relevant text chunks given a query

Model or implementation: E5-Mistral-7B-Instruct

ChatQA-2 Model

Process long context or retrieved chunks to answer user query

Model or implementation: Llama3-ChatQA-2-70B (128K context)

Novel Architectural Elements

Three-stage instruction tuning pipeline specifically designed to blend short-context SFT, RAG capability, and synthetic long-context data

Modeling

Base Model: Llama-3-70B-Base

Training Method: Continued Pretraining followed by Supervised Fine-Tuning (SFT)

Training Data:

Continued Pretraining: SlimPajama with upsampled long sequences (10B-token corpus prepared; about 8B tokens are consumed during training, per the hyperparameters below)
Stage 1 SFT: High-quality short instruction data
Stage 2 SFT: Conversational QA with context (short)
Stage 3 SFT: Blend of Stage 1/2 data + LongAlpaca12k + OpenOrca (GPT-4) + Long Data Collections + Synthetic NarrativeQA (inserted summaries into long books)
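
As a rough sketch of how the staged recipe above could be organized, the snippet below models each SFT stage as a named blend of datasets and runs them in order. The dataset identifiers, the `SftStage` type, and the assumption that each stage resumes from the previous stage's checkpoint are illustrative and are not confirmed details of the authors' training code.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SftStage:
    name: str
    datasets: tuple[str, ...]

# Dataset identifiers are placeholders for the data described above.
STAGE_1 = SftStage("stage1_short_instructions", ("short_instruction_data",))
STAGE_2 = SftStage("stage2_short_contextual_qa", ("conversational_qa_with_context",))
STAGE_3 = SftStage(
    "stage3_long_context",
    # Stage 3 blends the earlier data with the long-context sets.
    STAGE_1.datasets + STAGE_2.datasets + (
        "LongAlpaca12k", "OpenOrca_gpt4", "Long_Data_Collections", "synthetic_NarrativeQA",
    ),
)
SCHEDULE = (STAGE_1, STAGE_2, STAGE_3)

def run_schedule(schedule, train_fn):
    """Run stages in order; each stage is assumed to start from the previous checkpoint."""
    checkpoint = "continued_pretraining_checkpoint"
    for stage in schedule:
        checkpoint = train_fn(checkpoint, stage)
    return checkpoint

def fake_train(checkpoint: str, stage: SftStage) -> str:
    # Placeholder: a real implementation would fine-tune and save weights.
    return f"{checkpoint}->{stage.name}"

print(run_schedule(SCHEDULE, fake_train))
```

Re-using the Stage 1 and Stage 2 data inside Stage 3 mirrors the "blend" described above, presumably so that short-context skills are retained while long-context data is introduced.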

Key Hyperparameters:

rope_base_frequency: 150,000,000 (increased from 500,000)
learning_rate_pretraining: 3e-5
learning_rate_sft: 3e-5
batch_size: 32
pretraining_steps: 2000
pretraining_tokens: 8 Billion
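
The reported token count can be sanity-checked with a short calculation. This sketch assumes that every training sequence is 128K (131,072) tokens long and that batch_size refers to the continued-pretraining stage; neither assumption is stated explicitly in the list above.

```python
SEQ_LEN = 128 * 1024      # assumed tokens per training sequence (128K)
BATCH_SIZE = 32           # sequences per step (from the list above)
STEPS = 2000              # pretraining steps (from the list above)

tokens_per_step = SEQ_LEN * BATCH_SIZE
total_tokens = tokens_per_step * STEPS

print(f"tokens per step: {tokens_per_step:,}")
print(f"total tokens:    {total_tokens:,} (~{total_tokens / 1e9:.1f}B)")
```

Under these assumptions each step processes roughly 4 million tokens, and 2000 steps land close to the 8 billion tokens listed.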

Compute: Not reported in the paper

Comparison to Prior Work

vs. Llama-3-Gradient: ChatQA 2 uses continued pretraining on long data before SFT, rather than just SFT with interpolation
vs. Qwen2: ChatQA 2 focuses specifically on optimizing for RAG performance alongside long context, showing RAG superiority
vs. GPT-4-Turbo: ChatQA 2 is open-weights and matches/exceeds performance on specific long-context QA benchmarks

Limitations

NarrativeQA dataset had to be excluded from benchmarks because it was used for synthetic training data generation.
The model relies on a specific retriever (E5-Mistral) for optimal RAG performance.
Ultra-long context evaluation is limited to English tasks.

Reproducibility

Code: https://chatqa2-project.github.io/

Code, weights, training data, and evaluation setup are publicly released via the project page (https://chatqa2-project.github.io/). Note: NarrativeQA is used for training but excluded from evaluation benchmarks to prevent contamination.

📊 Experiments & Results

Evaluation Setup

Evaluated on Ultra-long context (>100K), Long context (<32K), and Short context RAG (<4K) benchmarks.

Benchmarks:

InfiniteBench: ultra-long context (En.Sum, En.QA, En.MC, En.Dia)
ChatRAG Bench: conversational QA and RAG (10 datasets)
LongBench / SCROLLS subset: long-context QA and summarization (QMSum, Qasper, QuALITY, HotpotQA, etc.)

Metrics:

F1 score
ROUGE-L-Sum
Exact Match (EM)
Statistical methodology: Not explicitly reported in the paper
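
For readers unfamiliar with the QA metrics listed above, here is a minimal sketch of token-level F1 and Exact Match in the style commonly used for extractive QA. The normalization steps (lowercasing, stripping punctuation and English articles) are an assumption; the exact scoring scripts used in the paper are not described here.

```python
import re
import string
from collections import Counter

def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation and English articles, split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()

def exact_match(prediction: str, reference: str) -> bool:
    """True when the normalized prediction equals the normalized reference."""
    return normalize(prediction) == normalize(reference)

def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall after normalization."""
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

print(exact_match("The Eiffel Tower.", "eiffel tower"))
print(token_f1("in Paris France", "Paris"))
```

F1 gives partial credit when a prediction contains the reference answer plus extra words, whereas Exact Match does not, which is why the two are often reported together.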

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Ultra-long context performance (InfiniteBench): ChatQA-2 is competitive with or superior to proprietary models. The En.QA baseline is GPT-4-Turbo-2024-04-09.
InfiniteBench En.QA	F1	48.8	56.6	+7.8
InfiniteBench En.Sum	ROUGE-L	14.9	19.3	+4.4
InfiniteBench En.MC	Accuracy	67.4	72.4	+5.0
RAG capability on short-context benchmarks (ChatRAG Bench). The baseline is GPT-4-Turbo.
ChatRAG Bench (Average)	F1	51.3	52.9	+1.6
Comparison of Long Context vs. RAG approaches on the same model (32K benchmarks). Here "Baseline" is the full long context and "This Paper" is RAG with top-20 retrieved chunks.
32K Benchmarks (Average)	Average Score	44.9	49.8	+4.9

Experiment Figures

Comparison of 'Long Context' vs 'RAG' approaches across different numbers of retrieved chunks (Top-k) on 32K benchmarks.

Main Takeaways

ChatQA-2-70B outperforms GPT-4-Turbo and Qwen2-72B on ultra-long context tasks (InfiniteBench) and standard RAG benchmarks.
RAG (Retrieval-Augmented Generation) consistently outperforms direct processing of the full long context when a sufficient number of chunks (e.g., top-20) are retrieved, even for models with 128K context windows.
The 'Needle in a Haystack' test is insufficient for evaluating real-world long-context performance; models that pass it may still fail on tasks like InfiniteBench.
Separating documents with special tokens (<s>) during pretraining is more effective than using <BOS>/<EOS> tokens for context extension in Llama-3.
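
The document-separator idea in the last takeaway can be illustrated with a small packing routine. This is a sketch under stated assumptions: documents are already tokenized, tokens are plain strings, the separator is appended after every document, and the leftover tail shorter than one sequence is dropped. The actual preprocessing used for ChatQA 2 may differ.

```python
from typing import Iterable, Iterator

def pack_documents(docs: Iterable[list[str]], seq_len: int, sep: str = "<s>") -> Iterator[list[str]]:
    """Concatenate tokenized documents with a separator token and cut the
    stream into fixed-length training sequences (the tail is dropped)."""
    buffer: list[str] = []
    for doc in docs:
        buffer.extend(doc)
        buffer.append(sep)  # mark the document boundary
        while len(buffer) >= seq_len:
            yield buffer[:seq_len]
            buffer = buffer[seq_len:]

docs = [["a", "b", "c"], ["d", "e"], ["f", "g", "h", "i"]]
packed = list(pack_documents(docs, seq_len=4))
print(packed)
```

In a real pipeline the separator would be a token ID from the model's tokenizer rather than a string, and swapping `sep` is all that is needed to compare different boundary markers.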

📚 Prerequisite Knowledge

Prerequisites

Rotary Position Embeddings (RoPE) and base frequency scaling
Retrieval-Augmented Generation (RAG) pipelines
Instruction tuning methodologies

Key Terms

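RoPE: Rotary Position Embedding, a method for encoding token positions in Transformers by rotating query and key vectors by position-dependent angles, so that attention depends on relative position; extending it well beyond the training length usually requires adjustments such as a larger base frequency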

base frequency: A hyperparameter in RoPE (theta) that controls the wavelength of position encodings; increasing it helps models handle longer sequences

RAG: Retrieval-Augmented Generation—fetching relevant text chunks to include in the model's prompt before generating an answer

E5-Mistral: A specific dense retrieval model capable of processing longer input queries/documents than standard BERT-based retrievers

SlimPajama: A large-scale, deduplicated dataset used for training language models

InfiniteBench: A benchmark suite designed to evaluate LLMs on tasks requiring ultra-long context windows (100K+ tokens)

ChatRAG Bench: A benchmark for evaluating conversational QA and RAG capabilities

NarrativeQA: A dataset used here for generating synthetic long-context training data by inserting summaries into long book documents