DomainRAG: A Chinese Benchmark for Evaluating Domain-specific Retrieval-Augmented Generation

📝 Paper Summary

RAG Evaluation, Domain-specific RAG

DomainRAG is a Chinese benchmark utilizing university enrollment data to evaluate Retrieval-Augmented Generation systems across six specific capabilities, revealing that current LLMs struggle with expert domains without external retrieval aid.

Core Problem

Existing RAG benchmarks rely predominantly on general knowledge (Wikipedia), which LLMs may have already memorized, so they fail to test whether a model truly depends on retrieval or can reason over expert-domain content.

Why it matters:

Expert applications (finance, law, enrollment) require privacy-sensitive or long-tail data not present in LLM training sets.
Current benchmarks like NQ or HotpotQA test commonsense or hot topics, masking the model's inability to handle structured or noisy domain-specific data.
Evaluating faithfulness to external documents is impossible if the model can answer from internal memory.

Concrete Example: When asked about a specific admission policy at a Chinese university, a standard LLM in the Close-book setting hallucinates or fails because the data is long-tail. Even with RAG, if the answer requires parsing an HTML table of admission scores, models often fail to read the table structure correctly, a task that is harder than answering from pure text.

Key Novelty

Domain-specific, Multi-faceted RAG Evaluation Benchmark

Constructs a dataset from a real-world, low-resource vertical (university enrollment) to ensure models cannot rely on parametric memory.
Decomposes RAG evaluation into six distinct capabilities: conversational intent, structural analysis (HTML), faithfulness, denoising, time-sensitivity, and multi-document integration.

Architecture

A conceptual diagram illustrating the six capabilities of RAG models evaluated in DomainRAG.

Evaluation Highlights

Retrieval-augmented settings significantly outperform closed-book LLMs on domain questions (e.g., Llama2-70B-chat jumps from ~3.6% to ~52.6% EM with Golden Reference).
HTML structural context improves performance over pure text for table-based questions (e.g., GPT-3.5 EM increases from 33.64 to 52.73 when using HTML).
Performance drops significantly in multi-document settings; for GPT-3.5, EM is 52.00 on single-document extractive tasks but falls when answers must integrate multiple sources (the paper's experiments report the exact multi-document figures).

Breakthrough Assessment

7/10

Provides a valuable, necessary shift from Wikipedia-based benchmarks to true domain-specific evaluation, with a comprehensive breakdown of RAG sub-skills. However, it is limited to the Chinese language and a single domain (enrollment).

⚙️ Technical Details

Problem Definition

Setting: Domain-specific Question Answering (QA) where answers must be derived from external retrieved documents because the knowledge is not in the model's weights.

Inputs: User query q and a set of retrieved/provided domain documents D (text or HTML).

Outputs: Answer a.

Pipeline Flow

Query Generation (ChatGPT/GPT-4)
Document Retrieval (BM25 or Dense)
Generation (LLM)
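
The retrieval and generation steps of this flow can be sketched as a retrieve-then-generate loop. This is a minimal sketch under stated assumptions, not the authors' code: `retrieve`, `build_prompt`, `answer`, and `echo` are hypothetical names, the character-overlap scorer only stands in for BM25 or a dense retriever, the prompt wording is invented, and `generate` is any callable that wraps an LLM.

```python
from typing import Callable, List


def retrieve(query: str, corpus: List[str], k: int = 2) -> List[str]:
    """Toy retriever: rank passages by how many distinct characters they share with the query.

    Stand-in for BM25 or a dense encoder (an assumption for illustration);
    character overlap is used only because it needs no tokenizer.
    """
    q_chars = set(query)
    ranked = sorted(corpus, key=lambda p: len(q_chars & set(p)), reverse=True)
    return ranked[:k]


def build_prompt(query: str, docs: List[str]) -> str:
    """Concatenate the retrieved passages and the question into one prompt."""
    context = "\n".join(f"[{i + 1}] {d}" for i, d in enumerate(docs))
    return f"References:\n{context}\n\nQuestion: {query}\nAnswer:"


def answer(query: str, corpus: List[str],
           generate: Callable[[str], str], k: int = 2) -> str:
    """Retrieve the top-k passages, then let the generator answer from them."""
    return generate(build_prompt(query, retrieve(query, corpus, k)))


def echo(prompt: str) -> str:
    """Placeholder generator: returns the first reference line of the prompt."""
    return prompt.splitlines()[1]


# Invented toy corpus, not taken from the benchmark.
corpus = [
    "The 2023 admission score for physics was 650.",
    "The campus library opens at 8 am.",
    "Dormitories are assigned in August.",
]
result = answer("What was the 2023 admission score for physics?", corpus, echo, k=1)
```

Passing the generator in as a callable keeps the retriever and the LLM swappable, which mirrors how the benchmark varies both components independently.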

System Modules

Data Generator

Generate synthetic QA pairs from crawled domain corpora (enrollment websites)

Model or implementation: ChatGPT and GPT-4

Retriever

Fetch relevant documents for the user query

Model or implementation: BM25 (sparse) or BGE-base-zh-v1.5 (dense)
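
To make the sparse option concrete, here is a minimal sketch of Okapi BM25 scoring in pure Python. It is illustrative only: the function name `bm25_scores`, the defaults `k1=1.5` and `b=0.75`, the smoothed IDF variant, and the whitespace tokenization of the English toy data are assumptions, not details from the paper (Chinese text would need a word segmenter).

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: List[str], docs: List[List[str]],
                k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score every tokenized document against a tokenized query with Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        # Length normalization: longer-than-average documents are penalized.
        norm = k1 * (1 - b + b * len(doc) / avg_len)
        score = 0.0
        # Simplification: each distinct query term is counted once.
        for term in set(query):
            if term not in tf:
                continue
            # Smoothed IDF; the "+ 1" inside the log keeps it non-negative.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + norm)
        scores.append(score)
    return scores


# Invented toy documents, not taken from the benchmark.
docs = [
    "physics admission score 2023".split(),
    "library opening hours".split(),
    "admission policy for international students".split(),
]
scores = bm25_scores("physics admission score".split(), docs)
best = max(range(len(docs)), key=scores.__getitem__)
```

Because the score depends on exact term matches, a query naming a specific major or year rewards passages that contain those same tokens.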

Generator

Produce answer based on retrieved/provided context

Model or implementation: Llama2 (7B/13B/70B), Baichuan2, ChatGLM2, GPT-3.5

Novel Architectural Elements

Six-dimensional evaluation framework: Conversational, Structural, Faithful, Denoising, Time-sensitive, Multi-document interaction capabilities.

Modeling

Base Model: Evaluated multiple models: Llama2-7B/13B/70B-chat, Baichuan2-7B/33B, ChatGLM2-6B, GPT-3.5-turbo

Comparison to Prior Work

vs. Chen et al. (2024): DomainRAG uses domain-specific data (university enrollment) instead of Wikipedia to avoid data contamination/memorization.
vs. General RAG Benchmarks: Explicitly includes HTML structural analysis and Time-sensitive QA specific to vertical domains.

Limitations

Limited to a single domain (University Enrollment) and language (Chinese).
Reliance on proprietary models (GPT-4/ChatGPT) for data generation may introduce biases.
Fixed static dataset makes it hard to test real-time update capabilities without manual intervention.

Reproducibility

Code: https://github.com/ShootingWong/DomainRAG

Datasets and code are publicly available at https://github.com/ShootingWong/DomainRAG. The dataset construction involved crawling 1,686 web pages and processing them into 14,406 passages.

📊 Experiments & Results

Evaluation Setup

QA on university enrollment domain under Close-book, Golden reference, and Retrieved reference settings.
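
The three settings differ only in which references accompany the question. A minimal sketch, assuming a hypothetical `build_input` helper and an invented prompt template (the paper's actual prompts are not reproduced here):

```python
from typing import List, Optional


def build_input(question: str, setting: str,
                golden: Optional[List[str]] = None,
                retrieved: Optional[List[str]] = None) -> str:
    """Assemble the model input for one of the three evaluation settings."""
    if setting == "close-book":
        refs: List[str] = []            # no external knowledge at all
    elif setting == "golden":
        refs = list(golden or [])       # annotated ground-truth passages
    elif setting == "retrieved":
        refs = list(retrieved or [])    # whatever the retriever returned
    else:
        raise ValueError(f"unknown setting: {setting}")
    if not refs:
        return f"Question: {question}\nAnswer:"
    context = "\n".join(refs)
    return f"References:\n{context}\n\nQuestion: {question}\nAnswer:"


close_book = build_input("Q1", "close-book", golden=["GOLD_DOC"])
golden_ref = build_input("Q1", "golden", golden=["GOLD_DOC"])
retrieved_ref = build_input("Q1", "retrieved", retrieved=["RETR_DOC"])
```

Holding the question fixed while swapping the reference source is what lets the gap between settings be read as the contribution of retrieval.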

Benchmarks:

Extractive QA (Standard RAG QA) [New]
Structural QA (Table/HTML understanding) [New]
Conversational QA (Multi-turn QA) [New]
Multi-doc QA (Multi-document aggregation) [New]
Faithful QA (Hallucination/Faithfulness test) [New]
Time-sensitive QA (Temporal reasoning) [New]

Metrics:

EM (Exact Match - containment)
EMS (Strict Exact Match)
F1
Rouge-L
GE (GPT-4 Evaluation)
Statistical methodology: Not explicitly reported in the paper
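
The string-based metrics above can be sketched as follows. This is an illustrative sketch rather than the benchmark's scoring script: the function names, the whitespace tokenization in the usage lines, and the choice of a balanced F-measure for Rouge-L are assumptions (Chinese answers would more likely be compared at the character or segmented-word level). GE is omitted because it requires a GPT-4 call.

```python
from collections import Counter
from typing import List


def em(pred: str, gold: str) -> float:
    """EM as containment: 1.0 if the gold answer appears inside the prediction."""
    return float(gold.strip() in pred)


def ems(pred: str, gold: str) -> float:
    """Strict EM: 1.0 only if prediction and gold answer are identical."""
    return float(pred.strip() == gold.strip())


def f1(pred: List[str], gold: List[str]) -> float:
    """Token-level F1 over the multiset overlap of prediction and gold tokens."""
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


def lcs_len(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence (row-by-row dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l(pred: List[str], gold: List[str]) -> float:
    """Rouge-L as the balanced F-measure of LCS-based precision and recall."""
    lcs = lcs_len(pred, gold)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(pred), lcs / len(gold)
    return 2 * precision * recall / (precision + recall)


# Invented example: a verbose but correct prediction.
pred, gold = "The score is 650 points", "650"
em_score, ems_score = em(pred, gold), ems(pred, gold)
f1_score = f1(pred.split(), gold.split())
rl_score = rouge_l(pred.split(), gold.split())
```

The example shows why both EM variants are reported: a verbose answer that contains the gold string passes EM but fails EMS, and its F1 is pulled down by the extra tokens.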

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Comparison of Close-book vs. Golden Reference settings highlights the necessity of RAG in domain-specific tasks where internal knowledge is insufficient.
Extractive QA	EM (Exact Match)	3.60	52.60	+49.00
Extractive QA	EM (Exact Match)	20.80	73.00	+52.20
Structural QA experiments compare feeding raw text versus HTML code to the model, showing benefits of preserving structure.
Structural QA	EM (Exact Match)	33.64	52.73	+19.09
Structural QA	EM (Exact Match)	11.82	24.55	+12.73
Retrieval performance comparison shows sparse retrieval (BM25) often outperforming dense retrieval (BGE) in this specific domain.
Extractive QA	EM (Exact Match)	32.00	41.00	+9.00
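
As a quick consistency check, the Δ column can be recomputed as the second value minus the first, in EM points. The snippet below is only a sketch that copies the five rows above; `rows`, `delta`, and `mismatches` are hypothetical names.

```python
# (first value, second value, reported delta) copied from the rows above.
rows = [
    (3.60, 52.60, 49.00),
    (20.80, 73.00, 52.20),
    (33.64, 52.73, 19.09),
    (11.82, 24.55, 12.73),
    (32.00, 41.00, 9.00),
]


def delta(baseline: float, compared: float) -> float:
    """Difference in EM points, rounded to two decimals to absorb float error."""
    return round(compared - baseline, 2)


# Rows whose recomputed delta disagrees with the reported one.
mismatches = [r for r in rows if delta(r[0], r[1]) != r[2]]
```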

Main Takeaways

Closed-book LLMs fail significantly on domain-specific questions (college enrollment), confirming the need for RAG in expert domains.
Preserving HTML structure (tables) in the context window improves accuracy substantially over flattening the content into pure text (e.g., GPT-3.5 EM on Structural QA rises from 33.64 to 52.73).
BM25 (sparse retrieval) often outperforms dense retrieval in this specific domain, possibly due to the precise entity matching required for enrollment policies.
Models struggle with multi-document integration and time-sensitive questions, indicating these are key areas for future RAG improvement.
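
To illustrate what is lost when a table is flattened into pure text, the sketch below parses a small HTML table with Python's standard `html.parser` and contrasts the flattened string with a structured lookup. The table contents, the class name `TableParser`, and the `lookup` layout are invented for illustration and do not come from the benchmark.

```python
from html.parser import HTMLParser
from typing import List


class TableParser(HTMLParser):
    """Collect the cell text of every <tr> into a list of rows."""

    def __init__(self) -> None:
        super().__init__()
        self.rows: List[List[str]] = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])          # start a new row
        elif tag in ("td", "th"):
            self._in_cell = True
            self.rows[-1].append("")      # start a new cell in the current row

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data.strip()


# Invented toy table (well-formed: every cell sits inside a row).
html = (
    "<table><tr><th>Major</th><th>2022</th><th>2023</th></tr>"
    "<tr><td>Physics</td><td>640</td><td>650</td></tr>"
    "<tr><td>History</td><td>610</td><td>615</td></tr></table>"
)
parser = TableParser()
parser.feed(html)

# Pure-text view: tags dropped, cells run together with no row or column cues.
flat_text = " ".join(cell for row in parser.rows for cell in row)

# Structured view: the header row lets a cell be addressed by (major, year).
header, *body = parser.rows
lookup = {row[0]: dict(zip(header[1:], row[1:])) for row in body}
```

In the flattened string, nothing ties "650" to Physics or to 2023, whereas the structured view makes that association explicit; this is one plausible reading of why HTML input helps on table questions.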

📚 Prerequisite Knowledge

Prerequisites

Retrieval-Augmented Generation (RAG) concepts
Evaluation metrics for QA (EM, F1, Rouge)
Basic understanding of HTML structure vs. plain text

Key Terms

RAG: Retrieval-Augmented Generation. AI systems that answer questions by first retrieving relevant documents and then generating an answer conditioned on them.

Close-book: Evaluating an LLM's ability to answer questions relying solely on its internal pre-trained knowledge without external documents.

Golden Reference: The ground-truth document that contains the correct answer, provided directly to the model to test its reasoning upper bound.

BM25: A probabilistic ranking function that scores documents by how often they contain the query terms, weighted by each term's rarity across the corpus and normalized by document length.

EM: Exact Match. In this benchmark, it checks whether the prediction contains the ground-truth answer; the stricter EMS variant requires the prediction to be identical to it.

Faithfulness: The ability of the model to stick to the provided external context rather than hallucinating or relying on internal (potentially outdated) memory.

Time-sensitive QA: Questions where the correct answer depends on a specific timestamp (e.g., admission scores for 2023 vs. 2024).

Noise Ratio: The proportion of irrelevant documents mixed with relevant ones to test the model's robustness.
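
A reference list with a given noise ratio could be assembled roughly as follows. This is a sketch under assumptions: `mix_with_noise` is a hypothetical helper, and treating the ratio as the share of irrelevant documents in the final list (with a fixed list size and seeded shuffling) is one reasonable reading of the definition, not necessarily the benchmark's exact procedure.

```python
import random
from typing import List


def mix_with_noise(golden: List[str], distractors: List[str],
                   noise_ratio: float, total: int, seed: int = 0) -> List[str]:
    """Build a list of `total` docs in which `noise_ratio` of them are irrelevant."""
    if not 0.0 <= noise_ratio <= 1.0:
        raise ValueError("noise_ratio must be in [0, 1]")
    n_noise = round(total * noise_ratio)
    n_gold = total - n_noise
    if n_gold > len(golden) or n_noise > len(distractors):
        raise ValueError("not enough documents for the requested mix")
    rng = random.Random(seed)             # seeded for reproducible mixes
    docs = golden[:n_gold] + rng.sample(distractors, n_noise)
    rng.shuffle(docs)                     # avoid leaking relevance through position
    return docs


# Invented placeholder documents.
golden = ["g1", "g2"]
distractors = ["n1", "n2", "n3", "n4", "n5"]
mixed = mix_with_noise(golden, distractors, noise_ratio=0.5, total=4)
```

Sweeping `noise_ratio` from 0 to 1 while keeping `total` fixed gives a simple way to chart how answer quality degrades as distractors crowd out the relevant passages.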