CoFE-RAG: A comprehensive full-chain evaluation framework for retrieval-augmented generation with enhanced data diversity

📝 Paper Summary

RAG Evaluation, RAG Benchmarks

CoFE-RAG evaluates the entire RAG pipeline using a new diverse dataset and multi-granularity keywords to assess retrieval without relying on rigid golden chunk annotations.

Core Problem

Existing RAG evaluations draw on data with limited diversity (mostly plain text and factual queries) and use 'golden chunk' metrics that break when chunking strategies change, making it hard to diagnose specific pipeline failures.

Why it matters:

Current benchmarks like RAGAS or RGB focus on simple factual queries, failing to test complex analytical or tutorial reasoning required in real-world applications
Relying on annotated 'golden chunks' requires labor-intensive relabeling whenever the chunking strategy (e.g., size, overlap) is modified
End-to-end evaluation obscures whether errors stem from poor retrieval, bad reranking, or hallucinating generators

Concrete Example: If a user asks an analytical question like 'What level of innovation will China's intelligent cars reach by 2025?', standard metrics might miss relevant context if the chunking size changes and the text no longer aligns with pre-annotated 'golden chunks'. CoFE-RAG uses keyword constraints to match content regardless of chunk boundaries.
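The fragility described in this example can be illustrated with a minimal sketch. The document text, the word-based chunker, and the keyword list below are hypothetical stand-ins invented for illustration; they are not taken from the CoFE-RAG dataset or code.

```python
def chunk_words(text: str, size: int) -> list[str]:
    """Split text into consecutive chunks of `size` whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def golden_hit(chunks: list[str], golden_chunk: str) -> bool:
    """Golden-chunk style check: the annotated chunk must appear verbatim."""
    return golden_chunk in chunks


def keyword_hit(chunks: list[str], keywords: list[str]) -> bool:
    """Keyword style check: every keyword must appear in at least one chunk."""
    return all(any(kw in c for c in chunks) for kw in keywords)


# Hypothetical toy document (not from the benchmark).
DOC = (
    "The report reviews the smart mobility market. "
    "By 2025 China's intelligent cars should reach a world-leading level of innovation. "
    "Sales figures appear in a later section."
)

# Suppose annotators labeled the golden chunk when the chunk size was 12 words.
GOLDEN = chunk_words(DOC, 12)[1]
KEYWORDS = ["world-leading", "2025"]  # hypothetical keyword annotations

for size in (12, 6):
    chunks = chunk_words(DOC, size)
    print(size, golden_hit(chunks, GOLDEN), keyword_hit(chunks, KEYWORDS))
```

With 12-word chunks both checks succeed. With 6-word chunks the annotated golden chunk no longer exists as a unit, while the keyword check still succeeds. A keyword phrase that happens to straddle a chunk boundary could still be missed by simple substring matching, so this sketch shows the reduced dependence on boundaries rather than full immunity to them.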

Key Novelty

Multi-Granularity Keyword Evaluation for Golden-Chunk-Free Retrieval Assessment

Replaces static 'golden chunk' annotations with dynamic keyword-based matching: coarse-grained keywords filter for relevance, while fine-grained keyword lists check for specific information points
Evaluates the RAG pipeline step-by-step (chunking, retrieval, reranking, generation) rather than just the final answer
Introduces a benchmark with diverse source formats (PDF, PPT, Excel) and query types (Comparative, Analytical, Tutorial) beyond simple facts

Architecture

The workflow of the CoFE-RAG framework, showing the pipeline from query and document input to multi-granularity keyword evaluation.

Evaluation Highlights

BGE-Large outperforms other embedding models on the new benchmark but still struggles with Tutorial queries (54.7% Accuracy vs 63.8% for Factual)
GPT-4o outperforms the other LLMs in generation, achieving a Correctness Score of 4.08 (on a 1-5 scale), compared to 3.77 for Qwen2-7B
Larger chunk sizes (512 tokens) generally improve performance across retrieval and generation compared to smaller sizes (128/256 tokens) on this dataset

Breakthrough Assessment

7/10

Strong contribution to RAG evaluation methodology by removing the dependency on fixed chunk annotations and introducing diverse document formats. The dataset is valuable, though the core innovation is primarily in the evaluation metric design.

⚙️ Technical Details

Problem Definition

Setting: Full-chain evaluation of RAG systems encompassing chunking, retrieval, reranking, and generation stages

Inputs: A collection of diverse documents (PDF, PPT, etc.) and a query q

Outputs: Evaluation metrics for each stage (Recall/Accuracy for retrieval, correctness scores for generation)

Pipeline Flow

Data Construction (Document Parsing → Synthetic Generation → Manual Review)
Chunking (Split documents)
Retrieval (Embedding + Similarity Search)
Reranking (Re-order top-k)
Generation (LLM produces answer)
Evaluation (Multi-granularity keywords for retrieval; LLM-as-judge for generation)
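The stages above can be sketched as a small pipeline that keeps every intermediate output, so that each stage can be scored separately. This is a toy illustration under stated assumptions: the bag-of-words cosine retriever, the term-overlap reranker, the stub generator, and all function names are hypothetical and stand in for the embedding models, rerankers, and LLMs the paper evaluates.

```python
import math
from collections import Counter
from dataclasses import dataclass


@dataclass
class Trace:
    """Intermediate outputs kept so each stage can be evaluated on its own."""
    chunks: list[str]
    retrieved: list[str]
    reranked: list[str]
    answer: str


def chunk(text: str, size: int, overlap: int = 0) -> list[str]:
    """Word-based sliding-window chunking with optional overlap."""
    step = size - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than size")
    words = text.split()
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the window already reached the end of the text
    return chunks


def _bow(text: str) -> Counter:
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: list[str], k: int) -> list[str]:
    """Toy retriever: rank chunks by bag-of-words cosine similarity."""
    q = _bow(query)
    return sorted(chunks, key=lambda c: cosine(q, _bow(c)), reverse=True)[:k]


def rerank(query: str, candidates: list[str], k: int) -> list[str]:
    """Toy reranker: rank by the number of distinct query terms present."""
    q_terms = set(query.lower().split())
    return sorted(candidates, key=lambda c: len(q_terms & set(c.lower().split())), reverse=True)[:k]


def generate(query: str, context: list[str]) -> str:
    """Stub generator: a real system would prompt an LLM with the context."""
    return " ".join(context)


def run_pipeline(query: str, document: str, chunk_size: int = 8,
                 top_k: int = 3, rerank_k: int = 2) -> Trace:
    chunks = chunk(document, chunk_size)
    retrieved = retrieve(query, chunks, top_k)
    reranked = rerank(query, retrieved, rerank_k)
    return Trace(chunks, retrieved, reranked, generate(query, reranked))
```

Returning a `Trace` rather than only the final answer reflects the full-chain idea: retrieval and reranking outputs can be checked against keywords, and the answer can be passed to a separate judge.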

System Modules

Data Construction

Generate queries and keywords from raw documents

Model or implementation: GPT-4 (Generator) + Human Annotators (Reviewer)

Evaluation Logic (Retrieval)

Assess retrieved chunks without fixed gold labels

Model or implementation: Deterministic Keyword Matching

Evaluation Logic (Generation)

Assess final answer quality

Model or implementation: LlamaIndex built-in evaluator (uses GPT-4)

Novel Architectural Elements

Multi-granularity keyword evaluation mechanism: Decouples retrieval evaluation from specific chunk boundaries by checking for the presence of key information spans (fine-grained keywords) within retrieved text, regardless of how that text was chunked.
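A minimal sketch of how such a keyword check might be computed is given below. The exact formulas are not stated in this summary, so the definitions here are assumptions: a chunk passes the coarse filter if it contains at least one coarse-grained keyword, a fine-grained list counts as covered if any of its keywords appears in a filtered chunk, Recall is the share of covered lists, and Accuracy is the share of retrieved chunks that pass the filter and contain at least one fine-grained keyword. The function name `evaluate_retrieval` is hypothetical.

```python
def _contains_any(text: str, keywords: list[str]) -> bool:
    """Case-insensitive substring match against any keyword."""
    low = text.lower()
    return any(kw.lower() in low for kw in keywords)


def evaluate_retrieval(retrieved: list[str],
                       coarse_keywords: list[str],
                       fine_keyword_lists: list[list[str]]) -> dict[str, float]:
    # Step 1 (assumed): coarse-grained keywords act as a relevance filter.
    candidates = [c for c in retrieved if _contains_any(c, coarse_keywords)]

    # Step 2 (assumed): a fine-grained list is covered if any of its
    # keywords appears in any chunk that survived the coarse filter.
    covered = sum(
        1 for kws in fine_keyword_lists
        if any(_contains_any(c, kws) for c in candidates)
    )
    recall = covered / len(fine_keyword_lists) if fine_keyword_lists else 0.0

    # Step 3 (assumed): a retrieved chunk is useful if it passed the filter
    # and contains at least one fine-grained keyword.
    all_fine = [kw for kws in fine_keyword_lists for kw in kws]
    useful = sum(1 for c in candidates if _contains_any(c, all_fine))
    accuracy = useful / len(retrieved) if retrieved else 0.0

    return {"recall": recall, "accuracy": accuracy}


# Hypothetical example data, not taken from the benchmark.
retrieved = [
    "China's intelligent cars should reach a world-leading level by 2025.",
    "Intelligent cars rely on sensors.",
    "The weather forecast for 2025 is uncertain.",
]
scores = evaluate_retrieval(
    retrieved,
    coarse_keywords=["intelligent cars"],
    fine_keyword_lists=[["world-leading"], ["2025"], ["battery cost"]],
)
print(scores)
```

Because the check operates on the text of whatever chunks were retrieved, the same annotations can be reused after the chunk size or overlap changes.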

Modeling

Base Model: Various (evaluated multiple models including GPT-4o, Qwen2, BGE-Large, etc.)

Training Method: Not applicable — this is an evaluation framework paper, not a model training paper.

Adaptation: None

Trainable Parameters: None

Compute: Not reported in the paper

Comparison to Prior Work

vs. RAGAS/ARES: CoFE-RAG provides a dataset with annotated references (keywords and reference answers) and diverse document types, whereas RAGAS is primarily reference-free.
vs. RGB/RECALL: CoFE-RAG evaluates the full chain (including chunking/reranking) and uses multi-granularity keywords, whereas others focus mainly on generation or end-to-end results.
vs. CRUD-RAG [not cited in paper]: CRUD-RAG categorizes RAG applications into Create, Read, Update, and Delete task types, while CoFE-RAG focuses on diverse document formats and complex query types.

Limitations

Evaluation relies on the quality of GPT-4 generated keywords and reference answers (though human reviewed).
The 'Correctness' metric for generation still relies on LLM-as-a-judge (GPT-4), which may have biases.
The benchmark is relatively small (2826 final queries) compared to massive pre-training datasets.

Reproducibility

Code: https://github.com/Alibaba-NLP/CoFE-RAG

The code is publicly available. The repository contains the benchmark dataset (2826 queries across Chinese and English) and evaluation code. Specific prompt templates for data generation are described in the paper.

📊 Experiments & Results

Evaluation Setup

Full-chain RAG evaluation on the CoFE-RAG dataset

Benchmarks:

CoFE-RAG Benchmark (RAG (Chunking, Retrieval, Reranking, Generation)) [New]

Metrics:

Recall (Retrieval)
Accuracy (Retrieval/Reranking)
BLEU/Rouge-L (Generation)
Faithfulness/Relevance/Correctness (Generation - LLM Judge)
Statistical methodology: Not explicitly reported in the paper
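For readers unfamiliar with Rouge-L, listed among the generation metrics above, the following sketch computes it as an F-measure over the longest common subsequence (LCS) of tokens. Whitespace tokenization and the balanced F1 weighting are assumptions made for illustration; the paper's exact tokenizer (which matters for Chinese text) and weighting are not given in this summary.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence, using a rolling DP row."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """Rouge-L as the harmonic mean of LCS-based precision and recall."""
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


print(rouge_l_f1("the cat sat on mat", "the cat is on the mat"))
```

In this example the LCS is "the cat on mat" (4 tokens), so precision is 4/5 and recall is 4/6.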

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Retrieval performance comparison of different embedding models on the Chinese subset of the dataset.
CoFE-RAG (Chinese)	Accuracy	0.5669	0.6720	+0.1051
CoFE-RAG (Chinese)	Recall	0.6080	0.7190	+0.1110
Reranking performance comparison using bge-large for initial retrieval.
CoFE-RAG (Chinese)	Accuracy	0.6231	0.6306	+0.0075
Generation performance comparison of LLMs using GPT-4o as the judge for Correctness Score (1-5 scale).
CoFE-RAG (Chinese)	Score	3.1175	4.0777	+0.9602
CoFE-RAG (Chinese)	Score	3.7699	4.0777	+0.3078

Experiment Figures

Comparison of BLEU, Rouge-L, and Correctness scores across different query types (Factual, Analytical, Comparative, Tutorial) for Qwen2-7B, Llama2-7B, and GPT-4.

Effect of chunk size (128, 256, 512) on Accuracy (Retrieval/Reranking) and BLEU (Generation).

Main Takeaways

Retrieval models struggle significantly with complex query types: Analytical, Comparative, and Tutorial queries consistently show lower accuracy than Factual queries across all models.
Existing reranking methods provide limited improvement (or even degradation in recall) compared to utilizing all initially retrieved results, suggesting current rerankers may miss relevant chunks in complex scenarios.
Data diversity matters: The dataset's inclusion of PDF/PPT/Excel formats exposes weaknesses in systems optimized for simple plain text.
Larger chunks (512 tokens) consistently outperform smaller chunks (128/256) for both retrieval accuracy and generation quality on this benchmark.

📚 Prerequisite Knowledge

Prerequisites

RAG pipeline components (Chunking, Embedding, Reranking, Generation)
Information Retrieval metrics (Recall, Hit Rate)
LLM evaluation techniques (Reference-based vs. Reference-free)

Key Terms

Coarse-grained keywords: Representative words extracted from query/context used as an initial filter for retrieved chunk relevance (e.g., 'intelligent cars')

Fine-grained keywords: Lists of specific information points (spans of text) extracted from context that must be present to answer the query accurately

Golden chunks: Pre-annotated specific text segments considered the 'correct' retrieval target; fragile because they become invalid if chunking strategies change

Factual query: A query seeking specific, clear facts or evidence (e.g., 'What is the capital of the US?')

Analytical query: A query seeking analysis of concepts or terms (e.g., 'Why is the earth warming?')

Comparative query: A query seeking comparisons across dimensions (e.g., 'Differences between A and B?')

Tutorial query: A query seeking steps to perform a task (e.g., 'Steps to install TensorFlow?')

Pass Rate: The fraction of generated answers to which the LLM evaluator assigns a correctness score greater than or equal to 4 (on a 1-5 scale)
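The Pass Rate definition translates directly into a short computation. The scores below are hypothetical judge outputs invented for illustration, and the function name `pass_rate` is not from the paper's code.

```python
def pass_rate(scores: list[float], threshold: float = 4.0) -> float:
    """Fraction of answers whose correctness score is at least `threshold`."""
    if not scores:
        raise ValueError("scores must not be empty")
    return sum(1 for s in scores if s >= threshold) / len(scores)


# Hypothetical correctness scores on the 1-5 scale.
example_scores = [5, 4, 3, 2, 4]
print(pass_rate(example_scores))
```

Three of the five example scores meet the threshold, giving a Pass Rate of 0.6.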