Generating diverse Q&A benchmarks for RAG evaluation with DataMorgana

📝 Paper Summary

Modularized RAG pipeline, Evaluation methodology

DataMorgana generates highly diverse synthetic RAG benchmarks by combining customizable user personas and question categories to simulate realistic traffic patterns rather than relying on generic LLM generations.

Core Problem

Existing synthetic question generation methods produce monotonous benchmarks that lack the diversity of real user interactions, failing to reflect actual traffic patterns.

Why it matters:

Indiscriminate LLM generation leads to benchmarks that do not cover the different ways end-users interact with RAG systems
Lack of diversity in evaluation sets risks overfitting RAG solutions to specific question types while failing on others
Real query logs are often unavailable for new or specialized domains, making high-quality synthetic data essential

Concrete Example: Standard methods might repeatedly generate simple factoid questions from a document. In contrast, a real 'clinical researcher' user might ask for a comparison of trends, while a 'patient' might ask for basic symptom checking—a distinction missed by generic generators.

Key Novelty

Combinatorial Category-Driven Generation

Defines mutually exclusive categories for both 'users' (e.g., expert, novice) and 'questions' (e.g., factoid, reasoning) via natural language descriptions
Systematically samples combinations of these categories to prompt the LLM, enforcing diversity through explicit constraints rather than relying on the LLM's unprompted variation
Allows non-technical users to configure distribution probabilities for each category to match expected real-world traffic
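
As a rough illustration of the combinatorial sampling described above, the sketch below draws one category per categorization, weighted by its configured probability. The category names, descriptions, probabilities, and the `sample_combination` helper are all hypothetical and not taken from the paper.

```python
import random

# Hypothetical categorizations; names, descriptions, and probabilities are
# illustrative assumptions, not values from the paper.
CATEGORIZATIONS = {
    "user": [
        {"name": "expert", "description": "a specialist who knows the domain", "probability": 0.3},
        {"name": "novice", "description": "a newcomer with no background", "probability": 0.7},
    ],
    "question": [
        {"name": "factoid", "description": "asks for a short, concrete fact", "probability": 0.6},
        {"name": "reasoning", "description": "requires combining several facts", "probability": 0.4},
    ],
}


def sample_combination(categorizations, rng):
    """Pick one category per categorization, weighted by its probability."""
    combo = {}
    for axis, categories in categorizations.items():
        weights = [category["probability"] for category in categories]
        combo[axis] = rng.choices(categories, weights=weights, k=1)[0]
    return combo


rng = random.Random(0)  # seeded so the run is reproducible
draws = [sample_combination(CATEGORIZATIONS, rng) for _ in range(10_000)]
novice_share = sum(d["user"]["name"] == "novice" for d in draws) / len(draws)
print(f"novice share: {novice_share:.2f}")  # should land near the configured 0.7
```

Because each categorization is sampled independently, changing one probability shifts the traffic mix without touching any prompt text.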

Architecture

[Figure: The generation workflow of DataMorgana]

Evaluation Highlights

Produces higher lexical, syntactic, and semantic diversity than the Vanilla, Know Your RAG, and DeepEval baselines across multiple metrics
Demonstrates effectiveness on both domain-specific (CORD-19) and general-knowledge (Wikipedia) corpora
Achieves high fidelity in manual validation, with near-perfect relevance and text quality for generated questions

Breakthrough Assessment

7/10

Strong methodological contribution for evaluation. While not a new model architecture, it addresses a critical gap in RAG evaluation (benchmark diversity) with a flexible, user-centric approach.

⚙️ Technical Details

Problem Definition

Setting: Synthetic generation of Question-Answer (Q, A) pairs from a corpus of documents D

Inputs: A document collection D and a configuration file specifying user/question categories

Outputs: A synthetic benchmark set of (Question, Answer) pairs

Pipeline Flow

Configuration (User inputs JSON defining categories)
Prompt Instantiation (System selects category combinations)
Generation (LLM creates candidate pairs)
Filtering (System validates candidates)

System Modules

Configuration Loader

Parses JSON configuration defining user personas (e.g., 'Student', 'Expert') and question types (e.g., 'Reasoning', 'Factoid') with associated probabilities

Model or implementation: Rule-based
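
A minimal sketch of what such a rule-based loader could look like. The JSON schema here (the `user_categories` and `question_categories` keys and their fields) is an assumption made for illustration; the paper's actual configuration format may differ.

```python
import json
import math

# Hypothetical schema: the field names below are illustrative, not the paper's.
CONFIG_TEXT = """
{
  "user_categories": [
    {"name": "Student", "description": "learning the topic for the first time", "probability": 0.6},
    {"name": "Expert", "description": "works in the field professionally", "probability": 0.4}
  ],
  "question_categories": [
    {"name": "Factoid", "description": "asks for a concise fact", "probability": 0.5},
    {"name": "Reasoning", "description": "requires combining several facts", "probability": 0.5}
  ]
}
"""


def load_config(text):
    """Parse the JSON config and check each category list is a distribution."""
    config = json.loads(text)
    for key in ("user_categories", "question_categories"):
        categories = config.get(key)
        if not categories:
            raise ValueError(f"{key} must be a non-empty list")
        total = sum(category["probability"] for category in categories)
        if not math.isclose(total, 1.0, abs_tol=1e-6):
            raise ValueError(f"{key} probabilities sum to {total}, expected 1.0")
    return config


config = load_config(CONFIG_TEXT)
print([category["name"] for category in config["user_categories"]])
```

Validating that probabilities sum to 1 at load time catches configuration mistakes before any LLM calls are made.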

Prompt Instantiator (Generation)

Constructs a dynamic prompt by sampling a document d_i, a user category u_i, and a question category c_j based on defined probabilities

Model or implementation: Rule-based
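
One way the prompt instantiation step might be written is shown below. The template wording and the `instantiate_prompt` helper are assumptions; the summary does not reproduce the paper's actual prompt. The sample document sentence is illustrative only.

```python
# Hypothetical template; the paper's real prompt text is not given in this summary.
PROMPT_TEMPLATE = (
    "You are simulating a user of a question-answering system.\n"
    "User profile ({user_name}): {user_description}\n"
    "Question type ({question_name}): {question_description}\n"
    "Write {k} question-answer pairs that this user could ask and that can be\n"
    "answered using only the document below. Each question must be\n"
    "understandable without seeing the document.\n\n"
    "Document:\n{document}\n"
)


def instantiate_prompt(document, user_category, question_category, k=1):
    """Fill the template with a document and one sampled category per axis."""
    return PROMPT_TEMPLATE.format(
        user_name=user_category["name"],
        user_description=user_category["description"],
        question_name=question_category["name"],
        question_description=question_category["description"],
        k=k,
        document=document,
    )


prompt = instantiate_prompt(
    document="Coronaviruses are enveloped RNA viruses.",
    user_category={"name": "Patient", "description": "has no medical training"},
    question_category={"name": "Factoid", "description": "asks for a concise fact"},
    k=2,
)
print(prompt)
```

Passing the category descriptions, and not just their names, into the prompt is what lets plain natural-language definitions steer the generated questions.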

LLM Generator (Generation)

Generates k candidate (question, answer) pairs based on the instantiated prompt

Model or implementation: Claude-3.5 Sonnet v2 (in experiments)

Filter

Verifies candidates against constraints (context-free, faithful to document, matches categories)

Model or implementation: LLM-based verification (implicit in description)
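
The filtering stage can be thought of as a list of pass/fail checks applied to each candidate. The sketch below is an assumption about how that might be structured: the regular expression and the substring test are crude stand-ins for the LLM-based verification, and the category-match check is omitted because it would need a model call.

```python
import re

# Heuristic stand-in: flags questions that point at "the document" and so
# cannot be understood on their own.
CONTEXT_REFERENCE = re.compile(
    r"\b(?:the|this) (?:document|passage|text|article)\b", re.IGNORECASE
)


def is_context_free(candidate):
    """Reject questions that only make sense with the source document in view."""
    return CONTEXT_REFERENCE.search(candidate["question"]) is None


def make_faithfulness_check(document):
    """Crude stand-in for an LLM judge: the answer must appear in the document."""
    lowered = document.lower()

    def check(candidate):
        return candidate["answer"].lower() in lowered

    return check


def filter_candidates(candidates, checks):
    """Keep only the candidates that pass every check."""
    return [c for c in candidates if all(check(c) for check in checks)]


document = "Coronaviruses are enveloped RNA viruses."
candidates = [
    {"question": "What kind of viruses are coronaviruses?", "answer": "enveloped RNA viruses"},
    {"question": "What does this document say about viruses?", "answer": "enveloped RNA viruses"},
    {"question": "Are coronaviruses DNA viruses?", "answer": "Yes, DNA viruses"},
]
kept = filter_candidates(candidates, [is_context_free, make_faithfulness_check(document)])
print([c["question"] for c in kept])
```

Treating each constraint as a separate callable makes it straightforward to swap a heuristic for an LLM judge without changing the filtering loop.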

Novel Architectural Elements

Two-stage configuration-then-generation workflow driven by combinatorial category sampling
Explicit injection of user persona and question type descriptions into generation prompts to force diversity

Modeling

Base Model: Claude-3.5 Sonnet v2

📊 Experiments & Results

Evaluation Setup

Synthetic question generation from two corpora (Medical and General Knowledge)

Benchmarks:

CORD-19: domain-specific (medical) Q&A generation
Wikipedia (NQ subset): general-knowledge Q&A generation

Metrics:

Lexical diversity
Syntactic diversity
Semantic diversity
Statistical methodology: Not explicitly reported in the paper
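
This summary names the diversity dimensions but not the formulas behind them. As one common proxy for lexical diversity (an assumption, not necessarily the metric used in the paper), distinct-n measures the fraction of n-grams in a question set that are unique.

```python
def distinct_n(texts, n=2):
    """Fraction of n-grams across all texts that are unique (higher = more diverse)."""
    total = 0
    unique = set()
    for text in texts:
        tokens = text.lower().split()
        # Slide a window of length n over the tokens.
        ngrams = list(zip(*(tokens[i:] for i in range(n))))
        total += len(ngrams)
        unique.update(ngrams)
    return len(unique) / total if total else 0.0


# Illustrative question sets, not drawn from the paper's benchmarks.
monotonous = [
    "what is the capital of france",
    "what is the capital of spain",
    "what is the capital of italy",
]
varied = [
    "what is the capital of france",
    "how do vaccines train the immune system",
    "compare rainfall trends in spain and italy",
]
print(distinct_n(monotonous), distinct_n(varied))
```

The templated set scores lower because its questions share most of their bigrams, which is the kind of monotony the diversity metrics are meant to expose.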

Main Takeaways

DataMorgana produces questions with higher lexical, syntactic, and semantic diversity than the Vanilla, Know Your RAG, and DeepEval baselines.
The tool successfully adapts to domain-specific requirements (e.g., defining 'Patients' vs 'Doctors' for CORD-19) via simple configuration changes.
Manual annotation confirms high fidelity (relevance/correctness) of individual questions, validating that increased diversity does not come at the cost of quality.

📚 Prerequisite Knowledge

Prerequisites

Understanding of RAG (Retrieval-Augmented Generation) pipelines
Familiarity with LLM-based synthetic data generation
Basic concepts of evaluation metrics (fidelity, diversity)

Key Terms

RAG: Retrieval-Augmented Generation. AI systems that answer questions by first retrieving relevant documents and then generating an answer grounded in them

Synthetic Data: Data artificially generated by AI models (often LLMs) to train or evaluate other models when real data is scarce

Fidelity: The quality of synthetic samples—whether the generated questions are fluent, coherent, and relevant

Diversity: The extent to which synthetic samples cover the full variability of potential real-world inputs

CORD-19: COVID-19 Open Research Dataset—a corpus of scientific papers about COVID-19

Factoid question: A question that can be answered with a concise fact or short statement

NQ dataset: Natural Questions dataset—a benchmark consisting of real user questions issued to Google Search