DO-RAG: A Domain-Specific QA Framework Using Knowledge Graph-Enhanced Retrieval-Augmented Generation

📝 Paper Summary

Modularized RAG pipeline, Agentic RAG pipeline, Graph-based RAG pipeline

DO-RAG automates domain-specific QA by building dynamic multimodal knowledge graphs via agents and fusing graph traversal with vector search to ground answers and mitigate hallucinations.

Core Problem

Standard RAG systems struggle with the complex, heterogeneous data in technical domains, leading to fragmented retrieval and hallucinations due to a lack of structured reasoning.

Why it matters:

Generic models often fail on domain-specific terminology or multi-step reasoning required for technical manuals and logs.
Existing KG-RAG hybrids face scalability bottlenecks because manual graph construction is labor-intensive and hard to maintain as knowledge evolves.
Loosely coupled retrieval and generation components cannot guarantee that final answers faithfully reflect the retrieved technical evidence.

Concrete Example: When answering a query about a specific database error from a technical manual, a standard RAG might retrieve loosely related text chunks but miss the causal relationship between the error code and a specific configuration parameter, leading to an incorrect diagnosis.

Key Novelty

Agentic Chain-of-Thought for Dynamic KG Construction & Hybrid Fusion

Uses a hierarchical team of agents (High, Mid, Low, Covariate) to automatically extract entities and relationships from unstructured multimodal documents into a Knowledge Graph.
Integrates retrieval by using graph traversal to find structured context, which is then used to refine the query for a subsequent vector search.
Implements a post-generation refinement step that explicitly cross-verifies the LLM's output against the graph evidence to correct hallucinations.

Architecture

Figure: The complete DO-RAG workflow, from document ingestion to answer generation.

Evaluation Highlights

Achieved nearly 1.0 Contextual Recall and over 94% Answer Relevancy on the SunDB and Electrical domain benchmarks.
Outperformed FastGPT, TiDB.AI, and Dify.AI by up to 33.38% in composite scores.
DeepSeek-V3 with DO-RAG improved Answer Relevancy by 5.7% compared to vector-only retrieval baselines.

Breakthrough Assessment

8/10

Strong engineering integration of agentic KG construction with RAG. While the components are known, the automated end-to-end pipeline and significant performance gains over industrial baselines make it a practical breakthrough for domain-specific applications.

⚙️ Technical Details

Problem Definition

Setting: Domain-specific Question Answering using a document corpus D and a Knowledge Graph G

Inputs: Natural language query q

Outputs: Generated answer A grounded in D and G

Pipeline Flow

Document Ingestion -> KG Construction (Agents)
Query Decomposition -> Intent Analysis
Graph Retrieval -> Query Refinement
Vector Retrieval (using refined query)
Prompt Construction -> Generation -> Refinement
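
The query-time part of this flow can be sketched as a chain of pluggable stages. This is a minimal illustration, not the paper's implementation: the class name `DoRagPipeline`, the stage signatures, and the toy stand-ins in `toy_pipeline` are all hypothetical, and document ingestion with KG construction is assumed to have run offline beforehand.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class DoRagPipeline:
    """Wires the query-time stages together; every stage is a pluggable callable."""
    analyze_intent: Callable[[str], List[str]]        # query -> sub-queries
    retrieve_graph: Callable[[List[str]], List[str]]  # sub-queries -> KG facts
    refine_query: Callable[[str, List[str]], str]     # query + KG facts -> refined query
    retrieve_vectors: Callable[[str], List[str]]      # refined query -> text chunks
    generate: Callable[[str], str]                    # prompt -> draft answer
    refine_answer: Callable[[str, List[str]], str]    # draft + KG facts -> final answer

    def answer(self, query: str) -> str:
        sub_queries = self.analyze_intent(query)
        kg_facts = self.retrieve_graph(sub_queries)
        refined = self.refine_query(query, kg_facts)
        chunks = self.retrieve_vectors(refined)  # vector search sees the refined query
        prompt = "\n".join(
            ["Question: " + query, "Graph facts:", *kg_facts, "Passages:", *chunks]
        )
        draft = self.generate(prompt)
        return self.refine_answer(draft, kg_facts)


def toy_pipeline() -> DoRagPipeline:
    """Hypothetical stand-ins; a real system would call an LLM, a KG, and a vector store."""
    facts = ["error E1 caused_by parameter P1"]
    return DoRagPipeline(
        analyze_intent=lambda q: [q],
        retrieve_graph=lambda subs: facts,
        refine_query=lambda q, kg: q + " " + " ".join(kg),
        retrieve_vectors=lambda rq: ["P1 sets the buffer size."] if "P1" in rq else [],
        generate=lambda p: "Draft: check P1" if "P1 sets" in p else "Draft: unknown",
        refine_answer=lambda draft, kg: draft.replace("Draft: ", ""),
    )


print(toy_pipeline().answer("Why does E1 occur?"))
```

Passing each stage in as a callable keeps the ordering constraint (graph retrieval before vector retrieval) in one place while leaving the actual models and stores swappable.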

System Modules

Extraction Pipeline

Extracts structured data from documents to build the KG

Model or implementation: Multi-agent system (High/Mid/Low/Covariate agents)
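
A rough sketch of how a four-level extraction pass could populate a graph is shown below. The summary does not say what each level does, so the division of labor here (sections, entities, relations, attributes) is an illustrative assumption, and the regex rules are hypothetical stand-ins for what would be LLM agents in the actual system.

```python
import re
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]
IDENT = r"[A-Z][A-Z0-9_]{2,}"  # assumption: entities look like ALL_CAPS identifiers


def high_level_agent(doc: str) -> List[Tuple[str, str]]:
    """Assumed role: split the document into (section title, body) pairs."""
    sections, title, body = [], "preamble", []
    for line in doc.splitlines():
        if line.startswith("# "):
            if body:
                sections.append((title, " ".join(body)))
            title, body = line[2:].strip(), []
        elif line.strip():
            body.append(line.strip())
    if body:
        sections.append((title, " ".join(body)))
    return sections


def mid_level_agent(body: str) -> Set[str]:
    """Assumed role: find entity mentions."""
    return set(re.findall(rf"\b{IDENT}\b", body))


def low_level_agent(body: str) -> List[Triple]:
    """Assumed role: find fine-grained relations between entities."""
    pattern = rf"\b({IDENT}) (controls|causes|requires) ({IDENT})\b"
    return [(a, verb, b) for a, verb, b in re.findall(pattern, body)]


def covariate_agent(body: str) -> Dict[str, str]:
    """Assumed role: attach attributes (here, default values) to entities."""
    return dict(re.findall(rf"\b({IDENT}) defaults to (\w+)", body))


def build_kg(doc: str) -> Tuple[Set[str], List[Triple], Dict[str, str]]:
    nodes: Set[str] = set()
    edges: List[Triple] = []
    attrs: Dict[str, str] = {}
    for title, body in high_level_agent(doc):
        entities = mid_level_agent(body)
        nodes |= entities
        edges += [(title, "mentions", e) for e in sorted(entities)]
        edges += low_level_agent(body)
        attrs.update(covariate_agent(body))
    return nodes, edges, attrs


doc = "# Memory\nBUFFER_SIZE controls CACHE_HITS. BUFFER_SIZE defaults to 64.\n"
nodes, edges, attrs = build_kg(doc)
print(sorted(nodes), edges, attrs)
```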

Intent Analyzer (Retrieval & Selection)

Decomposes user query into sub-queries

Model or implementation: LLM-based

Graph Traverser (Retrieval & Selection)

Performs multi-hop traversal on the KG

Model or implementation: Graph traversal algorithm
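
The summary does not name the traversal algorithm. One common choice is a bounded breadth-first expansion from the entities mentioned in the query, sketched below under that assumption. The adjacency-list layout and the node names (`ERR_X`, `PARAM_Y`, `FILE_Z`) are hypothetical.

```python
from collections import deque
from typing import Dict, List, Tuple

Graph = Dict[str, List[Tuple[str, str]]]  # node -> [(relation, neighbor), ...]
Triple = Tuple[str, str, str]


def multi_hop(graph: Graph, seeds: List[str], max_hops: int = 2) -> List[Triple]:
    """Breadth-first expansion from seed entities, returning the triples visited."""
    seen = set(seeds)
    queue = deque((seed, 0) for seed in seeds)
    triples: List[Triple] = []
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue  # hop budget exhausted; do not expand this node
        for relation, neighbor in graph.get(node, []):
            triples.append((node, relation, neighbor))
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, depth + 1))
    return triples


kg: Graph = {
    "ERR_X": [("caused_by", "PARAM_Y")],
    "PARAM_Y": [("configured_in", "FILE_Z")],
    "FILE_Z": [("part_of", "MODULE_W")],
}
print(multi_hop(kg, ["ERR_X"], max_hops=2))
```

With `max_hops=2` the traversal links the error to its parameter and then to the file that sets it, which is the kind of causal chain the Concrete Example says plain chunk retrieval tends to miss.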

Refinement Generator

Cross-verifies initial answer against KG and corrects inconsistencies

Model or implementation: LLM (DeepSeek/GPT-4o)
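
The cross-verification idea can be illustrated on claims already reduced to triples. This is a simplified sketch, not the paper's method: the actual module is LLM-based, and the code assumes each (subject, relation) pair has a single correct object in the graph. The function name `refine_claims` and the example triples are hypothetical.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]


def refine_claims(claims: List[Triple], kg: List[Triple]) -> List[Triple]:
    """Keep claims the KG supports, correct ones it contradicts, drop the rest."""
    # Assumption: relations are functional, so (subject, relation) -> one object.
    by_key: Dict[Tuple[str, str], str] = {(s, r): o for s, r, o in kg}
    refined: List[Triple] = []
    for subject, relation, _obj in claims:
        if (subject, relation) not in by_key:
            continue  # no graph evidence at all: treat the claim as unsupported
        # The graph's object replaces the claimed one, so a match is kept
        # unchanged and a conflict is corrected.
        refined.append((subject, relation, by_key[(subject, relation)]))
    return refined


kg = [("ERR_X", "caused_by", "PARAM_Y")]
draft_claims = [
    ("ERR_X", "caused_by", "PARAM_Q"),  # contradicts the graph
    ("ERR_X", "fixed_by", "reboot"),    # not in the graph
]
print(refine_claims(draft_claims, kg))
```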

Novel Architectural Elements

Hierarchical agentic extraction pipeline (High/Mid/Low/Covariate levels) for automated KG construction.
Sequential hybrid retrieval where KG context is explicitly used to rewrite the query before vector search, rather than just merging the two result sets in parallel.
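
The effect of rewriting the query before vector search can be shown with a toy example. The sketch below substitutes bag-of-words counts for a real embedding model and plain concatenation for LLM-based rewriting; both are simplifying assumptions, and the chunks, facts, and function names are hypothetical.

```python
import math
from collections import Counter
from typing import List


def embed(text: str) -> Counter:
    """Toy embedding: lowercase bag-of-words counts (stand-in for a real model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def vector_search(query: str, chunks: List[str], k: int = 1) -> List[str]:
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


def rewrite_query(query: str, kg_facts: List[str]) -> str:
    """Append graph-derived context so the vector search sees related terms."""
    return query + " " + " ".join(kg_facts)


chunks = [
    "error err_x appears in the log when the server starts",
    "param_y limits the number of open sessions",
]
kg_facts = ["err_x caused_by param_y", "param_y limits sessions"]
query = "what triggers err_x"

print(vector_search(query, chunks))                           # original query
print(vector_search(rewrite_query(query, kg_facts), chunks))  # KG-refined query
```

In this toy setup the original query only matches the chunk that repeats the error name, while the refined query ranks the chunk about the related parameter first. A parallel merge would return both result sets but would not let the graph steer which passages the vector search looks for.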

Modeling

Base Model: Evaluated with DeepSeek-R1, DeepSeek-V3, and GPT-4o-mini

Training Method: Inference-only RAG framework

Compute: Experiments conducted on an NVIDIA A100 80GB GPU with 64GB RAM

Comparison to Prior Work

vs. FastGPT/TiDB.AI: DO-RAG uses a more complex multi-agent extraction pipeline and specific post-generation refinement for hallucination mitigation.
vs. Standard RAG: DO-RAG integrates dynamic KG construction end-to-end rather than using a static or external KG.
vs. GraphRAG [not cited in paper]: DO-RAG focuses heavily on the agentic extraction and the specific refinement loop for hallucination, whereas GraphRAG emphasizes community detection.

Limitations

Computational overhead of multi-agent extraction and hybrid retrieval is significant for real-time updates.
Reliance on LLMs means creative models (like DeepSeek-R1) can still hallucinate despite grounding.
Dataset size limited to 245 questions per domain, potentially missing edge cases.
Evaluation performed on proprietary (SunDB) and specific electrical datasets; generalization to broader open domains is untested.

📊 Experiments & Results

Evaluation Setup

Closed-domain QA on specialized technical datasets (SunDB and Electrical)

Benchmarks:

SunDB Dataset: domain-specific QA, database [New]
Electrical Dataset: domain-specific QA, electrical engineering [New]

Metrics:

Answer Relevancy (AR)
Contextual Recall (CR)
Contextual Precision (CP)
Faithfulness (F)
Statistical methodology: Not explicitly reported in the paper

Key Results

Composite score comparisons against external baselines show DO-RAG (SunDB.AI) consistently outperforming existing platforms.

Benchmark	Metric	Baseline	This Paper	Δ
SunDB Dataset	Composite Score	0.685	0.985	+0.300

Ablation study demonstrates the specific contribution of the Knowledge Graph (KG) to performance metrics.

Benchmark	Metric	Baseline	This Paper	Δ
SunDB Dataset	Contextual Recall (CR)	0.977	1.000	+0.023
SunDB Dataset	Answer Relevancy (AR)	0.893	0.944	+0.051
SunDB Dataset	Contextual Precision (CP)	0.938	0.963	+0.025
SunDB Dataset	Faithfulness (F)	0.852	0.804	-0.048
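
The Δ column reports absolute differences. The short script below recomputes them from the table values and adds the relative change, which is a separate quantity that is easy to confuse with it. The variable and function names are illustrative only.

```python
from typing import Dict, Tuple

# (baseline, this paper) pairs copied from the SunDB rows of the table above.
SUNDB_ROWS: Dict[str, Tuple[float, float]] = {
    "Composite Score": (0.685, 0.985),
    "Contextual Recall (CR)": (0.977, 1.000),
    "Answer Relevancy (AR)": (0.893, 0.944),
    "Contextual Precision (CP)": (0.938, 0.963),
    "Faithfulness (F)": (0.852, 0.804),
}


def summarize(rows: Dict[str, Tuple[float, float]]) -> Dict[str, Tuple[float, float]]:
    """Return {metric: (absolute delta, relative change in percent)}."""
    return {
        metric: (round(ours - base, 3), round((ours - base) / base * 100, 1))
        for metric, (base, ours) in rows.items()
    }


for metric, (delta, relative) in summarize(SUNDB_ROWS).items():
    print(f"{metric:<26} {delta:+.3f} absolute, {relative:+.1f}% relative")
```

For Answer Relevancy the relative change works out to about 5.7%, in line with the figure quoted under Evaluation Highlights, while the absolute gain is 0.051.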

Experiment Figures

Bar chart comparing Composite Scores of DO-RAG (SunDB.AI) vs FastGPT, TiDB.AI, and Dify.AI.

Main Takeaways

DO-RAG consistently achieves near-perfect Contextual Recall (~1.0) across tested domains, which supports the efficacy of the hybrid KG-Vector retrieval.
Integrating the Knowledge Graph raises Answer Relevancy (0.893 to 0.944) and Contextual Precision (0.938 to 0.963) on SunDB compared to the vector-only baseline.
Specific models interact differently with the framework; while DeepSeek-V3 improved across all metrics, DeepSeek-R1 showed a slight regression in Faithfulness, suggesting trade-offs between model creativity and strict KG grounding.
The multi-agent extraction pipeline successfully handles heterogeneous data (tables, code, text), which is critical for technical domains like databases and electrical engineering.

📚 Prerequisite Knowledge

Prerequisites

Retrieval-Augmented Generation (RAG)
Knowledge Graphs (KG)
Vector Embeddings
Chain-of-Thought (CoT) Prompting

Key Terms

MMKG: Multimodal Knowledge Graph, a structured representation of knowledge including entities, relations, and attributes extracted from text, tables, and images.

Chain-of-Thought: A prompting technique that encourages the model to generate intermediate reasoning steps before the final answer.

pgvector: An extension for PostgreSQL that enables storing and querying vector embeddings.

Hallucination: When an LLM generates information that is plausible-sounding but factually incorrect or unsupported by the source text.

Contextual Recall: A metric measuring the proportion of the ground-truth (expected-answer) information that is covered by the retrieved context.
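
As a rough illustration of the Contextual Recall definition, the sketch below counts how many expected statements can be found in the retrieved context. Evaluation frameworks typically use an LLM judge to decide whether a statement is supported; the case-insensitive substring check and the example data here are simplifying assumptions.

```python
from typing import List


def contextual_recall(expected_statements: List[str], retrieved_chunks: List[str]) -> float:
    """Fraction of expected statements found in the retrieved context."""
    if not expected_statements:
        return 0.0
    context = " ".join(retrieved_chunks).lower()
    # Substring matching is a crude stand-in for an LLM-based support check.
    hits = sum(1 for statement in expected_statements if statement.lower() in context)
    return hits / len(expected_statements)


expected = [
    "param_y limits sessions",
    "err_x is caused by param_y",
    "restart fixes err_x",
]
retrieved = ["ERR_X is caused by PARAM_Y.", "PARAM_Y limits sessions per user."]
print(contextual_recall(expected, retrieved))  # two of three statements are covered
```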