DARA: Decomposition-Alignment-Reasoning Autonomous Language Agent for Question Answering over Knowledge Graphs

📝 Paper Summary

Agentic reasoning over Knowledge Graphs (KGQA)
Neural-symbolic reasoning

DARA is a hierarchical agent framework, built by fine-tuning open-source LLMs, that improves knowledge graph question answering by separating high-level iterative planning from low-level schema alignment and logical reasoning.

Core Problem

Existing LLM agents for KGQA struggle in structured data environments because they either mix planning with grounding or rely on expensive proprietary models (e.g., GPT-4) through In-Context Learning, which performs poorly compared to classical methods.

Why it matters:

Classical ranking-based methods are brittle and require expert rules, while current LLM agents lag in accuracy on structured tasks.
ICL-based agents using commercial models raise privacy/cost concerns (e.g., ~$1,300 for one test run vs ~$30 for DARA).
Previous fine-tuned agents (e.g., AgentBench) fail to capture the hierarchical nature of KGQA, leading to subpar performance on unseen schemas.

Concrete Example: For the question 'Who is the vice president serving under the president represented by localized_1?', a standard agent might hallucinate relations or fail to decompose the multi-hop requirement. DARA first identifies the task 'Find president represented by localized_1', grounds it to 'find_president', then iteratively identifies the next task 'Find vice president serving under...', effectively chaining the logic.

Key Novelty

Hierarchical Decomposition-Alignment-Reasoning

Explicitly separates high-level planning (what to do next) from low-level grounding (interacting with the KG to find specific schema items), unlike previous flat reasoning chains.
Introduces a 'skim-then-deep-reading' strategy for relation selection: the agent scans many relations quickly, selects top candidates, and only then reads full descriptions to finalize the choice.
Uses iterative task decomposition where the agent generates one subtask at a time based on the execution result of the previous step, rather than planning everything upfront.
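
The iterative decomposition described above can be sketched as a short control loop. This is a minimal illustration, not the authors' implementation: the names run_agent, decompose, and ground are hypothetical, and the toy functions stand in for the fine-tuned LLM and the knowledge graph.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical signatures: in DARA both roles are played by a fine-tuned LLM.
History = List[Tuple[str, str]]  # (subtask, grounded step) pairs
Decomposer = Callable[[str, History], Optional[str]]
Grounder = Callable[[str, History], str]


def run_agent(question: str, decompose: Decomposer, ground: Grounder,
              max_steps: int = 8) -> History:
    """Generate one subtask at a time, conditioning on all earlier results."""
    history: History = []
    for _ in range(max_steps):
        subtask = decompose(question, history)
        if subtask is None:  # the planner signals that the question is covered
            break
        history.append((subtask, ground(subtask, history)))
    return history


# Toy stand-ins so the sketch runs without a model or a KG.
def toy_decompose(question: str, history: History) -> Optional[str]:
    plan = ["find the president", "find the vice president serving under them"]
    return plan[len(history)] if len(history) < len(plan) else None


def toy_ground(subtask: str, history: History) -> str:
    return f"step{len(history) + 1}"


trace = run_agent("toy question", toy_decompose, toy_ground)
```

The point of the design is that the planner sees the grounded result of every earlier step before it proposes the next subtask, so it never commits to a full plan upfront.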

Architecture

The iterative workflow of DARA interacting with a knowledge graph.

Evaluation Highlights

Outperforms GPT-4 (ICL) by +7.7 F1 points on GrailQA and +14.6 F1 points on WebQSP in zero-shot settings using a much smaller Llama-2-7B base.
Surpasses fine-tuned AgentBench-7B by roughly 15-20 F1 points across benchmarks, which supports the advantage of the hierarchical framework over flat agent tuning.
Achieves parity with state-of-the-art ranking-based methods (Pangu) on GrailQA (79.0 vs 77.2 F1) while using significantly less training data (768 trajectories vs fully supervised).

Breakthrough Assessment

8/10

Significantly closes the gap between LLM agents and specialized symbolic solvers for KGQA. Demonstrates that small open-source models can beat GPT-4 on structured reasoning if the agentic framework is well-designed.

⚙️ Technical Details

Problem Definition

Setting: Zero-shot Question Answering over Knowledge Graphs (KGQA) where the model must generalize to unseen schemas (relations/classes) during inference.

Inputs: Natural language question Q and access to a Knowledge Graph G.

Outputs: Executable logical form L (S-expression) that retrieves the correct answer from G.

Pipeline Flow

Group: Iterative Planning loop
Task Decomposition (High-level Planner)
Task Grounding (Low-level execution)
Group: Grounding Components
Schema Item Selection (Skim-then-deep-reading)
Logical Form Construction

System Modules

Task Decomposer

Break down the question or remaining intent into the next immediate subtask.

Model or implementation: Llama-2-7B / Mistral-7B (Fine-tuned)

Relation/Class Retriever (Grounding)

Fetch relevant schema items (relations/classes) from the KG for the current task.

Model or implementation: all-mpnet-base-v2 (Bi-encoder, Fine-tuned)
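
A bi-encoder retriever embeds the task and each candidate schema item independently and ranks candidates by vector similarity. The sketch below shows only that ranking logic. It assumes a toy bag-of-words encoder in place of the fine-tuned all-mpnet-base-v2 model, and the relation names are made up for illustration.

```python
import math
from collections import Counter
from typing import List


def encode(text: str) -> Counter:
    """Toy stand-in for a neural encoder: a bag-of-words count vector."""
    return Counter(text.lower().replace(".", " ").replace("_", " ").split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(task: str, relations: List[str], k: int = 2) -> List[str]:
    """Encode the task and each relation separately, then rank by similarity."""
    query = encode(task)
    return sorted(relations, key=lambda r: cosine(query, encode(r)), reverse=True)[:k]


# Hypothetical relation names, not taken from Freebase.
relations = [
    "government.president.vice_president",
    "music.album.artist",
    "government.politician.party",
]
top = retrieve("find the vice president of the president", relations, k=1)
```

Because the two sides are encoded separately, relation vectors can be computed once and reused for every question, which is the usual reason to prefer a bi-encoder for first-stage retrieval.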

Schema Selector (Grounding)

Select the exact schema item using 'skim-then-deep-reading'.

Model or implementation: Llama-2-7B / Mistral-7B (Fine-tuned)

Logical Form Constructor

Assemble the selected schema and previous logic into an executable S-expression step.

Model or implementation: Llama-2-7B / Mistral-7B (Fine-tuned)
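
Step-by-step logical form construction can be pictured as wrapping the previous step inside a new operator. The sketch below is an assumption-laden illustration: the JOIN operator follows the style of S-expressions commonly used for Freebase semantic parsing, while the relation and entity names are placeholders and the helper names are hypothetical.

```python
from typing import Union

SExpr = Union[str, list]


def to_string(expr: SExpr) -> str:
    """Render a nested Python list as a Lisp-style S-expression string."""
    if isinstance(expr, str):
        return expr
    return "(" + " ".join(to_string(e) for e in expr) + ")"


def join(relation: str, argument: SExpr) -> list:
    """One reasoning step: follow `relation` starting from `argument`."""
    return ["JOIN", relation, argument]


# Hypothetical two-hop question; each step wraps the previous one.
step1 = join("rel.represented_by", "entity_1")
step2 = join("rel.vice_president_of", step1)
rendered = to_string(step2)
```

Here rendered is the string (JOIN rel.vice_president_of (JOIN rel.represented_by entity_1)), which mirrors how a later subtask builds on the result of an earlier one.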

Novel Architectural Elements

Dual-loop mechanism: High-level loop for task decomposition and low-level loop for schema grounding.
Skim-then-deep-reading module: A two-step selection process where the agent first filters relations by name ('skimming') and then requests detailed descriptions only for the top candidates ('deep reading') to save context window and improve accuracy.
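
The two-stage selection above can be sketched as a cheap name-only pass followed by a description-aware pass over a shortlist. This is a minimal sketch under stated assumptions: in DARA the agent itself makes both choices, whereas here a toy token-overlap scorer stands in for the model, and the function names, relation names, and descriptions are illustrative only.

```python
from typing import Callable, Dict


def skim_then_deep_read(
    task: str,
    descriptions: Dict[str, str],  # relation name -> full description
    skim_score: Callable[[str, str], float],
    deep_score: Callable[[str, str, str], float],
    top_k: int = 2,
) -> str:
    # Skim: rank every candidate using its name only (cheap, short context).
    shortlist = sorted(descriptions, key=lambda r: skim_score(task, r), reverse=True)[:top_k]
    # Deep read: look up descriptions only for the shortlist, then pick one.
    return max(shortlist, key=lambda r: deep_score(task, r, descriptions[r]))


def overlap(a: str, b: str) -> float:
    """Toy scorer: number of shared lowercase tokens."""
    def tokens(s: str) -> set:
        return set(s.lower().replace(".", " ").replace("_", " ").split())
    return float(len(tokens(a) & tokens(b)))


# Illustrative candidates; names and descriptions are made up for this sketch.
candidates = {
    "people.person.spouse": "The person this person is married to.",
    "people.person.parents": "The mother or father of this person.",
    "film.film.director": "The person who directed the film.",
}
chosen = skim_then_deep_read(
    "find the father of the person",
    candidates,
    skim_score=overlap,
    deep_score=lambda task, name, desc: overlap(task, desc),
)
```

In this toy run the two people.person relations tie on the name-only pass, and the description breaks the tie in favor of people.person.parents. Only top_k descriptions are ever read, which is where the context saving comes from.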

Modeling

Base Model: Llama-2-7B, Llama-2-13B, Mistral-7B, CodeLlama-7B

Training Method: Full fine-tuning (Supervised)

Objective Functions:

Purpose: Standard language modeling loss.

Formally: Autoregressive cross-entropy loss on the reasoning trajectories.
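
Written out, the standard autoregressive cross-entropy objective has the form below. The symbols are introduced here for illustration: x is the input context, y_1 ... y_T are the tokens of a reasoning trajectory, and θ are the model parameters. Whether the loss covers every token or only the agent's own outputs is not stated in this summary, so the sum over all trajectory tokens is an assumption.

```latex
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta\left(y_t \mid y_{<t},\, x\right)
```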

Adaptation: Full fine-tuning

Training Data:

768 high-quality reasoning trajectories derived from GraphQ, WebQSP, and GrailQA.
Trajectories generated by GPT-4 translation of linearized S-expressions, followed by human verification.

Key Hyperparameters:

learning_rate: 2e-5
batch_size: 128
epochs: Llama-2: 6 epochs, Mistral: 4 epochs
scheduler: Cosine
warmup_ratio: 0.03
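
To make the scheduler settings concrete, the sketch below computes a learning rate per optimizer step. It assumes the common shape of linear warmup followed by cosine decay to zero; the summary only states "Cosine" with a warmup ratio of 0.03, so the exact warmup shape and the final value are assumptions, and lr_at is a hypothetical helper.

```python
import math


def lr_at(step: int, total_steps: int, peak_lr: float = 2e-5,
          warmup_ratio: float = 0.03) -> float:
    """Linear warmup to peak_lr, then cosine decay to zero (an assumed shape)."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


# Assuming one example per trajectory and no gradient accumulation:
# 768 / 128 = 6 optimizer steps per epoch, so 36 steps for six epochs.
schedule = [lr_at(s, total_steps=36) for s in range(37)]
```

Under those assumptions the run is so short that the warmup lasts a single step, and nearly all of training happens on the decaying part of the curve.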

Compute: Training: 2-4 NVIDIA H100 (80GiB) GPUs. Inference: 1 NVIDIA A100 (40GiB) GPU.

Comparison to Prior Work

vs. AgentBench: DARA uses hierarchical planning/grounding instead of a flat tool-use loop.
vs. Pangu: DARA is an autonomous agent navigating the graph, whereas Pangu enumerates all paths and ranks them (brute force).
vs. StructGPT: DARA generates full logical forms (semantic parsing) rather than just retrieving context for QA [not cited in paper].

Limitations

Relies on entity linking being provided or performed by an external tool (evaluated with gold/system entities).
Training data generation via GPT-4 required human verification due to quality issues.
The skim-then-deep-reading process increases the number of model calls per reasoning step compared to single-pass selection.

Reproducibility

Code: https://github.com/UKPLab/acl2024-DARA

Code and model are released in the repository above, and the training data (768 trajectories) is provided. Running the system requires setting up a KG server (e.g., Freebase), which can be complex.

📊 Experiments & Results

Evaluation Setup

Zero-shot KGQA on Freebase. Models are tested on questions with relations/classes not seen during training.

Benchmarks:

GrailQA (Complex KGQA with zero-shot generalization)
WebQSP (Factoid KGQA)
GraphQ (Complex KGQA)

Metrics:

Exact Match (EM)
F1 score (answer overlap)
Statistical methodology: Not explicitly reported in the paper
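
For readers unfamiliar with answer-level scoring, the sketch below computes set-based F1 and exact match for a single question. It is an illustration, not the benchmarks' official scripts: treating EM as equality of answer sets is an assumption (some evaluations compare logical forms instead), and scoring two empty sets as a match is a convention chosen here.

```python
from typing import Iterable


def answer_f1(predicted: Iterable[str], gold: Iterable[str]) -> float:
    """Harmonic mean of precision and recall over answer sets."""
    pred, ref = set(predicted), set(gold)
    if not pred or not ref:
        # Convention used here: two empty sets count as a perfect match.
        return float(pred == ref)
    overlap = len(pred & ref)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def exact_match(predicted: Iterable[str], gold: Iterable[str]) -> bool:
    """True only when the predicted and gold answer sets are identical."""
    return set(predicted) == set(gold)


score = answer_f1(["a", "b"], ["b", "c"])
```

In the example, one of two predictions is correct and one of two gold answers is found, so precision and recall are both 0.5 and so is F1. Dataset-level numbers are typically the mean of these per-question scores.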

Key Results

Compared against	Benchmark	Metric	Baseline	DARA	Δ
ICL-based GPT-4 agent	GrailQA (Dev)	F1	71.3	79.0	+7.7
ICL-based GPT-4 agent	WebQSP	F1	63.7	78.3	+14.6
ICL-based GPT-4 agent	GraphQ	F1	60.4	62.5	+2.1
Fine-tuned agent (AgentLMs/AgentBench-7B)	GrailQA (Dev)	F1	60.0	75.7	+15.7
State-of-the-art ranking-based system	GrailQA (Dev)	EM	77.2	79.0	+1.8

DARA consistently outperforms ICL-based GPT-4 agents across all three benchmarks in zero-shot settings.
DARA significantly outperforms alternative fine-tuned agents (AgentLMs/AgentBench-7B) using the same base model size.
DARA achieves competitive performance with state-of-the-art ranking-based systems.

Experiment Figures

Impact of training data size on performance (WebQSP and GrailQA).

Main Takeaways

Decoupling task decomposition from grounding allows smaller models to handle complex reasoning better than monolithic large models (GPT-4).
Skim-then-deep-reading is crucial: it effectively handles the huge search space of Freebase without overwhelming the context window.
Iterative decomposition prevents error propagation better than single-pass planning, as the agent can adjust based on intermediate retrieval results.
Fine-tuning specific agent capabilities (DARA) is far more data-efficient (768 samples) than general instruction tuning for this task.

📚 Prerequisite Knowledge

Prerequisites

Knowledge Graph schemas (entities, relations, classes)
S-expressions (Lisp-like logical forms)
Agentic workflows (planning, tool use, reasoning)

Key Terms

KGQA: Question Answering over Knowledge Graphs—answering natural language questions by querying a structured graph database.

S-expression: Symbolic expression—a nested list structure used here as a logical query language (similar to SPARQL) to execute against the knowledge graph.
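
To show what "nested list structure" means in practice, the sketch below parses an S-expression string into nested Python lists. It is a generic illustration rather than code from the paper: the parser is deliberately minimal, and the operator, class, relation, and entity names in the example are placeholders.

```python
def parse_sexpr(text: str):
    """Parse one S-expression into nested lists of string atoms."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()

    def read(pos: int):
        # Returns the parsed item and the index of the next unread token.
        if tokens[pos] == "(":
            items, pos = [], pos + 1
            while tokens[pos] != ")":
                item, pos = read(pos)
                items.append(item)
            return items, pos + 1
        return tokens[pos], pos + 1

    tree, end = read(0)
    if end != len(tokens):
        raise ValueError("trailing tokens after the first expression")
    return tree


tree = parse_sexpr("(AND cls (JOIN rel ent))")
```

The result is the list ["AND", "cls", ["JOIN", "rel", "ent"]]: the inner list is a sub-query whose result feeds the outer operator. Unbalanced input raises an IndexError in this sketch, since it does no error recovery.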

Zero-shot evaluation: Testing the model on questions involving KG relations and classes that were never seen during training.

ICL: In-Context Learning—prompting an LLM with examples in the input context without updating its weights.

Entity Linking: The process of identifying which nodes in the Knowledge Graph correspond to mentions in the user's question.

Bi-encoder: A retrieval model that encodes queries and documents (here, relations) into separate vectors, so that relevance can be scored by the similarity between those vectors.

Hits@1: A metric measuring if the single highest-ranked answer is correct.
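
As a small illustration of the definition above, Hits@1 can be computed per question and then averaged. The helper names are hypothetical, and returning 0.0 for an empty prediction list or an empty dataset is a convention chosen for this sketch.

```python
from typing import Iterable, Sequence, Set, Tuple


def hits_at_1(ranked: Sequence[str], gold: Set[str]) -> float:
    """1.0 if the top-ranked prediction is a gold answer, else 0.0."""
    return 1.0 if ranked and ranked[0] in gold else 0.0


def mean_hits_at_1(examples: Iterable[Tuple[Sequence[str], Set[str]]]) -> float:
    """Average Hits@1 over (ranked predictions, gold answers) pairs."""
    scores = [hits_at_1(ranked, gold) for ranked, gold in examples]
    return sum(scores) / len(scores) if scores else 0.0


dataset = [(["a", "b"], {"a"}), (["b", "a"], {"a"})]
average = mean_hits_at_1(dataset)
```

Only the first prediction counts, so the second example scores zero even though the correct answer appears in second place.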