Evaluation Setup
Zero-shot knowledge graph question answering (KGQA) on Freebase. Models are tested on questions whose relations or classes were not seen during training.
Benchmarks:
- GrailQA (Complex KGQA with zero-shot generalization)
- WebQSP (Factoid KGQA)
- GraphQ (Complex KGQA)
Metrics:
- Exact Match (EM)
- F1 score (answer overlap)
- Statistical methodology: Not explicitly reported in the paper
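The two metrics above can be sketched as follows. This is a minimal illustration, not the paper's evaluation script: it assumes each prediction and each gold answer is a set of entity identifiers, and it computes EM over answer sets (some benchmarks define EM over the predicted logical form instead). The function names and the empty-set convention are assumptions made for this example.

```python
from typing import Iterable, List, Set


def answer_f1(predicted: Iterable[str], gold: Iterable[str]) -> float:
    """Set-overlap F1 between predicted and gold answers for one question."""
    pred_set: Set[str] = set(predicted)
    gold_set: Set[str] = set(gold)
    if not pred_set and not gold_set:
        return 1.0  # assumption: two empty answer sets count as a perfect match
    overlap = len(pred_set & gold_set)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_set)
    recall = overlap / len(gold_set)
    return 2 * precision * recall / (precision + recall)


def answer_exact_match(predicted: Iterable[str], gold: Iterable[str]) -> float:
    """1.0 if the answer sets are identical (order ignored), else 0.0."""
    return 1.0 if set(predicted) == set(gold) else 0.0


def macro_average(scores: Iterable[float]) -> float:
    """Average per-question scores over the benchmark."""
    values: List[float] = list(scores)
    return sum(values) / len(values) if values else 0.0
```

Averaging per-question scores, as `macro_average` does, weights every question equally regardless of how many answers it has.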
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| DARA consistently outperforms ICL-based GPT-4 agents across all three benchmarks in zero-shot settings. | | | | |
| GrailQA (Dev) | F1 | 71.3 | 79.0 | +7.7 |
| WebQSP | F1 | 63.7 | 78.3 | +14.6 |
| GraphQ | F1 | 60.4 | 62.5 | +2.1 |
| DARA significantly outperforms alternative fine-tuned agents (AgentLMs/AgentBench-7B) using the same base model size. | | | | |
| GrailQA (Dev) | F1 | 60.0 | 75.7 | +15.7 |
| DARA achieves competitive performance with state-of-the-art ranking-based systems. | | | | |
| GrailQA (Dev) | EM | 77.2 | 79.0 | +1.8 |
Main Takeaways
- Decoupling task decomposition from grounding allows smaller models to handle complex reasoning better than monolithic large models (GPT-4).
- Skim-then-deep-reading is crucial: it effectively handles the huge search space of Freebase without overwhelming the context window.
- Iterative decomposition limits error propagation better than single-pass planning, because the agent can adjust each step based on intermediate retrieval results.
- Fine-tuning specific agent capabilities (DARA) is far more data-efficient (768 samples) than general instruction tuning for this task.