Results on scientific reasoning tasks (PubMedQA, BioASQ, ProcessBank) using a restricted 135-node UMLS KG, plus commonsense and truthfulness benchmarks. GIVE consistently outperforms baselines, and smaller models with GIVE often beat larger models without it.

| Benchmark | Metric | Baseline | GIVE | Δ |
|---|---|---|---|---|
| PubMedQA | Accuracy | 75.6 | 78.2 | +2.6 |
| ProcessBank | Accuracy | 71.6 | 74.8 | +3.2 |
| CommonsenseQA | Accuracy | 68.2 | 73.1 | +4.9 |
| CommonsenseQA (10% KG) | Accuracy | 64.1 | 69.5 | +5.4 |
| TruthfulQA | Win rate (%) | 32.7 | 50.3 | +17.6 |