CoLoTa: A Dataset for Entity-based Commonsense Reasoning over Long-Tail Knowledge

📝 Paper Summary

Entity-based commonsense reasoning
Knowledge Graph Question Answering (KGQA)
Hallucination and Factuality

CoLoTa is a new benchmark of 3,300 queries designed to expose severe reasoning errors and hallucinations in LLMs when dealing with obscure, long-tail entities rather than popular ones.

Core Problem

Current LLMs perform well on commonsense reasoning about popular entities (e.g., Barack Obama) due to memorization but suffer high rates of hallucination and reasoning errors when the same logic is applied to obscure, long-tail entities.

Why it matters:

High-stakes applications require reliable reasoning regardless of entity popularity, but current benchmarks focus on head entities present in training data.
Existing KGQA datasets focus on factoid questions, ignoring the realistic need for combining factual retrieval with multi-step commonsense reasoning.
The specific impact of long-tail knowledge on *reasoning* (not just fact retrieval) has been underexplored.

Concrete Example: An LLM correctly answers 'Could Barack Obama and François Mitterrand have met while president?' by comparing dates. However, for the parallel query 'Could Liau Hiok-hian and Virginia Raggi have met while council members?', the same model hallucinates facts or fails the reasoning steps despite the logic being identical.
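The date comparison behind the head-entity query can be sketched as an interval-overlap check. This is a minimal illustration, not code from the paper: `could_have_met_in_office` is a hypothetical helper, and the term dates are public-record values rather than CoLoTa annotations. The same function would apply unchanged to the long-tail pair once their council terms are retrieved from Wikidata, which is what makes the two queries parallel.

```python
from datetime import date


def could_have_met_in_office(term_a, term_b):
    """Return True if two (start, end) terms of office overlap in time."""
    start_a, end_a = term_a
    start_b, end_b = term_b
    # Two closed intervals overlap when each one starts before the other ends.
    return start_a <= end_b and start_b <= end_a


# Presidential terms from the public record (not copied from the dataset).
obama = (date(2009, 1, 20), date(2017, 1, 20))
mitterrand = (date(1981, 5, 21), date(1995, 5, 17))

print(could_have_met_in_office(obama, mitterrand))
```

The reasoning step is trivial once the facts are in hand; the difficulty for long-tail entities lies in obtaining the correct term dates rather than in the comparison itself.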

Key Novelty

Parallel Long-Tail Commonsense Benchmark (CoLoTa)

Constructs queries by systematically replacing popular 'head' entities in existing datasets (StrategyQA, CREAK) with obscure 'long-tail' counterparts from Wikidata.
Annotates each query with explicit inference rules, reasoning steps, and relevant Wikidata sub-graphs to support both LLM evaluation and Knowledge Graph Question Answering (KGQA).
Ensures all required factual knowledge exists in Wikidata, distinguishing reasoning failures from simple missing information.

Architecture

The workflow for constructing CoLoTa queries from original datasets.

Evaluation Highlights

State-of-the-art LLMs (including OpenAI-o1) show significantly higher hallucination rates on CoLoTa compared to original popular-entity queries.
KGQA methods demonstrate a severe inability to answer queries involving commonsense reasoning, failing to bridge the gap between factual retrieval and logical inference.
Validates that performance drops are due to entity obscurity, as the reasoning logic remains identical to the high-performance original queries.

Breakthrough Assessment

8/10

Significantly exposes the 'reasoning vs. memorization' gap in LLMs by isolating the variable of entity popularity. Provides a dual-purpose benchmark for both pure LLM reasoning and neuro-symbolic KGQA.

⚙️ Technical Details

Problem Definition

Setting: Entity-based Commonsense Reasoning and Knowledge Graph Question Answering (KGQA)

Inputs: Natural language query q (question or claim) targeting specific entities

Outputs: Boolean answer a_q ∈ {True, False} derived via commonsense reasoning over factual knowledge

Pipeline Flow

Query Selection (from StrategyQA/CREAK)
Entity Substitution (Head → Long-tail via Wikidata)
Annotation (Inference rules, Reasoning steps, KG facts)
Query Rewriting (Natural language refinement)

System Modules

Query Selection (Dataset Construction)

Select existing queries where factual knowledge exists in Wikidata

Model or implementation: Manual annotation

Entity Substitution (Dataset Construction)

Replace popular entities with obscure ones using SPARQL similarity search

Model or implementation: SPARQL on Wikidata
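As a rough idea of what such a lookup might look like, the sketch below builds a SPARQL query for the Wikidata Query Service that finds people who held a given position and have few statements attached to their item. The summary does not give the authors' actual queries, so the function name, the statement-count threshold, and the choice of properties (`P31` instance of, `P39` position held) are assumptions for illustration. Only the query string is built here; no request is sent.

```python
def build_long_tail_query(position_qid, max_statements=20, limit=50):
    """Build a SPARQL query for sparsely described humans who held a position.

    max_statements is a hypothetical popularity cutoff, not a value
    reported by the paper.
    """
    return f"""
SELECT ?person ?personLabel ?statements WHERE {{
  ?person wdt:P31 wd:Q5 ;                 # instance of: human
          wdt:P39 wd:{position_qid} ;     # position held
          wikibase:statements ?statements .
  FILTER(?statements <= {max_statements})
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
}}
ORDER BY ?statements
LIMIT {limit}
""".strip()


# "Q30185" is used as an illustrative position QID; confirm on Wikidata
# that it denotes the position you intend before running the query.
print(build_long_tail_query("Q30185", max_statements=10, limit=5))
```

Ordering by statement count surfaces the most sparsely described candidates first, which is one plausible proxy for the "number of Wikidata triples" popularity measure mentioned in the experiment figures.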

Annotation (Dataset Construction)

Define logic required to answer the new query

Model or implementation: Human Experts

Novel Architectural Elements

Systematic construction of parallel queries (Head vs. Long-tail) to isolate the 'popularity' variable while keeping reasoning logic constant
Dual-purpose annotation: Supports both text-based LLM reasoning evaluation AND structured KGQA evaluation via explicit Wikidata sub-graphs
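One way to picture the dual-purpose annotation is as a single record that can be projected into either a text-only view for LLMs or a query-plus-subgraph view for KGQA systems. The schema below is a hypothetical sketch: the field names and the `ColotaEntry` class are assumptions, not the format used in the released dataset, and the example values are placeholders.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


@dataclass
class ColotaEntry:
    """Hypothetical container for one annotated query."""
    query: str
    answer: bool
    entity_qids: List[str]
    inference_rule: str
    reasoning_steps: List[str]
    kg_triples: List[Triple] = field(default_factory=list)

    def llm_view(self) -> str:
        # LLM evaluation sees only the natural-language query.
        return self.query

    def kgqa_view(self) -> Tuple[str, List[Triple]]:
        # KGQA evaluation additionally gets the relevant sub-graph.
        return self.query, self.kg_triples


# Placeholder values only; none of these come from the dataset.
entry = ColotaEntry(
    query="Could person A and person B have met while in office?",
    answer=False,
    entity_qids=["QID_A", "QID_B"],
    inference_rule="If two terms of office overlap, the holders could have met.",
    reasoning_steps=["Retrieve both terms.", "Check whether they overlap."],
    kg_triples=[("QID_A", "position held", "POSITION_X")],
)
print(entry.llm_view())
print(entry.kgqa_view()[1])
```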

Reproducibility

Code: https://github.com/D3Mlab/CoLoTa

The benchmark is publicly available (https://github.com/D3Mlab/CoLoTa). The dataset includes 3,300 queries, unique Wikidata QIDs, relevant Wikidata sub-graphs, inference rules, and reasoning steps. The paper does not mention releasing model weights or training code, as it is primarily a benchmark paper.

📊 Experiments & Results

Evaluation Setup

Zero-shot evaluation of LLMs on Commonsense Reasoning and KGQA tasks using the CoLoTa benchmark vs. Original (StrategyQA/CREAK) queries.

Benchmarks:

CoLoTa: long-tail entity commonsense reasoning [New]
StrategyQA (original subset): multi-hop reasoning QA over head entities
CREAK (original subset): claim verification over head entities

Metrics:

Accuracy
Hallucination Rate
Statistical methodology: Not explicitly reported in the paper
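The two metrics can be sketched as simple ratios. This summary does not state how the paper operationalizes hallucination, so the code assumes each response carries a human-assigned boolean flag; the function names and the toy inputs are illustrative and are not results from the paper.

```python
def accuracy(predictions, gold):
    """Fraction of Boolean predictions that match the gold answers."""
    if not gold or len(predictions) != len(gold):
        raise ValueError("predictions and gold must be non-empty and equal in length")
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


def hallucination_rate(flags):
    """Fraction of responses flagged as containing a hallucinated fact.

    Assumes one boolean flag per response, assigned by an annotator.
    """
    if not flags:
        raise ValueError("flags must be non-empty")
    return sum(flags) / len(flags)


# Toy inputs for illustration only.
gold = [True, False, False, True]
preds = [True, False, True, True]
flags = [False, True, False, False]

print(accuracy(preds, gold))
print(hallucination_rate(flags))
```

Running both functions separately on the original queries and on their CoLoTa counterparts would give the head versus long-tail comparison the evaluation setup describes.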

Experiment Figures

Distribution of entity popularity (measured by number of Wikidata triples) for CoLoTa entities vs. Original entities.

Distribution of reasoning skills required for CoLoTa queries (Question Answering vs. Claim Verification).

Main Takeaways

LLMs struggle significantly more with long-tail entities: Performance drops and hallucination spikes when the exact same reasoning logic is applied to obscure entities compared to popular ones.
Existing KGQA methods are ill-equipped for commonsense reasoning: They perform poorly on CoLoTa because they focus on factoid retrieval rather than the multi-step reasoning required by this benchmark.
The dataset covers diverse reasoning skills: Includes temporal, numeric, geographical, and domain-specific reasoning (sports, history), ensuring broad coverage.
CoLoTa serves as a diagnostic tool: It separates 'reasoning capability' from 'fact memorization', indicating that apparent reasoning success on popular entities often relies on memorized correlations.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Large Language Models (LLMs) and hallucination
Knowledge of Knowledge Graphs (specifically Wikidata)
Familiarity with Commonsense Reasoning tasks

Key Terms

Long-tail entities: Entities that appear infrequently in training corpora and real-world data, often leading to poorer model performance compared to popular (head) entities

KGQA: Knowledge Graph Question Answering, the task of answering natural language questions by retrieving and reasoning over structured facts in a knowledge graph

Hallucination: The generation of content by an LLM that contradicts ground truth facts or is nonsensical

StrategyQA: A benchmark dataset requiring multi-step implicit reasoning to answer True/False questions

CREAK: A benchmark dataset for claim verification requiring commonsense reasoning about entities

Wikidata: A large-scale, collaboratively edited, open knowledge graph

SPARQL: The W3C-standard query language for RDF data, used to retrieve specific facts from knowledge graphs like Wikidata

QID: Unique identifier for an item in Wikidata (e.g., Q42)

Inference rule: A logical statement (axiom) expressing the commonsense knowledge required to answer a query (e.g., 'If X has property Y, then Z')

Head entities: Popular, well-known entities that appear frequently in datasets (e.g., Barack Obama)