On the Consistency of Commonsense in Large Language Models

📝 Paper Summary

Commonsense Reasoning, Knowledge Evaluation, Benchmark Construction

CoCo is an automatically constructed benchmark that evaluates whether LLMs consistently understand and apply commonsense knowledge rather than just memorizing it, revealing significant gaps between retrieval and reasoning.

Core Problem

Existing commonsense evaluations focus on downstream task performance, failing to distinguish whether correct answers stem from true understanding or rote memorization, and often suffer from data contamination.

Why it matters:

High accuracy on standard benchmarks may mask a lack of genuine reasoning ability if models merely memorize the test set
Current methods rely heavily on manual annotation, making large-scale, consistent evaluation of specific knowledge triples difficult
Without measuring consistency, it is impossible to know if an LLM's reasoning failure is due to a lack of knowledge or an inability to apply it

Concrete Example: A model might correctly answer a complex reasoning question about 'PersonX playing football', yet fail a simple direct question about the specific effect (getting tired), which is the question that would confirm it actually knows the underlying fact.

Key Novelty

Consistency of Commonsense (CoCo) Benchmark

Constructs a three-tier evaluation (Memorization, Comprehension, Application) in which all three tiers are derived from the same underlying atomic knowledge triples
Reverses the typical evaluation direction: starts with atomic facts from a Knowledge Graph, then generates conceptual and multi-hop reasoning questions based on those specific facts
Introduces consistency metrics (e.g., FAISCORE) that only credit reasoning success if the model also demonstrates possession of the prerequisite atomic knowledge

Architecture

The automated data construction pipeline for CoCo.

Evaluation Highlights

GPT-4 achieves the highest performance but still lags behind human performance by 17.7% on average across tasks
While GPT-4 scores 81.66% on Memorization, its Faithfulness Score (consistency between knowledge and reasoning) drops to 55.49%, indicating frequent hallucination or lucky guesses
KnowCoT (Knowledge-based Chain-of-Thought) improves GPT-4's Application Score by ~3.7 points over vanilla prompting, enhancing consistency

Breakthrough Assessment

7/10

Strong methodological contribution in automated benchmark construction and consistency metrics. The results expose a critical weakness in current LLMs (knowledge-reasoning disconnect), though the approach relies on existing knowledge graphs.

⚙️ Technical Details

Problem Definition

Setting: Multiple Choice Question Answering (MCQA) across three interconnected tasks: Memorization, Comprehension, and Application

Inputs: Natural language question q derived from a knowledge triple k or logical query

Outputs: Predicted answer option p_k (A, B, C, D, or E)

Pipeline Flow

Atomic Knowledge Sampling (select diverse triples from ATOMIC)
Verbalization (convert triples to natural language QA for Memorization)
Conceptualization (generate abstract concepts via GPT-2 for Comprehension QA)
Logical Query Generation (construct multi-hop paths for Application QA)
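
The verbalization step above can be illustrated with a minimal sketch that turns one triple into a five-option multiple-choice item. The template wording, the `verbalize` helper, and the distractor strings are all assumptions made for illustration; the summary does not give the paper's actual templates or its distractor-selection procedure.

```python
import random

# Hypothetical templates keyed by ATOMIC relation; the wording is an assumption.
TEMPLATES = {
    "xEffect": "What is a likely effect on PersonX after {head}?",
    "xIntent": "What is PersonX's likely intention behind this event: {head}?",
}

def verbalize(triple, distractors, rng):
    """Turn one (head, relation, tail) triple into a five-option MCQA item."""
    head, relation, tail = triple
    options = [tail] + list(distractors)[:4]  # one gold answer plus four distractors
    rng.shuffle(options)                      # avoid a fixed gold position
    labels = "ABCDE"
    return {
        "question": TEMPLATES[relation].format(head=head),
        "options": dict(zip(labels, options)),
        "answer": labels[options.index(tail)],
    }

# Distractors are invented placeholders, not benchmark data.
item = verbalize(
    ("PersonX plays football", "xEffect", "PersonX feels tired"),
    ["PersonX feels bored", "PersonX buys a car",
     "PersonX gets a haircut", "PersonX misses the bus"],
    random.Random(0),
)
print(item["question"])
print(item["options"][item["answer"]])
```

Because the gold label is looked up after shuffling, the item stays correct regardless of the random seed.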

System Modules

Triple Sampler (Data Construction)

Select diverse and representative knowledge triples from the CSKG

Model or implementation: Sentence-BERT (for embedding-based diversity sampling)
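
The summary does not say which diversity criterion the sampler applies to the Sentence-BERT embeddings. As one plausible reading, the sketch below uses greedy farthest-point selection under cosine distance. The algorithm choice, the function names, and the toy 2-D vectors (standing in for real sentence embeddings) are assumptions, not the paper's method.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity for two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def farthest_point_sample(embeddings, k):
    """Greedily pick up to k indices, each one farthest from the picks so far."""
    selected = [0]  # arbitrary seed: start from the first item
    min_dist = [cosine_distance(e, embeddings[0]) for e in embeddings]
    while len(selected) < min(k, len(embeddings)):
        candidates = [i for i in range(len(embeddings)) if i not in selected]
        nxt = max(candidates, key=lambda i: min_dist[i])
        selected.append(nxt)
        # Track each item's distance to its nearest selected neighbour.
        for i, e in enumerate(embeddings):
            min_dist[i] = min(min_dist[i], cosine_distance(e, embeddings[nxt]))
    return selected

# Toy vectors; items 0 and 1 are near-duplicates, so only one should be kept.
toy_embeddings = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [-1.0, 0.0]]
picked = farthest_point_sample(toy_embeddings, 3)
print(picked)
```

In practice the vectors would come from an embedding model applied to the verbalized triples; the selection loop itself would not change.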

Concept Generator (Data Construction)

Generate abstract concepts from specific events

Model or implementation: GPT-2 fine-tuned on ABSTRACT ATOMIC

Verifier (Data Construction)

Filter out low-plausibility generated concepts and reasoning chains

Model or implementation: Vera (T5-based plausibility scorer)

Evaluator

Assess LLM performance on generated MCQA tasks

Model or implementation: Various LLMs (e.g., GPT-4, Llama-3)

Novel Architectural Elements

Hierarchical dataset construction: Application questions are strictly derived from the exact same atomic facts used in Memorization questions to enable consistency checking

Modeling

Base Model: Evaluated on multiple models: Mistral-7B-Instruct-v0.3, Llama3-8B-Instruct, Qwen2.5-7B/14B-Instruct, Llama2-13B-Chat, GPT-3.5-turbo, GPT-4o

Training Method: Data generation pipeline uses fine-tuned GPT-2; Evaluation uses zero-shot and CoT prompting

Compute: Not reported in the paper

Comparison to Prior Work

vs. KoLA: CoCo enforces intrinsic connections between memorization and application samples, whereas KoLA treats them as separate tasks
vs. CHARM: CoCo automatically generates dependency pairs from CSKGs, scaling to 39K samples compared to CHARM's manual annotation
vs. CommonsenseQA: CoCo tests specific atomic knowledge consistency rather than just downstream accuracy

Limitations

Does not fully encompass dimensions like temporal reasoning or causality beyond ATOMIC's scope
Reliance on predefined templates for verbalization may not reflect natural language diversity
Evaluation depends on the quality and coverage of the underlying ATOMIC knowledge graph
Metrics like FAISCORE assume the model uses the specific annotated knowledge path, though other valid paths might exist

Reproducibility

Code: https://github.com/1iguozheng/CoCo

The code and dataset are publicly available at the repository above. The dataset contains ~39K samples, and code for metric calculation is included. The specific fine-tuned GPT-2 weights for conceptualization are not explicitly linked, but the method is described.

📊 Experiments & Results

Evaluation Setup

Zero-shot and Chain-of-Thought (CoT) prompting on MCQA tasks

Benchmarks:

CoCo (Commonsense Consistency Benchmark) [New]

Metrics:

MEMSCORE (Memorization Accuracy)
COMSCORE (Comprehension Score)
REASCORE (Reasoning Accuracy)
FAISCORE (Faithfulness Score)
APPSCORE (Application Score)
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
CoCo	MEMSCORE (Memorization)	90.53	79.81	-10.72
CoCo	FAISCORE (Faithfulness)	90.25	49.68	-40.57
CoCo	APPSCORE (Application)	71.43	75.12	+3.69
CoCo	REASCORE (Reasoning)	52.35	65.16	+12.81
CoCo	FAISCORE (Faithfulness)	49.68	55.49	+5.81

Experiment Figures

The KnowCoT prompting strategy illustration.

Error analysis charts for REASCORE, FAISCORE, and APPSCORE across different models.

Main Takeaways

A significant gap exists between LLM reasoning accuracy (REASCORE) and faithfulness (FAISCORE), suggesting that high benchmark performance often relies on shallow heuristics or memorization rather than consistent logic.
KnowCoT prompting consistently improves performance across all metrics, particularly in Application and Faithfulness, by forcing the model to explicitly recall knowledge first.
Smaller models (e.g., Mistral-7B) perform worse and struggle more with knowledge memorization than GPT-4.
LLMs show uneven mastery of different commonsense relations; for example, they perform well on 'xIntent' (intentions) but are weaker on 'oEffect' (effects on others).

📚 Prerequisite Knowledge

Prerequisites

Understanding of Commonsense Knowledge Graphs (CSKGs) like ATOMIC
Familiarity with Chain-of-Thought (CoT) prompting
Basic knowledge of logical queries (intersection, projection) over graphs
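
For readers unfamiliar with the two query operators named above, here is a minimal sketch of projection and intersection over a toy graph. The `GRAPH` contents (other than the football example used in this summary) and the helper names are illustrative assumptions, not data or code from the benchmark.

```python
# Toy graph: (head, relation) -> set of tails. Entries are illustrative only.
GRAPH = {
    ("PersonX plays football", "xEffect"): {"PersonX feels tired", "PersonX gets sweaty"},
    ("PersonX runs a marathon", "xEffect"): {"PersonX feels tired", "PersonX gets a medal"},
    ("PersonX feels tired", "xWant"): {"to rest"},
}

def project(entities, relation):
    """Projection: follow `relation` from every entity and union the tails."""
    out = set()
    for entity in entities:
        out |= GRAPH.get((entity, relation), set())
    return out

def intersect(*answer_sets):
    """Intersection: keep only entities that satisfy every branch."""
    return set.intersection(*answer_sets)

# "What is an effect of playing football AND of running a marathon?"
shared_effect = intersect(
    project({"PersonX plays football"}, "xEffect"),
    project({"PersonX runs a marathon"}, "xEffect"),
)
# Chaining one more projection gives a multi-hop query.
two_hop = project(shared_effect, "xWant")
print(shared_effect, two_hop)
```

Composing these two operators is what lets a multi-hop question be traced back to the individual triples it depends on.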

Key Terms

CSKG: Commonsense Knowledge Graph—a structured database of everyday concepts and their relationships (e.g., 'eating' causes 'fullness')

ATOMIC: A specific CSKG focused on if-then reasoning about events, social interactions, and mental states

Atomic Knowledge: A single triple (head, relation, tail) representing a basic fact, e.g., (PersonX plays football, xEffect, PersonX feels tired)

KnowCoT: A prompting strategy proposed in this paper that explicitly instructs the model to recall relevant knowledge before deducing an answer

Conceptualization: The process of abstracting a specific event (e.g., 'playing football') into a higher-level concept (e.g., 'playing tiring sports')

Logical Queries: Structured requests to find entities in a graph that satisfy specific logical conditions (e.g., 'what is the effect of X AND the effect of Y')

MEMSCORE: Metric measuring the accuracy of answering direct questions about atomic knowledge triples

FAISCORE: Faithfulness Score—measures the percentage of correctly answered reasoning questions where the model also correctly answered the prerequisite knowledge questions

APPSCORE: Application Score—measures the percentage of reasoning questions answered correctly, GIVEN that the model demonstrated it memorized the required knowledge
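
The FAISCORE and APPSCORE definitions above condition on different events, which a small worked example makes concrete. This sketch follows one literal reading of the definitions in this glossary: FAISCORE as the share of correct reasoning answers backed by correct prerequisite answers, and APPSCORE as the share of known-prerequisite questions answered correctly. The paper's exact formulas may differ, and the record format and function name are assumptions.

```python
def consistency_scores(records):
    """Compute REASCORE, FAISCORE, and APPSCORE as percentages.

    Each record describes one reasoning question:
      'reasoning_correct': bool, whether the reasoning question was answered correctly
      'prereq_correct':    list of bools, one per prerequisite memorization question
    """
    def pct(num, den):
        return 100.0 * num / den if den else float("nan")

    knows = [all(r["prereq_correct"]) for r in records]  # all prerequisites known
    right = [r["reasoning_correct"] for r in records]
    both = sum(k and c for k, c in zip(knows, right))
    return {
        "REASCORE": pct(sum(right), len(records)),  # plain reasoning accuracy
        "FAISCORE": pct(both, sum(right)),          # of correct answers, share that are grounded
        "APPSCORE": pct(both, sum(knows)),          # of known facts, share applied correctly
    }

# Made-up toy records, not benchmark results.
records = [
    {"reasoning_correct": True,  "prereq_correct": [True, True]},
    {"reasoning_correct": True,  "prereq_correct": [True, False]},
    {"reasoning_correct": False, "prereq_correct": [True, True]},
    {"reasoning_correct": True,  "prereq_correct": [True, True]},
    {"reasoning_correct": True,  "prereq_correct": [False, True]},
]
scores = consistency_scores(records)
print(scores)
```

In this toy set, four of five reasoning answers are correct, but only two of those four are backed by correct prerequisite answers, so the faithfulness figure falls well below the raw reasoning accuracy.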