Core Knowledge Deficits in Multi-Modal Language Models

📝 Paper Summary

Multi-modal Large Language Models (MLLMs), Cognitive Evaluation, Developmental Psychology in AI

MLLMs suffer from a fundamental deficit in core knowledge: they perform worse on rudimentary tasks that come naturally to infants than on complex reasoning, and they often rely on shortcut learning.

Core Problem

State-of-the-art MLLMs excel at high-level reasoning but consistently fail at rudimentary tasks intuitive to humans (like counting, spatial reasoning, and object permanence), creating a paradox.

Why it matters:

Strong high-level performance fails to generalize to real-world scenarios, where small changes in conditions cause significant performance drops
The lack of foundational 'developmental start-up software' suggests models lack the grounding required for robust, genuine understanding
Dependence on spurious correlations (shortcuts) rather than causal understanding makes models brittle and vulnerable to perturbations

Concrete Example: In an object permanence test (Sensorimotor stage), a ball is hidden under a cup, which is then shuffled. While a human infant tracks the ball easily, a large MLLM fails to localize it despite being able to solve complex math problems, showing a lack of basic physical understanding.

Key Novelty

CoreCognition Benchmark & Concept Hacking

CoreCognition: A large-scale benchmark of 1,503 questions covering 12 core abilities grounded in Piaget's developmental stages (Sensorimotor, Preoperational, Concrete/Formal Operational) to probe fundamental cognitive building blocks.
Concept Hacking: A controlled evaluation method that manipulates causal features in images to perturb ground-truth labels, distinguishing whether models possess genuine knowledge or rely on visual shortcuts.
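
One way to read a Concept Hacking result is as a paired comparison: each control item has a manipulated twin whose ground-truth label changes with the causal feature. The sketch below is illustrative only. The `PairedResult` schema, the `classify_pair` helper, and the category names are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PairedResult:
    """Model answers on a control item and its manipulated twin (hypothetical schema)."""
    control_pred: str
    control_gold: str
    manipulated_pred: str
    manipulated_gold: str  # differs from control_gold because the causal feature changed


def classify_pair(r: PairedResult) -> str:
    """Label a pair by which of the two items the model answered correctly."""
    control_ok = r.control_pred == r.control_gold
    manipulated_ok = r.manipulated_pred == r.manipulated_gold
    if control_ok and manipulated_ok:
        return "consistent"        # follows the causal feature in both versions
    if control_ok:
        return "shortcut-suspect"  # right only on the unmanipulated version
    if manipulated_ok:
        return "manipulated-only"
    return "both-wrong"
```

A model that keeps giving the control answer after the causal feature has changed lands in "shortcut-suspect", which is the pattern the method is designed to expose.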

Architecture

Overview of the CoreCognition benchmark distribution across the four Piagetian developmental stages.

Evaluation Highlights

GPT-o1 achieves 74.91% average accuracy on CoreCognition, trailing human performance (90.82%) by 15.91 percentage points.
Models demonstrate a 'reversed' capability curve: they consistently underperform on low-level Sensorimotor abilities relative to high-level Formal Operational abilities, whereas humans score consistently high on both.
Low-level core abilities exhibit little to no scalability with increased model parameters, unlike high-level abilities which improve with scale.

Breakthrough Assessment

8/10

Provides compelling evidence of a fundamental 'core deficit' in MLLMs using a rigorous cognitive science framework, challenging the assumption that scaling solves basic reasoning.

⚙️ Technical Details

Problem Definition

Setting: Evaluation of Multi-modal Large Language Models (MLLMs) on fundamental cognitive tasks

Inputs: Visual input (image/video) and a corresponding text question probing a specific core ability

Outputs: Selected answer from multiple choices (converted from free-form text)

Pipeline Flow

Prototype Scenario Design (Cognitive Science based)
Instantiation (Generating Images/Videos)
Circular Inference (Rotating Options)
Response Mapping (Template/LLM-Judge)
Scoring & Analysis

System Modules

Prototype Designer

Operationalize theoretical constructs (e.g., object permanence) into abstract test scenarios

Model or implementation: Human Experts

Circular Evaluator

Execute inference while mitigating position bias

Model or implementation: Target MLLM (e.g., GPT-4o, InternVL)
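
A minimal sketch of circular evaluation follows. The summary only says that options are rotated; the rule that an item counts as correct only when the model is right under every rotation is an assumption here, and `ask` is a hypothetical callable standing in for the target MLLM.

```python
from typing import Callable, List, Tuple


def rotations(options: List[str], answer_idx: int) -> List[Tuple[List[str], int]]:
    """Return every cyclic shift of the options with the shifted index of the correct one."""
    n = len(options)
    out = []
    for k in range(n):
        shifted = options[k:] + options[:k]
        # An option originally at index i sits at (i - k) mod n after shifting by k.
        out.append((shifted, (answer_idx - k) % n))
    return out


def circular_correct(
    ask: Callable[[str, List[str]], int],
    question: str,
    options: List[str],
    answer_idx: int,
) -> bool:
    """Assumed scoring rule: correct only if the model is right under every rotation."""
    return all(ask(question, opts) == idx for opts, idx in rotations(options, answer_idx))
```

Under this rule, a stub model that always picks the first option fails any item with more than one option, which is how the rotation mitigates position bias.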

Response Mapper

Map free-form model outputs to valid MCQ options

Model or implementation: Hybrid (Template Matching + LLM-as-a-Judge)
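
The hybrid mapping can be sketched as template matching with a fallback. The regular expressions below are hypothetical templates, since the summary does not list the patterns actually used, and `judge` is a stand-in callable for the LLM-as-a-Judge step rather than a real API.

```python
import re
from typing import Callable, List, Optional

# Hypothetical templates; the real patterns are not given in the summary.
_PATTERNS = [
    re.compile(r"[Aa]nswer\s*(?:is|:)\s*\(?([A-Z])\)?(?![A-Za-z])"),
    re.compile(r"^\s*\(?([A-Z])\)?\s*[.:)]?\s*$"),
]


def map_response(
    text: str,
    options: List[str],
    judge: Optional[Callable[[str, List[str]], Optional[str]]] = None,
) -> Optional[str]:
    """Return an option letter, or None if no valid letter can be extracted."""
    valid = [chr(ord("A") + i) for i in range(len(options))]
    # Step 1: template matching on the raw model output.
    for pattern in _PATTERNS:
        match = pattern.search(text)
        if match and match.group(1) in valid:
            return match.group(1)
    # Step 2: fall back to the judge callable when no template fires.
    if judge is not None:
        letter = judge(text, options)
        if letter in valid:
            return letter
    return None
```

Returning `None` rather than guessing keeps unmappable outputs visible, so they can be scored as incorrect or inspected separately.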

Novel Architectural Elements

Taxonomy of 12 core abilities structured by Piagetian developmental stages (Sensorimotor to Formal Operational)
Integration of causal feature manipulation (Concept Hacking) into the evaluation pipeline to detect spurious correlations

Modeling

Base Model: Various (Evaluating 230 models including GPT-4o, Claude 3.5 Sonnet, Qwen2.5-VL, InternVL)

Training Method: Not applicable (Evaluation paper)

Adaptation: None (Inference only)

Trainable Parameters: 0

Compute: Inference conducted in varying environments for 230 models (exact compute not aggregated)

Comparison to Prior Work

vs. M3GIA/Marvel: CoreCognition targets early-emerging 'core' abilities (low-level) rather than high-level general intelligence.
vs. SEED-Bench: CoreCognition is grounded in developmental cognitive science (Piaget) rather than task taxonomies.
vs. DevBench: CoreCognition targets multi-modal knowledge, whereas DevBench focuses solely on language learning trajectories.

Limitations

Proprietary models (GPT-4o) still underperform humans significantly, suggesting architecture/data scaling isn't enough.
High-level performance is disconnected from low-level competence, indicating a lack of internal coherence.
Current MLLMs show little to no scaling on low-level abilities, in contrast to high-level ones.

Reproducibility

The paper describes a benchmark, 'CoreCognition', with 1,503 samples. Code availability is not explicitly stated in the text. The evaluation uses the official codebases of the target models. Human validation was performed via Amazon Mechanical Turk.

📊 Experiments & Results

Evaluation Setup

Large-scale evaluation of 230 MLLMs using circular evaluation on Multiple Choice Questions

Benchmarks:

CoreCognition (Multi-modal Core Knowledge Evaluation) [New]

Metrics:

Accuracy (%)
Pearson Correlation (between abilities)
Statistical methodology: Pairwise t-tests for performance differences; Pearson correlations for inter-ability dependencies
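
As a rough illustration of the inter-ability analysis, the sketch below computes Pearson correlations between per-model accuracy vectors using only the standard library. The `scores` layout (ability name mapped to one accuracy per model, in a fixed model order) is an assumption for illustration, not the paper's data format.

```python
from math import sqrt
from typing import Dict, List


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation coefficient of two equal-length samples."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / sqrt(var_x * var_y)


def correlation_matrix(scores: Dict[str, List[float]]) -> Dict[str, Dict[str, float]]:
    """Pairwise Pearson correlations between abilities, computed across models."""
    names = list(scores)
    return {a: {b: pearson(scores[a], scores[b]) for b in names} for a in names}
```

Each entry of the matrix answers one question: across models, does doing well on ability `a` go together with doing well on ability `b`?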

Key Results

Comparison of top MLLMs against human performance shows a significant deficit, particularly in rudimentary core knowledge.

Benchmark	Metric	Human Baseline	Model	Δ (pp)
CoreCognition	Average Accuracy (%)	90.82	74.91	-15.91
CoreCognition	Average Accuracy (%)	90.82	69.25	-21.57
CoreCognition	Average Accuracy (%)	90.82	68.29	-22.53
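
Each Δ value in the rows above is the model score minus the 90.82 human score, in percentage points. The short check below reproduces the three gaps from the reported accuracies; the variable names are illustrative.

```python
HUMAN_ACCURACY = 90.82                    # human score shared by all three rows
MODEL_ACCURACIES = [74.91, 69.25, 68.29]  # model scores, in row order

# Gap in percentage points, rounded to the two decimals used in the table.
gaps = [round(acc - HUMAN_ACCURACY, 2) for acc in MODEL_ACCURACIES]
print(gaps)
```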

Experiment Figures

Performance comparison between Humans and MLLMs across the four developmental stages.

Pearson correlation matrix of model performance across the 12 core abilities.

Main Takeaways

Core Knowledge Deficit: MLLMs perform significantly worse on low-level (Sensorimotor) abilities than on high-level ones, contradicting the human developmental trajectory.
Misaligned Dependency: Mastery of high-level tasks in MLLMs does not correlate with mastery of the low-level prerequisite abilities (correlations often < 0.4), unlike in humans, where high-level abilities are scaffolded on low-level ones.
Scaling Failure: Increasing model size improves high-level reasoning but yields little to no improvement on core low-level abilities (zero or negative scaling), suggesting current scaling laws don't address core cognitive grounding.
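
One simple way to quantify "little to no scaling" is the least-squares slope of accuracy against the logarithm of model size. This is a sketch under that assumption; the summary does not state which scalability measure the paper uses, and `scaling_slope` is a hypothetical helper.

```python
from math import log10
from typing import List


def scaling_slope(params_billions: List[float], accuracy: List[float]) -> float:
    """Least-squares slope of accuracy (%) against log10(parameter count in billions)."""
    if len(params_billions) != len(accuracy) or len(accuracy) < 2:
        raise ValueError("need two samples of equal length >= 2")
    xs = [log10(p) for p in params_billions]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(accuracy) / len(accuracy)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all models have the same size; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, accuracy))
    return sxy / sxx
```

A slope near zero (or below zero) for an ability means a tenfold increase in parameters buys essentially no accuracy on it, while a clearly positive slope indicates that the ability improves with scale.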

📚 Prerequisite Knowledge

Prerequisites

Developmental Psychology (Piagetian stages)
Multi-modal Large Language Models (MLLM)
Shortcut Learning / Spurious Correlations

Key Terms

Core Knowledge: Fundamental cognitive abilities innate to humans or developed early (e.g., object permanence, counting) that underpin advanced reasoning

Sensorimotor Stage: Piaget's first developmental stage where infants develop concepts like object permanence through sensory interaction

Preoperational Stage: Piaget's second stage characterized by the development of symbolic representations

Concrete Operational Stage: Piaget's third stage involving systematic reasoning about numbers, motion, and perspective

Formal Operational Stage: Piaget's fourth stage involving abstract reasoning and intentionality

Circular Evaluation: An evaluation strategy that rotates answer options cyclically to mitigate position bias in multiple-choice questions

Concept Hacking: A proposed evaluation method that manipulates causal image features to test if models rely on shortcuts versus genuine concept understanding

MLLM: Multi-modal Large Language Model, an AI model capable of processing and reasoning over both text and visual inputs

MCQ: Multiple-Choice Question