Evaluation Setup
Zero-shot completion of four-question quizzes, with human subjects compared against LLMs.
Benchmarks:
- Semantic Structure Task (Mapping semantic relations to symbol patterns) [New]
- Semantic Content Task (Mapping semantic attributes, categorical or numeric, to symbols) [New]
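To make the task format concrete, here is a minimal sketch of what a Semantic Structure item and its rule-based scoring might look like. The specific words, symbols, and field names below are invented for illustration and are not taken from the benchmark itself:

```python
# Hypothetical item in the spirit of the Semantic Structure Task:
# the left-hand side pairs words by a semantic relation (here, antonymy),
# the right-hand side encodes that relation as a symbol pattern, and the
# intended rule-based solution reapplies the same pattern to the probe.
item = {
    "examples": [("hot cold", "% #"), ("wet dry", "% #")],  # antonym pair -> two distinct symbols
    "probe": "tall short",
    "reference": "% #",  # intended rule-based solution
}

def score(response: str, item: dict) -> int:
    """Return 1 if the response matches the intended rule-based solution, else 0."""
    return int(response.strip() == item["reference"])
```

Averaging `score` over a subject's answers gives the match-to-reference proportion reported below.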
Metrics:
- Match to reference (Proportion of answers matching the intended rule-based solution)
- Statistical methodology: Logistic regression with interaction terms (Subject Type × Condition). Significance testing using likelihood ratio tests.
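The statistical methodology can be sketched as follows. This is a self-contained illustration on simulated data with hypothetical effect sizes (not the paper's data): a full logistic model with a Subject Type × Condition interaction is compared against a reduced main-effects model via a likelihood ratio test, with the Bernoulli log-likelihood written out directly:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import chi2

def neg_loglik(beta, X, y):
    # Negative Bernoulli log-likelihood with a logit link,
    # using logaddexp for numerical stability.
    z = X @ beta
    return np.sum(np.logaddexp(0.0, z) - y * z)

def max_loglik(X, y):
    # Fit the logistic regression by direct MLE; return the maximized log-likelihood.
    res = minimize(neg_loglik, np.zeros(X.shape[1]), args=(X, y), method="BFGS")
    return -res.fun

rng = np.random.default_rng(0)
n = 400
llm = rng.integers(0, 2, n)    # subject type: 0 = human, 1 = LLM
perm = rng.integers(0, 2, n)   # condition: 0 = Defaults, 1 = Permuted Pairs
# Hypothetical data-generating effects: LLM accuracy collapses only under permutation.
logit_p = 1.5 - 0.1 * llm - 0.2 * perm - 1.5 * llm * perm
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logit_p))).astype(float)

X_full = np.column_stack([np.ones(n), llm, perm, llm * perm])  # with interaction
X_red = X_full[:, :3]                                          # main effects only
lr = 2.0 * (max_loglik(X_full, y) - max_loglik(X_red, y))      # LR statistic
p = chi2.sf(lr, df=1)                                          # 1 extra parameter
```

A small p-value here indicates that the Subject Type × Condition interaction improves fit, i.e. the two subject types respond differently to the manipulation.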
Key Results
*Performance in the 'Defaults' condition shows advanced LLMs matching human abilities, but 'Permuted Pairs' reveals model fragility.*

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Semantic Structure (Defaults) | Match to reference | 0.85 | 0.85 | 0.00 |
| Semantic Structure (Permuted Pairs) | Match to reference | 0.85 | 0.55 | -0.30 |
| Semantic Structure (Permuted Pairs) | Match to reference | 0.85 | 0.40 | -0.45 |

*The 'Randoms' condition tests whether subjects can ignore misleading semantic text and solve using only the symbols (RHS). Humans adapt; models fail.*

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Semantic Structure (Randoms) | Match to reference | 0.80 | 0.30 | -0.50 |
| Semantic Structure (2xN) | Match to reference | 0.70 | 0.35 | -0.35 |
Main Takeaways
- Advanced LLMs (GPT-4, Claude 3) can match human performance on complex analogy tasks involving semantic re-representation in standard conditions.
- LLMs are highly sensitive to presentation order (Permuted Pairs) and irrelevant semantic information (Randoms), whereas humans are robust to these factors.
- In conditions where semantic structure is misleading (Randoms), humans successfully switch to a symbol-only strategy, while LLMs fail to disengage from the semantic content, indicating a lack of strategic flexibility.
- While LLMs provide a 'how-possibly' explanation for analogical behavior (emerging from statistical learning), the mechanistic divergence in stress tests suggests they do not provide a 'how-actually' explanation of human cognition.