Evaluation Setup
Zero-shot completion of four-question quizzes, with human subjects compared against LLMs.
Benchmarks:
- Semantic Structure Task (Mapping semantic relations to symbol patterns) [New]
- Semantic Content Task (Mapping semantic attributes, categorical or numeric, to symbols) [New]
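To make the task format concrete, here is a minimal sketch of what a Semantic Structure item and its rule-based scoring might look like. The specific words, symbols, and field names below are invented for illustration and are not taken from the benchmark itself:

```python
# Hypothetical item in the spirit of the Semantic Structure Task:
# the left-hand side pairs words by a semantic relation (here, antonymy),
# the right-hand side encodes that relation as a symbol pattern, and the
# intended rule-based solution reapplies the same pattern to the probe.
item = {
    "examples": [("hot cold", "% #"), ("wet dry", "% #")],  # antonym pair -> two distinct symbols
    "probe": "tall short",
    "reference": "% #",  # intended rule-based solution
}

def score(response: str, item: dict) -> int:
    """Return 1 if the response matches the intended rule-based solution, else 0."""
    return int(response.strip() == item["reference"])
```

Averaging `score` over a subject's answers gives the match-to-reference proportion reported below.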
Metrics:
- Match to reference (Proportion of answers matching the intended rule-based solution)
- Statistical methodology: Logistic regression with interaction terms (Subject Type × Condition). Significance testing using likelihood ratio tests.
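The statistical methodology can be sketched as follows. This is a self-contained illustration on simulated data with hypothetical effect sizes (not the paper's data): a full logistic model with a Subject Type × Condition interaction is compared against a reduced main-effects model via a likelihood ratio test, with the Bernoulli log-likelihood written out directly:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import chi2

def neg_loglik(beta, X, y):
    # Negative Bernoulli log-likelihood with a logit link,
    # using logaddexp for numerical stability.
    z = X @ beta
    return np.sum(np.logaddexp(0.0, z) - y * z)

def max_loglik(X, y):
    # Fit the logistic regression by direct MLE; return the maximized log-likelihood.
    res = minimize(neg_loglik, np.zeros(X.shape[1]), args=(X, y), method="BFGS")
    return -res.fun

rng = np.random.default_rng(0)
n = 400
llm = rng.integers(0, 2, n)    # subject type: 0 = human, 1 = LLM
perm = rng.integers(0, 2, n)   # condition: 0 = Defaults, 1 = Permuted Pairs
# Hypothetical data-generating effects: LLM accuracy collapses only under permutation.
logit_p = 1.5 - 0.1 * llm - 0.2 * perm - 1.5 * llm * perm
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logit_p))).astype(float)

X_full = np.column_stack([np.ones(n), llm, perm, llm * perm])  # with interaction
X_red = X_full[:, :3]                                          # main effects only
lr = 2.0 * (max_loglik(X_full, y) - max_loglik(X_red, y))      # LR statistic
p = chi2.sf(lr, df=1)                                          # 1 extra parameter
```

A small p-value here indicates that the Subject Type × Condition interaction improves fit, i.e. the two subject types respond differently to the manipulation.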
Key Results
*Performance in the 'Defaults' condition shows advanced LLMs matching human abilities, but 'Permuted Pairs' reveals model fragility.*

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Semantic Structure (Defaults) | Match to reference | 0.85 | 0.85 | 0.00 |
| Semantic Structure (Permuted Pairs) | Match to reference | 0.85 | 0.55 | -0.30 |
| Semantic Structure (Permuted Pairs) | Match to reference | 0.85 | 0.40 | -0.45 |

*The 'Randoms' condition tests whether subjects can ignore misleading semantic text and solve using only the symbols (RHS). Humans adapt; models fail.*

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Semantic Structure (Randoms) | Match to reference | 0.80 | 0.30 | -0.50 |
| Semantic Structure (2xN) | Match to reference | 0.70 | 0.35 | -0.35 |
Main Takeaways
- Advanced LLMs (GPT-4, Claude 3) can match human performance on complex analogy tasks involving semantic re-representation in standard conditions.
- LLMs are highly sensitive to presentation order (Permuted Pairs) and irrelevant semantic information (Randoms), whereas humans are robust to these factors.
- In conditions where semantic structure is misleading (Randoms), humans successfully switch to a symbol-only strategy, while LLMs fail to disengage from the semantic content, indicating a lack of strategic flexibility.
- While LLMs provide a 'how-possibly' explanation for analogical behavior (emerging from statistical learning), the mechanistic divergence in stress tests suggests they do not provide a 'how-actually' explanation of human cognition.