
Sparks of Artificial General Intelligence: Early experiments with GPT-4

Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, Harsha Nori, Hamid Palangi, Marco Tulio Ribeiro, Yi Zhang
Microsoft Research
arXiv
Tags: Reasoning · Agent · MM · Benchmark · Factuality · Pretraining

📝 Paper Summary

Keywords: Large Language Model capabilities · Artificial General Intelligence (AGI) · Model evaluation
An early non-multimodal version of GPT-4 exhibits broad, human-level capabilities across mathematics, coding, and reasoning, suggesting it is a significant step toward artificial general intelligence.
Core Problem
Traditional benchmarks are insufficient for measuring the intelligence of LLMs trained on vast, unknown data because they cannot distinguish memorization from true generalizable reasoning.
Why it matters:
  • Standard metrics fail to capture the breadth of capabilities in models like GPT-4, which span disciplines from law to art
  • Understanding the emergent behaviors and limitations of these models is crucial for anticipating societal impacts and directing future AI research
  • Existing narrow AI systems cannot integrate skills across domains (e.g., combining poetry with mathematical proofs)
Concrete Example: When asked to write a proof of the infinitude of primes in the form of a poem, ChatGPT produces only a rudimentary attempt, whereas GPT-4 generates a structurally sound, stylistically consistent poem that accurately carries the mathematical argument.
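For context, the argument GPT-4 is asked to versify is presumably Euclid's classical proof of the infinitude of primes; a standard LaTeX sketch of that argument (not taken from the paper) is:

```latex
% Euclid's proof of the infinitude of primes -- a standard sketch;
% the poem in the paper may phrase the argument differently.
\begin{proof}
Suppose there are only finitely many primes $p_1, p_2, \dots, p_n$.
Let $N = p_1 p_2 \cdots p_n + 1$. Since $N > 1$, some prime $p$ divides $N$.
But each $p_i$ leaves remainder $1$ when dividing $N$, so $p_i \nmid N$ for
every $i$. Hence $p \notin \{p_1, \dots, p_n\}$, contradicting the assumption
that the list was complete. Therefore there are infinitely many primes.
\end{proof}
```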
Key Novelty
Psychology-inspired qualitative testing
  • Evaluates the model using novel, complex tasks generated by human curiosity rather than static datasets, preventing simple memorization
  • Probes the model's understanding by varying constraints and asking it to combine disparate concepts (e.g., drawing a unicorn using TikZ code)
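To illustrate what the TikZ unicorn task demands, here is a minimal, hypothetical sketch of the kind of compositional drawing involved (body, head, horn, legs as geometric primitives); this is not the paper's figure, only an indication of the task format:

```latex
\documentclass[tikz]{standalone}
\begin{document}
\begin{tikzpicture}
  % body: an ellipse
  \draw (0,0) ellipse (1.5 and 0.8);
  % head: a circle offset to the upper right
  \draw (1.8,1.0) circle (0.45);
  % horn: a thin triangle atop the head
  \draw (2.0,1.4) -- (2.4,2.2) -- (2.15,1.35) -- cycle;
  % legs: four vertical strokes under the body
  \foreach \x in {-0.9,-0.4,0.4,0.9}
    \draw (\x,-0.7) -- (\x,-1.6);
\end{tikzpicture}
\end{document}
```

The paper's point is that producing and then *editing* such a drawing under new constraints (e.g., "flip the unicorn" or "remove the horn") probes spatial understanding rather than memorization.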
Evaluation Highlights
  • Passes mock LeetCode software-engineering interviews, solving all questions in about 10 minutes (outperforming >93% of human users)
  • Achieves ~80% accuracy on US Medical Licensing Exam (Steps 1, 2, and 3) preliminary tests
  • Scores above 70% on a preliminary test of the Multistate Bar Exam
Breakthrough Assessment
9/10
While qualitative, the paper documents a massive leap in capability across diverse domains (coding, math, law) compared to previous models, effectively redefining the baseline for general intelligence.