
Sparks of Artificial General Intelligence: Early experiments with GPT-4

Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, Harsha Nori, Hamid Palangi, Marco Tulio Ribeiro, Yi Zhang
Microsoft Research
arXiv
Tags: Reasoning · Agent · MM · Benchmark · Factuality · Pretraining

📝 Paper Summary

Keywords: Large Language Model capabilities · Artificial General Intelligence (AGI) · Model evaluation
An early non-multimodal version of GPT-4 exhibits broad, human-level capabilities across mathematics, coding, and reasoning, suggesting it is a significant step toward artificial general intelligence.
Core Problem
Traditional benchmarks are insufficient for measuring the intelligence of LLMs trained on vast, unknown data because they cannot distinguish memorization from true generalizable reasoning.
Why it matters:
  • Standard metrics fail to capture the breadth of capabilities in models like GPT-4, which span disciplines from law to art
  • Understanding the emergent behaviors and limitations of these models is crucial for anticipating societal impacts and directing future AI research
  • Existing narrow AI systems cannot integrate skills across domains (e.g., combining poetry with mathematical proofs)
Concrete Example: When asked to write a proof of the infinitude of primes in the form of a poem, ChatGPT produces only a rudimentary attempt, whereas GPT-4 generates a structurally sound, stylistically consistent poem that accurately carries the mathematical argument.
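For context, the argument GPT-4 is asked to versify is presumably Euclid's classical proof of the infinitude of primes; a standard LaTeX sketch of that argument (not taken from the paper) is:

```latex
% Euclid's proof of the infinitude of primes -- a standard sketch;
% the poem in the paper may phrase the argument differently.
\begin{proof}
Suppose there are only finitely many primes $p_1, p_2, \dots, p_n$.
Let $N = p_1 p_2 \cdots p_n + 1$. Since $N > 1$, some prime $p$ divides $N$.
But each $p_i$ leaves remainder $1$ when dividing $N$, so $p_i \nmid N$ for
every $i$. Hence $p \notin \{p_1, \dots, p_n\}$, contradicting the assumption
that the list was complete. Therefore there are infinitely many primes.
\end{proof}
```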
Key Novelty
Psychology-inspired qualitative testing
  • Evaluates the model using novel, complex tasks generated by human curiosity rather than static datasets, preventing simple memorization
  • Probes the model's understanding by varying constraints and asking it to combine disparate concepts (e.g., drawing a unicorn using TikZ code)
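To illustrate what the TikZ unicorn task demands, here is a minimal, hypothetical sketch of the kind of compositional drawing involved (body, head, horn, legs as geometric primitives); this is not the paper's figure, only an indication of the task format:

```latex
\documentclass[tikz]{standalone}
\begin{document}
\begin{tikzpicture}
  % body: an ellipse
  \draw (0,0) ellipse (1.5 and 0.8);
  % head: a circle offset to the upper right
  \draw (1.8,1.0) circle (0.45);
  % horn: a thin triangle atop the head
  \draw (2.0,1.4) -- (2.4,2.2) -- (2.15,1.35) -- cycle;
  % legs: four vertical strokes under the body
  \foreach \x in {-0.9,-0.4,0.4,0.9}
    \draw (\x,-0.7) -- (\x,-1.6);
\end{tikzpicture}
\end{document}
```

The paper's point is that producing and then *editing* such a drawing under new constraints (e.g., "flip the unicorn" or "remove the horn") probes spatial understanding rather than memorization.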
Evaluation Highlights
  • Passes mock LeetCode software-engineering interviews, solving all questions in about 10 minutes (outperforming >93% of human users)
  • Achieves ~80% accuracy on US Medical Licensing Exam (Steps 1, 2, and 3) preliminary tests
  • Scores above 70% on a preliminary test of the Multistate Bar Exam
Breakthrough Assessment
9/10
While qualitative, the paper documents a massive leap in capability across diverse domains (coding, math, law) compared to previous models, effectively redefining the baseline for general intelligence.