
Large Language Models' Reasoning Stalls: An Investigation into the Capabilities of Frontier Models

Lachlan McGinness, Peter Baumgartner
School of Computer Science, Australian National University; Data61, CSIRO
arXiv (2025)
Reasoning, Benchmark, Factuality

📝 Paper Summary

LLM Logical Reasoning, Automated Theorem Proving (ATP) Strategies, Longitudinal Evaluation
A longitudinal study reveals that frontier LLM reasoning capabilities have stalled between late 2023 and mid-2024, with apparent improvements driven by system prompts and formatting rather than genuine deductive logic gains.
Core Problem
Benchmarks for LLM reasoning are often contaminated or focus solely on answer accuracy, failing to distinguish between genuine logical deduction and rote memorization or pattern matching.
Why it matters:
  • Current leaderboards incentivize 'bolded columns' (narrow SOTA wins) without reporting uncertainty, creating a false narrative of rapid reasoning progress
  • Accurate answers do not guarantee sound reasoning; models may guess correctly or rely on training data recall rather than logic
  • Understanding whether LLMs can faithfully execute Automated Theorem Proving (ATP) strategies is crucial for deploying them in high-reliability domains like law or healthcare
Concrete Example: In a 'False Ontology' steamroller problem (e.g., one whose rules state that 'cats are reptiles'), an LLM may answer the final query correctly ('True') via hidden prompts or heuristics while failing to generate the valid intermediate derivation steps required to prove it, or while skipping steps entirely.
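The distinction above can be made concrete with a minimal sketch (illustrative only; this is not the paper's actual PRONTOQA instance or code): a false-ontology rule base where the taxonomy is deliberately wrong, so the query cannot be answered from memorized world knowledge and must be derived step by step.

```python
# Illustrative false-ontology rule base (hypothetical): each key
# "is a" kind of its value, and the taxonomy is intentionally wrong,
# so correct answers require derivation rather than recall.
RULES = {
    "cat": "reptile",          # false ontology: cats are reptiles
    "reptile": "cold-blooded",
    "cold-blooded": "scaly",
}

def bottom_up_derivation(start: str, query: str) -> list[str]:
    """Forward-chain (Bottom Up) from a known fact toward the query,
    recording every intermediate step a faithful proof must contain."""
    steps, current = [], start
    while current in RULES:
        nxt = RULES[current]
        steps.append(f"{current} -> {nxt}")
        if nxt == query:
            return steps
        current = nxt
    return []  # query not derivable from the rules

proof = bottom_up_derivation("cat", "scaly")
print(proof)
# ['cat -> reptile', 'reptile -> cold-blooded', 'cold-blooded -> scaly']
```

A model that merely outputs 'True' without producing all three links of this chain has answered correctly without demonstrating the derivation, which is exactly the failure mode the paper targets.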
Key Novelty
Longitudinal ATP-Strategy Evaluation on PRONTOQA
  • Evaluates models not just on accuracy, but on their ability to adopt specific Automated Theorem Proving (ATP) strategies: Bottom Up, Top Down, and Magic Set Transformation
  • Uses a longitudinal approach (comparing Dec 2023 vs Aug 2024 models) on a dynamic benchmark (PRONTOQA) to avoid dataset contamination
  • Verifies reasoning faithfulness with a spaCy-based semantic triple parser, ensuring the generated proof steps strictly follow the requested logical path
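The faithfulness check described above can be sketched as follows. The paper uses a spaCy-based semantic triple parser; the regex-based `to_triple` stand-in below (a hypothetical simplification, handling only simple copula sentences) illustrates the idea of parsing proof steps into triples and comparing them against the expected chain.

```python
import re

def to_triple(sentence):
    """Naive stand-in for the paper's spaCy-based semantic triple parser:
    extract (subject, relation, object) from simple copula sentences
    such as "Every cat is a reptile."."""
    m = re.fullmatch(
        r"(?:every |all )?(\w+?)s? (?:is|are) (?:a |an )?([\w-]+?)s?\.?",
        sentence.strip().lower(),
    )
    return (m.group(1), "is_a", m.group(2)) if m else None

def steps_follow_bottom_up(model_steps, expected_chain):
    """Faithfulness check: the model's proof steps must reproduce the
    expected forward chain exactly, in order, with no skipped links."""
    parsed = [to_triple(s) for s in model_steps]
    return parsed == expected_chain

steps = ["Every cat is a reptile.", "All reptiles are cold-blooded."]
expected = [("cat", "is_a", "reptile"), ("reptile", "is_a", "cold-blooded")]
print(steps_follow_bottom_up(steps, expected))  # True
```

Under this kind of check, a correct final answer with out-of-order or missing steps scores as unfaithful, which is what separates strategy adherence from mere answer accuracy.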
Evaluation Highlights
  • Progress in reasoning ability stalled over the nine-month period (Dec 2023–Aug 2024), with gains largely attributed to hidden system prompts
  • Frontier models perform best with the Bottom Up (forward-chaining) strategy, compared to Top Down or Magic Set approaches
  • A positive correlation exists between an LLM's ability to generate correct reasoning steps and its likelihood of reaching the correct final conclusion
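The correlation in the last bullet can be computed per problem as a Pearson coefficient between step accuracy and final-answer correctness. A minimal sketch, using entirely synthetic toy scores (not the paper's data):

```python
def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical per-problem results: fraction of correct reasoning
# steps vs. whether the final answer was correct (1/0).
step_accuracy = [0.9, 0.8, 0.3, 0.95, 0.2, 0.6]
answer_correct = [1, 1, 0, 1, 0, 1]
r = pearson(step_accuracy, answer_correct)
print(round(r, 3))  # strongly positive on this toy sample
```

A positive r on real per-problem data would support the claim that sound intermediate reasoning and correct final answers go together, though it cannot by itself rule out correct guesses.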
Breakthrough Assessment
4/10
The paper is a critical evaluation/negative result paper rather than a method proposal. It provides valuable insight into the stagnation of inherent reasoning capabilities, challenging the hype of constant progress.