
Large Language Models' Reasoning Stalls: An Investigation into the Capabilities of Frontier Models

Lachlan McGinness, Peter Baumgartner
School of Computer Science, Australian National University; Data61, CSIRO
arXiv (2025)
Reasoning, Benchmark, Factuality

📝 Paper Summary

LLM Logical Reasoning, Automated Theorem Proving (ATP) Strategies, Longitudinal Evaluation
A longitudinal study reveals that frontier LLM reasoning capabilities have stalled between late 2023 and mid-2024, with apparent improvements driven by system prompts and formatting rather than genuine deductive logic gains.
Core Problem
Benchmarks for LLM reasoning are often contaminated or focus solely on answer accuracy, failing to distinguish between genuine logical deduction and rote memorization or pattern matching.
Why it matters:
  • Current leaderboards incentivize 'bolded columns' (narrow SOTA wins) without reporting uncertainty, creating a false narrative of rapid reasoning progress
  • Accurate answers do not guarantee sound reasoning; models may guess correctly or rely on training data recall rather than logic
  • Understanding whether LLMs can faithfully execute Automated Theorem Proving (ATP) strategies is crucial for deploying them in high-reliability domains like law or healthcare
Concrete Example: In a 'False Ontology' steamroller problem (e.g., one whose rules state that 'cats are reptiles'), an LLM may answer the final query correctly ('True') via hidden prompts or heuristics while failing to generate the valid intermediate derivation steps required to prove it, or while skipping steps entirely.
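The distinction above can be made concrete with a minimal sketch (illustrative only; this is not the paper's actual PRONTOQA instance or code): a false-ontology rule base where the taxonomy is deliberately wrong, so the query cannot be answered from memorized world knowledge and must be derived step by step.

```python
# Illustrative false-ontology rule base (hypothetical): each key
# "is a" kind of its value, and the taxonomy is intentionally wrong,
# so correct answers require derivation rather than recall.
RULES = {
    "cat": "reptile",          # false ontology: cats are reptiles
    "reptile": "cold-blooded",
    "cold-blooded": "scaly",
}

def bottom_up_derivation(start: str, query: str) -> list[str]:
    """Forward-chain (Bottom Up) from a known fact toward the query,
    recording every intermediate step a faithful proof must contain."""
    steps, current = [], start
    while current in RULES:
        nxt = RULES[current]
        steps.append(f"{current} -> {nxt}")
        if nxt == query:
            return steps
        current = nxt
    return []  # query not derivable from the rules

proof = bottom_up_derivation("cat", "scaly")
print(proof)
# ['cat -> reptile', 'reptile -> cold-blooded', 'cold-blooded -> scaly']
```

A model that merely outputs 'True' without producing all three links of this chain has answered correctly without demonstrating the derivation, which is exactly the failure mode the paper targets.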
Key Novelty
Longitudinal ATP-Strategy Evaluation on PRONTOQA
  • Evaluates models not just on accuracy, but on their ability to adopt specific Automated Theorem Proving (ATP) strategies: Bottom Up, Top Down, and Magic Set Transformation
  • Uses a longitudinal approach (comparing Dec 2023 vs Aug 2024 models) on a dynamic benchmark (PRONTOQA) to avoid dataset contamination
  • Verifies reasoning faithfulness with a spaCy-based semantic triple parser, ensuring the generated proof steps strictly follow the requested logical path
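The faithfulness check described above can be sketched as follows. The paper uses a spaCy-based semantic triple parser; the regex-based `to_triple` stand-in below (a hypothetical simplification, handling only simple copula sentences) illustrates the idea of parsing proof steps into triples and comparing them against the expected chain.

```python
import re

def to_triple(sentence):
    """Naive stand-in for the paper's spaCy-based semantic triple parser:
    extract (subject, relation, object) from simple copula sentences
    such as "Every cat is a reptile."."""
    m = re.fullmatch(
        r"(?:every |all )?(\w+?)s? (?:is|are) (?:a |an )?([\w-]+?)s?\.?",
        sentence.strip().lower(),
    )
    return (m.group(1), "is_a", m.group(2)) if m else None

def steps_follow_bottom_up(model_steps, expected_chain):
    """Faithfulness check: the model's proof steps must reproduce the
    expected forward chain exactly, in order, with no skipped links."""
    parsed = [to_triple(s) for s in model_steps]
    return parsed == expected_chain

steps = ["Every cat is a reptile.", "All reptiles are cold-blooded."]
expected = [("cat", "is_a", "reptile"), ("reptile", "is_a", "cold-blooded")]
print(steps_follow_bottom_up(steps, expected))  # True
```

Under this kind of check, a correct final answer with out-of-order or missing steps scores as unfaithful, which is what separates strategy adherence from mere answer accuracy.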
Evaluation Highlights
  • Progress in reasoning ability stalled over the nine-month period (Dec 2023–Aug 2024), with gains largely attributed to hidden system prompts
  • Frontier models perform best with the Bottom Up (forward-chaining) strategy, compared to Top Down or Magic Set approaches
  • A positive correlation exists between an LLM's ability to generate correct reasoning steps and its likelihood of reaching the correct final conclusion
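The correlation in the last bullet can be computed per problem as a Pearson coefficient between step accuracy and final-answer correctness. A minimal sketch, using entirely synthetic toy scores (not the paper's data):

```python
def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical per-problem results: fraction of correct reasoning
# steps vs. whether the final answer was correct (1/0).
step_accuracy = [0.9, 0.8, 0.3, 0.95, 0.2, 0.6]
answer_correct = [1, 1, 0, 1, 0, 1]
r = pearson(step_accuracy, answer_correct)
print(round(r, 3))  # strongly positive on this toy sample
```

A positive r on real per-problem data would support the claim that sound intermediate reasoning and correct final answers go together, though it cannot by itself rule out correct guesses.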
Breakthrough Assessment
4/10
The paper is a critical evaluation/negative result paper rather than a method proposal. It provides valuable insight into the stagnation of inherent reasoning capabilities, challenging the hype of constant progress.