Can Large Reasoning Models do Analogical Reasoning under Perceptual Uncertainty?

📝 Paper Summary

Analogical Reasoning, Neuro-Symbolic AI, Large Reasoning Models (LRMs)

Large Reasoning Models like o3-mini fail significantly at analogical reasoning when visual inputs contain noise, whereas neuro-symbolic models with entropy-based regularization remain robust.

Core Problem

Current evaluations of Large Reasoning Models (LRMs) on analogical reasoning assume 'oracle perception' (perfect symbolic inputs), ignoring the noise and uncertainty inherent in real-world visual perception.

Why it matters:

LRMs are being deployed for complex reasoning, but their robustness to noisy, real-world data remains untested, and they may prove brittle
Assuming perfect perception bypasses the critical challenge of filtering irrelevant attributes (confounders) and handling uncertain values
Existing benchmarks like I-RAVEN are too clean, leading to inflated performance estimates that do not reflect true generalization capabilities

Concrete Example: In a Raven's Progressive Matrix puzzle, an LRM might easily solve the logic if given perfect symbols (e.g., 'Shape: Triangle'). However, if the input includes irrelevant background patterns (confounders) or the shape classifier outputs a probability distribution (e.g., 0.6 Triangle, 0.4 Square) instead of a certainty, the LRM's reasoning process collapses.

Key Novelty

Benchmarking under Simulated Perceptual Uncertainty & Entropy-Regularized Abduction

Extends the I-RAVEN-X benchmark by injecting confounding attributes (irrelevant visual noise) and smoothing attribute values into probability distributions to simulate imperfect perception
Introduces an entropy-based regularizer for neuro-symbolic models that weights attribute contributions by their confidence, allowing the model to ignore uncertain or noisy features during reasoning
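The confounder injection described above can be sketched as follows. This is a minimal illustration, not the benchmark's generation code: the function name `add_confounders`, the attribute names, and the uniform sampling of confounder values are all assumptions made for the example.

```python
import random

def add_confounders(panel, num_confounders, value_range, rng):
    """Return a copy of `panel` with extra attributes that follow no rule.

    Hypothetical helper: confounder values are drawn uniformly at random,
    so they carry no information about the underlying puzzle logic.
    """
    noisy = dict(panel)  # leave the original panel untouched
    for i in range(num_confounders):
        noisy[f"confounder_{i}"] = rng.randrange(value_range)
    return noisy

rng = random.Random(0)  # seeded for repeatability
panel = {"type": 3, "size": 1, "color": 7}  # illustrative attribute names
noisy_panel = add_confounders(panel, num_confounders=2, value_range=10, rng=rng)
```

The rule-governed attributes are preserved unchanged; a reasoner must work out on its own that the added keys are irrelevant.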

Architecture

Comparison of the standard I-RAVEN task vs. the proposed I-RAVEN-X with perceptual uncertainty

Evaluation Highlights

OpenAI o3-mini accuracy plummets from 86.6% on standard I-RAVEN to 17.0% on the noisy I-RAVEN-X extension, approaching random chance
DeepSeek R1 shows a similar decline, dropping from 80.6% to 23.2% accuracy under perceptual uncertainty
The proposed neuro-symbolic model (ARLC) maintains strong robustness, dropping only from 98.6% to 88.0% accuracy on the same difficult benchmark

Breakthrough Assessment

8/10

Reveals a critical brittleness in state-of-the-art LRMs (o3-mini, R1) regarding noisy inputs, countering the hype of their reasoning dominance, while demonstrating a viable neuro-symbolic solution.

⚙️ Technical Details

Problem Definition

Setting: Abstract visual reasoning on Raven's Progressive Matrices (RPM) under perceptual uncertainty

Inputs: A 3x3 matrix of panels with one missing panel, where panel attributes are noisy/probabilistic rather than deterministic

Outputs: Selection of the correct missing panel from a candidate set (discriminative setting)

Pipeline Flow

Perception (Front-end)
Reasoning (Abduction / LRM CoT)
Selection (Discriminator)
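As an illustration of the final selection stage, the sketch below scores each candidate panel against a predicted distribution per attribute and returns the best match. The scoring rule (sum of log-probabilities), the function name `select_candidate`, and the data layout are assumptions chosen for clarity; the summary does not specify how the discriminator is implemented.

```python
import math

def select_candidate(predicted, candidates):
    """Pick the index of the candidate most consistent with the prediction.

    predicted:  dict mapping attribute name -> list of probabilities (a PMF)
    candidates: list of dicts mapping attribute name -> integer value
    """
    def log_score(candidate):
        # Clamp probabilities away from zero so the log is always defined.
        return sum(
            math.log(max(pmf[candidate[attr]], 1e-12))
            for attr, pmf in predicted.items()
        )
    return max(range(len(candidates)), key=lambda i: log_score(candidates[i]))

predicted = {"type": [0.1, 0.8, 0.1], "size": [0.7, 0.2, 0.1]}
candidates = [
    {"type": 0, "size": 0},
    {"type": 1, "size": 0},
    {"type": 1, "size": 2},
]
best = select_candidate(predicted, candidates)
```

Only attributes present in `predicted` contribute to the score, so any extra confounder keys on a candidate are ignored in this sketch.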

System Modules

Perception Simulator

Generates symbolic attributes with injected noise

Model or implementation: Procedural generation

Reasoning Core (NeSy variant) (Reasoning)

Infers the underlying rule governing the matrix

Model or implementation: ARLC (Abductive Reasoning with Learned Combinations)

Reasoning Core (LRM variant) (Reasoning)

Infers the solution via Chain-of-Thought

Model or implementation: o3-mini or DeepSeek R1

Novel Architectural Elements

Entropy-based confidence regularization for the abductive reasoning loss function, dynamically re-weighting attribute contributions based on rule certainty

Modeling

Base Model: OpenAI o3-mini and DeepSeek R1 (LRMs); ARLC (Neuro-Symbolic model)

Training Method: Supervised learning with entropy regularization (for ARLC)

Objective Functions:

Purpose: Regularize attribute contribution based on rule confidence entropy.

Formally: Weight w = 1 - H(s), where H(s) is the entropy of the rule confidence vector s.
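A small numerical sketch of this weighting is given below. It assumes Shannon entropy normalized by the log of the vector length, so that the weight stays in [0, 1]; the summary does not state which normalization the paper uses, and the function names are hypothetical.

```python
import math

def normalized_entropy(s):
    """Shannon entropy of a probability vector, scaled to [0, 1].

    Assumption: entropy is divided by log(len(s)); the source only gives
    w = 1 - H(s) without specifying the normalization.
    """
    n = len(s)
    if n < 2:
        return 0.0
    h = -sum(p * math.log(p) for p in s if p > 0)
    return h / math.log(n)

def confidence_weight(s):
    """Weight w = 1 - H(s): near 1 for peaked vectors, 0 for uniform ones."""
    return 1.0 - normalized_entropy(s)

confident = [0.97, 0.01, 0.01, 0.01]  # one rule clearly dominates
uncertain = [0.25, 0.25, 0.25, 0.25]  # no rule stands out (e.g. a confounder)
w_confident = confidence_weight(confident)
w_uncertain = confidence_weight(uncertain)
```

Under this assumption, an attribute whose rule confidence is uniform contributes almost nothing, which matches the stated purpose of down-weighting uncertain or noisy features.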

Trainable Parameters: Rule representations in ARLC

Training Data:

I-RAVEN-X dataset augmented with confounding attributes and smoothed attribute values

Key Hyperparameters:

p_L: Lower bound for probability of true value in 3-bins smoothing (0.5)
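One way to picture 3-bins smoothing with a lower bound p_L is sketched below. The details are assumptions rather than the paper's procedure: the true value's probability is drawn uniformly from [p_L, 1], and the remaining mass is split evenly across the adjacent values that exist. The function name `smooth_three_bins` is hypothetical.

```python
import random

def smooth_three_bins(true_value, num_values, p_l, rng):
    """Turn a hard attribute value into a PMF over `num_values` bins.

    Assumed scheme: the true bin gets p ~ U(p_l, 1); the rest of the mass
    is shared equally by the neighbouring bins that are in range.
    """
    p_true = rng.uniform(p_l, 1.0)
    pmf = [0.0] * num_values
    neighbors = [v for v in (true_value - 1, true_value + 1) if 0 <= v < num_values]
    if not neighbors:  # single-valued attribute: nothing to smooth over
        pmf[true_value] = 1.0
        return pmf
    pmf[true_value] = p_true
    for v in neighbors:
        pmf[v] = (1.0 - p_true) / len(neighbors)
    return pmf

rng = random.Random(0)  # seeded for repeatability
pmf = smooth_three_bins(true_value=4, num_values=10, p_l=0.5, rng=rng)
```

With p_L = 0.5 the true value always keeps at least half of the probability mass, so lowering p_L would make the simulated perception less certain.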

Compute: o3-mini used approximately 3.4x more reasoning tokens on difficult tasks (evaluation only). Training compute for ARLC is not reported.

Comparison to Prior Work

vs. GPT-4/Llama-3: Evaluates new 'Large Reasoning Models' (o3-mini, R1) specifically designed for reasoning
vs. NVSA/PrAE: Introduces entropy-based regularization to explicitly handle high-uncertainty confounders in the abductive process

Limitations

Evaluation of LRMs uses text-based prompts (symbolic input), not raw pixels, so it simulates visual uncertainty rather than testing actual vision-language integration
Only evaluates one specific type of reasoning puzzle (Raven's Progressive Matrices)
LRMs were evaluated with entangled prompts, which yield lower accuracy than disentangled ones; this choice was necessary to make the confounders non-trivial

Reproducibility

Code: https://github.com/IBM/raven-large-language-models

The repository includes the dataset generation code for I-RAVEN-X with confounders. The evaluated LRMs comprise a closed-source model (o3-mini) and an open-weights model (DeepSeek R1).

📊 Experiments & Results

Evaluation Setup

Discriminative analogical reasoning on abstract visual matrices

Benchmarks:

I-RAVEN: abstract visual reasoning (standard)
I-RAVEN-X: abstract visual reasoning (OOD generalization + uncertainty) [New]

Metrics:

Accuracy (%)
Statistical methodology: Not explicitly reported in the paper

Key Results

Performance drops sharply for LRMs when moving from the standard I-RAVEN benchmark to the noisy, out-of-distribution I-RAVEN-X; the neuro-symbolic ARLC degrades far less.
Model	I-RAVEN Accuracy (%)	I-RAVEN-X Accuracy (%)	Δ
o3-mini	86.6	17.0	-69.6
DeepSeek R1	80.6	23.2	-57.4
ARLC	98.6	88.0	-10.6
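The accuracy drops can be recomputed directly from the reported figures. The snippet below is only a convenience check; the accuracy values come from the summary above, while the variable names are illustrative.

```python
# (I-RAVEN accuracy, I-RAVEN-X accuracy) in percent, as reported above
results = {
    "o3-mini": (86.6, 17.0),
    "DeepSeek R1": (80.6, 23.2),
    "ARLC": (98.6, 88.0),
}

# Absolute change in percentage points when moving to I-RAVEN-X
drops = {
    model: round(noisy - clean, 1)
    for model, (clean, noisy) in results.items()
}

for model, delta in drops.items():
    print(f"{model}: {delta:+.1f} points")
```

In absolute terms, the ARLC drop is roughly one sixth of the o3-mini drop.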

Main Takeaways

Large Reasoning Models (LRMs) like o3-mini and DeepSeek R1 are highly brittle to perceptual uncertainty; their performance collapses to near random chance when inputs are noisy.
The massive drop in LRM accuracy occurs despite o3-mini spending ~3.4x more reasoning tokens, indicating that more test-time compute does not automatically solve perceptual ambiguity.
Neuro-symbolic probabilistic models (ARLC) equipped with entropy-based regularization can robustly filter out confounding attributes, maintaining high accuracy even in low Signal-to-Noise Ratio (SNR) conditions.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Raven's Progressive Matrices (RPM)
Familiarity with Chain-of-Thought (CoT) reasoning in LLMs
Basic concepts of probabilistic abductive reasoning

Key Terms

LRM: Large Reasoning Model—a class of LLMs optimized for reasoning via test-time compute scaling (e.g., o1, o3-mini, R1)

RPM: Raven's Progressive Matrices—a nonverbal IQ test involving completing a pattern in a 3x3 grid of images

I-RAVEN-X: An extension of the I-RAVEN dataset that tests generalization to longer rules and larger attribute ranges

Confounding attributes: Randomly sampled visual properties (e.g., background textures) included in the input that are irrelevant to the underlying logic rule

Oracle perception: The unrealistic assumption that a reasoning model has access to perfect, noise-free symbolic descriptions of visual inputs

Abductive reasoning: A logical inference method that seeks the simplest explanation (rule) for a set of observations

NeSy: Neuro-Symbolic—AI systems combining neural networks (for perception/learning) with symbolic logic (for reasoning)

PMF: Probability Mass Function—a distribution representing the probability of a discrete random variable taking specific values

SNR: Signal-to-Noise Ratio—ratio of useful information (reasoning attributes) to irrelevant data (confounders)