Simulated Annealing Enhances Theory-of-Mind Reasoning in Autoregressive Language Models

📝 Paper Summary

Theory of Mind (ToM)
Test-time compute / inference-time optimization
Sampling methods for LLMs

Applying simulated annealing to the sequence-level distribution of small language models recovers strong Theory of Mind capabilities, particularly for false belief tasks, without any parameter updates.

Core Problem

Autoregressive models optimize local plausibility (next-token prediction) rather than global coherence, causing them to fail at Theory of Mind tasks that require maintaining consistent latent belief states.

Why it matters:

Models often contradict earlier commitments or implied states, failing to maintain a consistent 'world model'
False Belief tasks are a critical test of reasoning, and failure implies a lack of robust social intelligence or planning capability
Retraining or scaling up models is expensive; recovering capabilities from existing small models is more efficient

Concrete Example: In a False Belief task where a character (Carlos) doesn't see a valve open, a standard model might describe the physical event correctly but wrongly infer that Carlos knows about it, or hallucinate that the valve was already open to justify his action. The proposed method correctly infers that Carlos is unaware of the change and predicts his subsequent actions accordingly.

Key Novelty

Test-Time Simulated Annealing for Sequence Optimization

Treats text generation as a global optimization problem over the sequence-level distribution rather than greedy next-token prediction
Uses Markov Chain Monte Carlo (MCMC) with a cooling temperature schedule (annealing) to explore and then converge on globally coherent sequences
Demonstrates that 'distorting' the probability landscape (sharpening) reveals latent reasoning abilities hidden by standard sampling
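The gap between greedy next-token decoding and sequence-level optimization can be seen in a two-step toy model. The sketch below is illustrative only: the token names and the probabilities in FIRST and SECOND are made-up assumptions, not values from the paper.

```python
import math

# Hypothetical two-step toy model: P(first token) and P(second token | first token).
FIRST = {"A": 0.6, "B": 0.4}
SECOND = {"A": {"x": 0.55, "y": 0.45}, "B": {"z": 0.9, "w": 0.1}}


def sequence_logprob(seq):
    """Joint log-probability of a (first, second) token pair."""
    first, second = seq
    return math.log(FIRST[first]) + math.log(SECOND[first][second])


def greedy_decode():
    """Pick the locally most likely token at each step."""
    first = max(FIRST, key=FIRST.get)
    second = max(SECOND[first], key=SECOND[first].get)
    return (first, second)


def global_argmax():
    """Enumerate every sequence and pick the one with the highest joint probability."""
    candidates = [(f, s) for f in FIRST for s in SECOND[f]]
    return max(candidates, key=sequence_logprob)


greedy = greedy_decode()   # ("A", "x"), joint probability 0.6 * 0.55 = 0.33
best = global_argmax()     # ("B", "z"), joint probability 0.4 * 0.9 = 0.36
```

Greedy decoding commits to "A" because it is locally more likely, yet the most probable full sequence starts with "B". Exhaustive enumeration is only possible in a toy setting like this one, which is why a stochastic search such as MCMC is needed for real models.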

Architecture

Comparison of Standard MCMC Power Sampling vs. Simulated Annealing MCMC

Evaluation Highlights

Simulated annealing outperforms Chain-of-Thought and standard sampling on the BigToM benchmark, particularly on hard False Belief tasks
Qualitatively recovers 'inverse planning' reasoning: models explicitly test hypotheses (e.g., 'If he believed X, he wouldn't do Y...')
Small models (1.7B-3.8B parameters) achieve performance previously thought to require frontier-scale models

Breakthrough Assessment

7/10

Strong demonstration that 'reasoning failures' may be decoding failures. Unlocks significant capability in small models without training, though the computational cost of MCMC is high.

⚙️ Technical Details

Problem Definition

Setting: Generating a text sequence x that maximizes global coherence/plausibility under a distorted sequence-level distribution

Inputs: A narrative context (prefix) x_<t involving agents and belief states

Outputs: A completion x_>t (typically an answer to a belief query) that is globally consistent with the latent belief graph

Pipeline Flow

Input Prefix (Story Context)
Initialization (Generate initial sequence)
MCMC Sampling Loop (Propose changes → Accept/Reject based on Temperature)
Temperature Cooling (Annealing Schedule)
Final Sequence Output
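The pipeline stages above can be sketched as a short annealing loop. This is a minimal illustration under stated assumptions, not the paper's implementation: toy_logprob is a hypothetical stand-in for a language model's sequence log-likelihood, the single-token symmetric proposal is a simplification, and the geometric schedule and all hyperparameters (length, steps, t_start, t_end) are chosen only for the example.

```python
import math
import random

VOCAB = ["a", "b", "c"]


def toy_logprob(seq):
    """Hypothetical stand-in for an LM sequence log-likelihood.

    Rewards adjacent equal tokens as a crude proxy for global coherence.
    """
    return sum(1.0 if x == y else -1.0 for x, y in zip(seq, seq[1:]))


def anneal(length=8, steps=2000, t_start=2.0, t_end=0.05, seed=0):
    rng = random.Random(seed)
    # Initialization: generate an initial sequence (random here).
    current = [rng.choice(VOCAB) for _ in range(length)]
    current_lp = toy_logprob(current)
    best, best_lp = list(current), current_lp
    for step in range(steps):
        # Temperature cooling: geometric interpolation from t_start to t_end.
        temp = t_start * (t_end / t_start) ** (step / (steps - 1))
        # Propose a change: overwrite one random position (symmetric proposal).
        proposal = list(current)
        proposal[rng.randrange(length)] = rng.choice(VOCAB)
        proposal_lp = toy_logprob(proposal)
        # Accept/reject: Metropolis rule on the tempered target p(x)^(1/temp).
        delta = proposal_lp - current_lp
        if delta >= 0 or rng.random() < math.exp(delta / temp):
            current, current_lp = proposal, proposal_lp
            if current_lp > best_lp:
                best, best_lp = list(current), current_lp
    # Final sequence output: the best-scoring sequence visited.
    return best, best_lp


best_seq, best_score = anneal()
```

At high temperature the loop accepts many score-lowering moves (exploration); as the temperature falls, it accepts almost only improvements (exploitation). With a real model, each proposal would require scoring a full sequence with the language model, which is the source of the inference-time cost.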

System Modules

MCMC Sampler

Explore the space of possible text completions

Model or implementation: Small Autoregressive LM (Phi-3.5, LLaMA-3.2, or Qwen3)

Annealing Scheduler

Control exploration vs. exploitation
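As an illustration of how a scheduler can trade exploration for exploitation, the sketch below shows two common cooling schedules. The function names, the endpoint values, and the choice of linear versus geometric decay are assumptions for the example; this summary does not state which schedule the paper uses.

```python
def linear_schedule(step, total_steps, t_start, t_end):
    """Temperature falls by a constant amount per step."""
    frac = step / max(total_steps - 1, 1)
    return t_start + frac * (t_end - t_start)


def geometric_schedule(step, total_steps, t_start, t_end):
    """Temperature falls by a constant ratio per step."""
    frac = step / max(total_steps - 1, 1)
    return t_start * (t_end / t_start) ** frac


# Hypothetical settings: cool from T=2.0 to T=0.1 over 10 steps.
linear = [linear_schedule(s, 10, 2.0, 0.1) for s in range(10)]
geometric = [geometric_schedule(s, 10, 2.0, 0.1) for s in range(10)]
```

Both schedules start and end at the same temperatures, but the geometric one spends more of its steps at low temperature, where the sampler behaves more like an optimizer. Sampling at temperature T targets a distribution proportional to p(x)^(1/T), so lowering T is equivalent to raising the sharpening exponent.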

Novel Architectural Elements

Integration of simulated annealing temperature schedule directly into the MCMC sampling loop of an autoregressive model for test-time optimization

Modeling

Base Model: Phi-3.5-Mini-Instruct (3.8B), LLaMA-3.2-3B-Instruct, Qwen3-1.7B

Comparison to Prior Work

vs. CoT: Optimizes the sequence probability globally rather than relying on greedy next-token coherence; CoT often degrades False Belief performance in small models
vs. Power Sampling [14]: Uses a dynamic temperature schedule (annealing) to actively search for modes (optimization) rather than just sampling from a static sharpened distribution
vs. Tree-of-Thought [not cited in paper]: Annealing is a continuous stochastic search, whereas ToT is a discrete tree search

Limitations

Computationally expensive at inference time due to iterative MCMC resampling (significantly slower than single-pass decoding)
Failures persist in very complex cases (deeply nested beliefs or complex causal structures)
Scalability to longer sequences or larger models remains a challenge
Unclear if benefits generalize beyond Theory of Mind to domains like planning or math

Reproducibility

The method is described in detail (temperature schedule, MCMC steps, block size). The base models are open-weight. No code URL is provided in the text.

📊 Experiments & Results

Evaluation Setup

Text-based Theory of Mind evaluation using synthetic narratives

Benchmarks:

BigToM (Theory of Mind / Social Reasoning)

Metrics:

Accuracy (Binary choice on belief queries)
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline (%)	This Paper (%)	Δ (points)
BigToM (False Belief)	Accuracy	46.5	59.0	+12.5
BigToM (False Belief)	Accuracy	40.0	53.0	+13.0
BigToM (False Belief)	Accuracy	42.0	53.0	+11.0

Experiment Figures

Bar charts comparing accuracy across 3 models (Phi-3.5, LLaMA-3.2, Qwen3) and 5 decoding methods for True Belief vs. False Belief tasks.

Main Takeaways

Simulated annealing consistently improves performance on the harder False Belief (FB) tasks compared to direct decoding and CoT
Baselines such as CoT and fixed-temperature sampling often raise True Belief (TB) accuracy at the expense of FB accuracy, whereas annealing either improves both or keeps TB accuracy high while boosting FB
Qualitative analysis shows annealing produces 'inverse planning' reasoning chains (checking counterfactuals) that are absent in greedy decoding

📚 Prerequisite Knowledge

Prerequisites

Autoregressive language modeling (next-token prediction)
Markov Chain Monte Carlo (MCMC) methods, specifically Metropolis-Hastings
Simulated Annealing (optimization via cooling schedules)
Theory of Mind (True Belief vs. False Belief tasks)

Key Terms

ToM: Theory of Mind—the ability to impute mental states (beliefs, intents) to oneself and others

False Belief (FB): A scenario where an agent's belief differs from reality (e.g., they didn't see an object move)

True Belief (TB): A scenario where an agent's belief matches reality

MCMC: Markov Chain Monte Carlo—algorithms for sampling from a probability distribution by constructing a Markov chain that has the desired distribution as its equilibrium distribution

Simulated Annealing: An optimization technique that explores a search space at high 'temperature' (randomness) and gradually cools down to settle into an optimal solution

sequence-level distribution: The joint probability of an entire sequence of tokens, as opposed to the conditional probability of just the next token
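For reference, the chain rule of probability links the two quantities in this definition. The notation below (a sequence x_{1:N} of N tokens) is introduced here for illustration and may differ from the paper's.

```latex
p(x_{1:N}) = \prod_{t=1}^{N} p(x_t \mid x_{<t}),
\qquad
\log p(x_{1:N}) = \sum_{t=1}^{N} \log p(x_t \mid x_{<t})
```

An autoregressive model supplies each factor on the right-hand side, so the sequence-level score of any candidate completion can be computed by summing its per-token log-probabilities.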

power sampling: Sampling from a distribution raised to a power α > 1, which sharpens the peaks (makes likely sequences relatively more likely)
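A small worked example of the sharpening effect, assuming a made-up distribution over three candidate sequences (the names and probabilities are hypothetical):

```python
def power_distribution(probs, alpha):
    """Raise each probability to the power alpha and renormalize."""
    powered = {k: v ** alpha for k, v in probs.items()}
    total = sum(powered.values())
    return {k: v / total for k, v in powered.items()}


# Hypothetical base distribution over three candidate sequences.
base = {"seq_1": 0.5, "seq_2": 0.3, "seq_3": 0.2}
sharp = power_distribution(base, alpha=4.0)
# seq_1 rises from 0.5 to about 0.87, while seq_3 falls from 0.2 to about 0.02.
```

For a real language model the normalizing sum runs over every possible sequence and cannot be computed directly, which is why the sharpened distribution is approached with MCMC rather than explicit renormalization.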

CoT: Chain-of-Thought—prompting the model to generate intermediate reasoning steps before the final answer