MuSR: Testing the Limits of Chain-of-thought with Multistep Soft Reasoning

📝 Paper Summary

Reasoning Benchmarks, Neurosymbolic AI

MuSR is a challenging dataset of natural language narratives generated via a neurosymbolic process to evaluate large language models on multistep reasoning that integrates commonsense.

Core Problem

Current reasoning benchmarks are either too simple, lack natural text narratives, do not integrate commonsense with multistep logic, or are easily solvable by rule-based systems without true understanding.

Why it matters:

Evaluating LLM reasoning is difficult because model capabilities outpace static benchmarks
Existing datasets like CLUTRR or RuleTakers are solvable by rule-based systems and lack natural language nuance
Benchmarks involving commonsense (e.g., SocialIQA) often lack multistep reasoning complexity

Concrete Example: In a generated murder mystery, an LLM might successfully identify a suspect based on explicit text but fail to deduce 'motive' when it requires combining a social norm (e.g., getting fired causes anger) with a narrative fact (e.g., the suspect was fired).

Key Novelty

Neurosymbolic Synthetic-to-Natural Generation

Constructs reasoning instances by first generating a logical 'reasoning tree' of gold facts and commonsense inferences, then using an LLM to generate a natural narrative based on that structure
Ensures datasets have complex ground-truth intermediate structures (unlike pure synthetic text) but remain grounded in naturalistic narrative (unlike pure logic puzzles)
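The reasoning-tree idea above can be illustrated with a small data structure. This is a minimal sketch, not the authors' implementation: the `Node` class, the `kind` labels, and the `leaves` helper are hypothetical names, and the toy tree reuses the motive example from this summary. It also assumes that only story facts are passed to the narrative-writing step while commonsense leaves stay implicit.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """One statement in a hypothetical reasoning tree."""
    text: str
    kind: str = "deduction"  # "deduction", "story_fact", or "commonsense"
    children: List["Node"] = field(default_factory=list)


def leaves(node: Node, kind: str) -> List[str]:
    """Collect the text of every leaf node of the requested kind."""
    if not node.children:
        return [node.text] if node.kind == kind else []
    found: List[str] = []
    for child in node.children:
        found.extend(leaves(child, kind))
    return found


# Toy tree: a deduction supported by one narrative fact and one social norm.
tree = Node(
    "The suspect has a motive",
    children=[
        Node("The suspect was fired", kind="story_fact"),
        Node("Getting fired causes anger", kind="commonsense"),
    ],
)

# Assumption: story facts are written into the narrative, while
# commonsense leaves must be supplied by the reader (or the model).
story_facts = leaves(tree, "story_fact")
implicit_knowledge = leaves(tree, "commonsense")
```

Keeping the tree separate from the generated text is what gives each instance a ground-truth intermediate structure to check a model's reasoning against.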

Architecture

The Dataset Construction Process for MuSR.

Evaluation Highlights

GPT-4 with CoT+ achieves 80.4% accuracy on Murder Mysteries, 13.7 points below the Human Majority baseline of 94.1%
Llama 2 70B Chat performs near random chance on Murder Mysteries (48.8% vs 50.0% random baseline) and remains low on Object Placements (42.2% vs 24.6% random baseline)
Program-Aided Language Models (PAL) outperform end-to-end GPT-4 on Team Allocation (87.2% vs 68.4%) by offloading calculation to Python

Breakthrough Assessment

8/10

A significant contribution to reasoning benchmarks. The generation methodology effectively bridges the gap between rigid logic puzzles and messy natural language, exposing clear limitations in current SOTA models.

⚙️ Technical Details

Problem Definition

Setting: Question Answering over long natural language narratives (approx. 1000 words)

Inputs: Natural language narrative x and a question q

Outputs: Answer a selected from a set of candidates

Pipeline Flow

Prompt Construction (Regular, CoT, or CoT+)
LLM Inference (Generation of reasoning/answer)
Neurosymbolic Execution (Optional: Python execution for PAL or Graph parsing for SymbolicTOM)

System Modules

Prompting Interface

Wraps the narrative and question with instructions (e.g., 'Think step-by-step')

Model or implementation: GPT-4 / GPT-3.5 / Llama 2
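The three prompt styles named in the pipeline (Regular, CoT, CoT+) might be assembled roughly as follows. This is a sketch under stated assumptions: `build_prompt`, `COT_INSTRUCTION`, and the `domain_hint` parameter are hypothetical names, and the wording of the instructions is invented for illustration. The actual prompts are in the paper's appendix.

```python
from typing import Sequence

# Illustrative wording only; not the paper's actual instruction text.
COT_INSTRUCTION = "Think step-by-step before giving your final answer."


def build_prompt(narrative: str, question: str, choices: Sequence[str],
                 style: str = "regular", domain_hint: str = "") -> str:
    """Assemble a prompt in one of three styles: regular, cot, or cot+."""
    if style not in ("regular", "cot", "cot+"):
        raise ValueError(f"unknown style: {style!r}")
    if style == "cot+" and not domain_hint:
        raise ValueError("cot+ needs a domain_hint describing the reasoning strategy")

    options = "\n".join(f"{i}. {choice}" for i, choice in enumerate(choices, start=1))
    parts = [narrative.strip(), f"Question: {question}", options]
    if style == "cot+":
        parts.append(domain_hint)      # domain-specific reasoning strategy
    if style in ("cot", "cot+"):
        parts.append(COT_INSTRUCTION)  # generic chain-of-thought trigger
    return "\n\n".join(parts)


example = build_prompt(
    "A short story goes here.",
    "Who is the most likely murderer?",
    ["Ann", "Ben"],
    style="cot+",
    domain_hint="A murderer must have a means, a motive, and an opportunity.",
)
```

The only difference between CoT and CoT+ in this sketch is the extra domain description, which mirrors how the summary defines CoT+.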

PAL Solver (Team Allocation only) (Neurosymbolic Inference)

Extracts variables and constraints from text into Python code to solve optimization problems

Model or implementation: GPT-4 (code generation) + Python Runtime
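To make the PAL idea concrete, the following sketch shows the kind of Python program an LLM might emit for a Team Allocation story. Everything here is an assumption for illustration: the names, the skill and teamwork numbers, the scoring rule (individual skill plus pair teamwork), and the setup of three people split across two tasks are invented, not taken from the dataset.

```python
from itertools import combinations

# Hypothetical values an LLM might extract from a narrative (invented here).
skill = {
    "Alice": {"cooking": 3, "cleaning": 1},
    "Bob": {"cooking": 1, "cleaning": 2},
    "Cara": {"cooking": 2, "cleaning": 3},
}
teamwork = {
    frozenset(("Alice", "Bob")): 1,
    frozenset(("Alice", "Cara")): 1,
    frozenset(("Bob", "Cara")): 3,
}
TASKS = ("cooking", "cleaning")


def score(solo, solo_task, pair, pair_task):
    """Assumed objective: sum of individual skills plus the pair's teamwork."""
    return (skill[solo][solo_task]
            + sum(skill[p][pair_task] for p in pair)
            + teamwork[frozenset(pair)])


def best_allocation():
    """Enumerate every split of three people into a pair and a solo worker."""
    people = sorted(skill)
    candidates = []
    for pair in combinations(people, 2):
        (solo,) = set(people) - set(pair)
        for pair_task in TASKS:
            solo_task = TASKS[1] if pair_task == TASKS[0] else TASKS[0]
            total = score(solo, solo_task, pair, pair_task)
            candidates.append((total, solo, solo_task, pair, pair_task))
    return max(candidates)  # highest total score wins


best = best_allocation()
print(best)
```

The language model only has to read the numbers out of the text; the exhaustive search and arithmetic are delegated to the Python runtime, which is the part that end-to-end prompting tends to get wrong.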

Decomposed Prompting (Murder Mystery only) (Neurosymbolic Inference)

Breaks the mystery into sub-questions (Means, Motive, Opportunity) for each suspect

Model or implementation: GPT-4
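The decomposition into Means, Motive, and Opportunity can be sketched as a loop over suspects and aspects. This is a hedged illustration: `pick_murderer`, the question wording, and the "most confirmed aspects wins" aggregation rule are assumptions, and `ask` is a stand-in for a real LLM call that returns a yes/no judgment.

```python
from typing import Callable, Dict, List

ASPECTS = ("means", "motive", "opportunity")


def pick_murderer(narrative: str, suspects: List[str],
                  ask: Callable[[str], bool]) -> str:
    """Ask one yes/no sub-question per (suspect, aspect) pair.

    Returns the suspect with the most confirmed aspects (assumed rule).
    """
    confirmed: Dict[str, int] = {}
    for suspect in suspects:
        confirmed[suspect] = sum(
            ask(f"{narrative}\n\nDoes {suspect} have {aspect}? Answer yes or no.")
            for aspect in ASPECTS
        )
    return max(suspects, key=lambda s: confirmed[s])


# Toy stand-in for an LLM: answers "yes" only for pre-written pairs.
KNOWN = {("Ann", "means"), ("Ann", "motive"), ("Ann", "opportunity"),
         ("Ben", "motive")}


def fake_llm(prompt: str) -> bool:
    return any(f"Does {s} have {a}?" in prompt for s, a in KNOWN)


culprit = pick_murderer("A short story goes here.", ["Ann", "Ben"], fake_llm)
```

Each sub-question is simpler than the full mystery, which is the motivation for decomposing; the trade-off is that the decomposition is specific to this one domain.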

Novel Architectural Elements

Data Generation Architecture (Novelty): The paper's primary architectural contribution is the 'Tree-to-Text' generation pipeline (Tree Template -> Reasoning Tree Completion -> Story Generation), rather than a novel inference model.

Modeling

Base Model: GPT-4 (used for dataset generation and as the primary strong baseline)

Training Method: Not applicable; inference-only evaluation on a synthetic dataset

Training Data:

756 total examples across 3 domains (Murder Mystery, Object Placements, Team Allocation)
Dataset generated using GPT-4 with a neurosymbolic pipeline
Validated by human annotators (Triply-annotated subset)

Compute: Not reported in the paper

Comparison to Prior Work

vs. True Detective: MuSR is synthetic (scalable) and explicitly structured with ground truth intermediate reasoning trees
vs. CLUTRR: MuSR involves natural language narratives with commonsense ('soft') reasoning, not just strict logical rules
vs. SocialIQA: MuSR requires multistep reasoning across a long narrative, whereas SocialIQA is typically single-step

Limitations

The dataset is generated by GPT-4, potentially introducing biases or artifacts that GPT-4 might find easier to solve (though results show it still struggles)
Evaluation is limited to accuracy; intrinsic evaluation of the generated intermediate reasoning traces is not deeply explored
Rule-based baselines are relatively simple (length-based heuristics)
Neurosymbolic baselines require domain-specific engineering and are not general-purpose

Reproducibility

Code: https://github.com/Zayne-Sprague/MuSR

The code is publicly available. The repository contains the dataset and the generation code. Prompts for the CoT+ and neurosymbolic baselines are provided in Appendix I of the paper.

📊 Experiments & Results

Evaluation Setup

Zero-shot and Few-shot Question Answering on generated narratives

Benchmarks:

MuSR - Murder Mystery (Deductive Reasoning (Motive, Means, Opportunity)) [New]
MuSR - Object Placements (Theory of Mind / Spatial Reasoning) [New]
MuSR - Team Allocation (Constraint Satisfaction / Optimization) [New]

Metrics:

Accuracy
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
GPT-4 outperforms other LLMs but falls short of human performance across all domains using CoT+ prompting (Baseline: human majority; This Paper: GPT-4 with CoT+).
MuSR - Murder Mystery	Accuracy	94.1	80.4	-13.7
MuSR - Object Placements	Accuracy	95.0	60.9	-34.1
MuSR - Team Allocation	Accuracy	100.0	68.4	-31.6
Neurosymbolic approaches (PAL) can outperform standard prompting in logic-heavy domains (Baseline: GPT-4 with CoT+; This Paper: PAL).
MuSR - Team Allocation	Accuracy	68.4	87.2	+18.8
Open-source models struggle significantly on the benchmark (Baseline: random chance; This Paper: Llama 2 70B Chat).
MuSR - Murder Mystery	Accuracy	50.0	48.8	-1.2
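The Δ column is the reported accuracy minus the baseline accuracy, in percentage points. A short check of that arithmetic, using only the numbers in the table (the `rows` list and variable names are illustrative):

```python
# (benchmark, baseline accuracy, reported accuracy), copied from the table.
rows = [
    ("Murder Mystery", 94.1, 80.4),
    ("Object Placements", 95.0, 60.9),
    ("Team Allocation", 100.0, 68.4),
    ("Team Allocation (PAL)", 68.4, 87.2),
    ("Murder Mystery (open-source)", 50.0, 48.8),
]

# Round to one decimal place to avoid floating-point noise.
deltas = [round(result - baseline, 1) for _, baseline, result in rows]

for (name, baseline, result), delta in zip(rows, deltas):
    print(f"{name}: {result} - {baseline} = {delta:+.1f}")
```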

Experiment Figures

Partial reasoning trees for the three domains: Murder Mystery, Object Placements, and Team Allocation.

Main Takeaways

LLMs, including GPT-4, struggle with reasoning tasks that require combining multistep deduction with commonsense interpretation of natural text.
The 'CoT+' prompting strategy (adding domain reasoning rules) improves performance over standard CoT, but a large gap to human performance remains.
Neurosymbolic methods (like PAL) are highly effective for domains that map cleanly to code (Team Allocation) but struggle in domains requiring nuanced text interpretation (Object Placements/Theory of Mind).
Current open-source models (Llama 2, Vicuna) perform at or only modestly above random chance on these complex narratives (e.g., Llama 2 70B Chat scores 48.8% on Murder Mysteries against a 50.0% random baseline).

📚 Prerequisite Knowledge

Prerequisites

Familiarity with Chain-of-Thought (CoT) prompting
Understanding of Neurosymbolic AI concepts
Basic knowledge of Logic Programming / Reasoning Trees

Key Terms

Neurosymbolic: Systems combining neural networks (like LLMs) with symbolic logic or structured reasoning methods

Chain-of-Thought: A prompting strategy where the model produces intermediate reasoning steps before the final answer

CoT+: A variant of Chain-of-Thought prompting introduced in this paper that includes a textual description of the specific domain's reasoning strategy

Soft Reasoning: Reasoning that combines strict logical deduction with imprecise commonsense knowledge (e.g., social norms)

PAL: Program-Aided Language Models—a technique where the LLM generates code (e.g., Python) to solve reasoning problems instead of predicting the answer directly

Theory of Mind: The ability to attribute mental states (beliefs, intents, knowledge) to oneself and others; used in the Object Placements domain
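A theory-of-mind question of the kind used in Object Placements can be sketched as belief tracking: a person's belief about an object is the last move they witnessed. This is a simplified illustration under assumptions, not the dataset's generation code; `believed_location`, the `Move` tuple layout, and the example story are hypothetical, and everyone is assumed to know each object's starting location.

```python
from typing import Dict, List, Set, Tuple

# (object, new location, people who saw the move, including the mover)
Move = Tuple[str, str, Set[str]]


def believed_location(person: str, obj: str,
                      start: Dict[str, str], moves: List[Move]) -> str:
    """Return where `person` thinks `obj` is, given the moves they witnessed."""
    belief = start[obj]  # assumption: the starting location is common knowledge
    for moved_obj, new_location, witnesses in moves:
        if moved_obj == obj and person in witnesses:
            belief = new_location  # update only on observed moves
    return belief


start = {"keys": "drawer"}
moves: List[Move] = [
    ("keys", "shelf", {"Ann", "Ben"}),  # both see this move
    ("keys", "bag", {"Ann"}),           # Ben misses this one
]

ann_belief = believed_location("Ann", "keys", start, moves)
ben_belief = believed_location("Ben", "keys", start, moves)
```

In a natural-language narrative the hard part is the `witnesses` set: whether a character saw a move is usually implied rather than stated, which is where commonsense enters.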

Decomposed Prompting: A neurosymbolic approach where a complex task is broken down into sub-tasks handled by separate prompts

MAX-SAT: Maximum Satisfiability Problem—finding an assignment that satisfies the maximum number of constraints; the logical basis for the Team Allocation domain