ReTraceQA: Evaluating Reasoning Traces of Small Language Models in Commonsense Question Answering

📝 Paper Summary

Process-based Evaluation, Small Language Models (SLMs), Commonsense Reasoning

ReTraceQA introduces a manually annotated benchmark revealing that Small Language Models frequently produce correct final answers via flawed reasoning paths, a failure mode current answer-only metrics overlook.

Core Problem

Current evaluation practices for Small Language Models (SLMs) rely almost exclusively on final answer accuracy, neglecting whether the underlying reasoning process is valid.

Why it matters:

Models can arrive at correct answers through invalid reasoning (false positives), artificially inflating performance metrics
Existing process-based benchmarks focus on math/science, leaving commonsense reasoning largely underexplored despite requiring different capabilities
Answer-only metrics mask critical hallucinations and logical errors, making models unreliable for real-world deployment where reasoning transparency matters

Concrete Example: A model might correctly answer 'A' to a question about arctic animals but justify it by claiming 'wolves are not found in arctic regions' (a hallucination). The final answer is correct, but the reasoning is factually wrong.

Key Novelty

ReTraceQA: First Gold-Standard Commonsense Reasoning Trace Benchmark

Provides 2,421 expert-annotated reasoning traces from SLMs on commonsense tasks, labeling the exact step where errors occur
Categorizes errors into Hallucination, Reasoning, and Misinterpretation, distinguishing between factual failures and logical inconsistencies
Demonstrates that strong LLMs (such as GPT-4o) struggle to localize the specific erroneous step in commonsense traces, even when they can detect that the trace as a whole is wrong

Architecture

The ReTraceQA annotation pipeline. It illustrates the process from Question -> SLM Trace Generation -> Human Annotation -> Final Benchmark.

Evaluation Highlights

14-24% of SLM instances across datasets yield the correct final answer despite having flawed reasoning processes
SLM performance scores drop by up to 25% when evaluated on reasoning correctness rather than just final answer accuracy
Hallucination errors constitute the majority of failures (41.9%–62.5%), exceeding logical reasoning errors
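
The error-type shares above can be tallied directly from per-trace annotations. The following is a minimal sketch, assuming each flawed trace carries exactly one error label; the label spellings and the `error_type_shares` helper are illustrative and not taken from the paper's release.

```python
from collections import Counter

# Error categories named in the benchmark; the string spellings are assumed.
ERROR_TYPES = ("hallucination", "reasoning", "misinterpretation")


def error_type_shares(labels):
    """Return the percentage share of each error type among flawed traces.

    `labels` holds one error-type string per flawed trace (hypothetical format).
    """
    counts = Counter(labels)
    total = sum(counts[t] for t in ERROR_TYPES)
    if total == 0:
        return {t: 0.0 for t in ERROR_TYPES}
    return {t: 100.0 * counts[t] / total for t in ERROR_TYPES}


# Toy annotations for illustration only, not real benchmark data.
toy_labels = ["hallucination"] * 5 + ["reasoning"] * 3 + ["misinterpretation"] * 2
shares = error_type_shares(toy_labels)
```

On the toy labels, hallucination accounts for half of the flawed traces; the real shares have to come from the benchmark annotations themselves.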

Breakthrough Assessment

8/10

Clearly exposes the 'false positive' problem in SLM evaluation for commonsense tasks. The manually annotated traces provide high-quality ground truth of a kind that is currently rare outside math domains.

⚙️ Technical Details

Problem Definition

Setting: Binary trace-level classification or step-level error localization in reasoning traces

Inputs: Question q, optional choices C, and a step-by-step reasoning trace S = [s0, s1, ..., sn]

Outputs: Index i representing the first error step (or -1 if correct)
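
The input/output contract above can be written down as a small data structure. This is a sketch only: the `AnnotatedTrace` class, its field names, and the toy question wording are assumptions made for illustration, loosely paraphrasing the arctic-animal example, and do not reflect the benchmark's actual file format.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AnnotatedTrace:
    question: str                        # question q
    steps: List[str]                     # trace S = [s0, s1, ..., sn]
    first_error: int                     # gold index of the first flawed step, -1 if none
    choices: Optional[List[str]] = None  # optional answer choices C


def is_trace_correct(first_error: int) -> bool:
    """Binary view: a trace is correct when no step is flagged."""
    return first_error == -1


def localization_hit(predicted: int, trace: AnnotatedTrace) -> bool:
    """Step-level view: the prediction must equal the gold index exactly."""
    if predicted < -1 or predicted >= len(trace.steps):
        raise ValueError("predicted index is outside the trace")
    return predicted == trace.first_error


# Toy record with hypothetical wording; the second step is the hallucination.
example = AnnotatedTrace(
    question="Which of these animals lives in arctic regions?",
    steps=[
        "The question asks for an animal that lives in the arctic.",
        "Wolves are not found in arctic regions.",
        "Therefore the answer is A.",
    ],
    first_error=1,
    choices=["A", "B"],
)
```

A judge that outputs `1` for this record localizes the error correctly, while a judge that outputs `-1` wrongly accepts the trace even though the final answer is right.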

Pipeline Flow

Data Collection: Sample questions from CSQA, OBQA, QASC, StrategyQA
Trace Generation: Generate CoT traces using 7 SLMs (Qwen, Llama, Phi families)
Filtering: Sample to balance correct/incorrect answers and model representation
Annotation: Human experts label first error step and error type
Evaluation: Test LLMs and PRMs on their ability to match human labels
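
The filtering step can be pictured as stratified sampling over (model, answer correctness) groups. The sketch below is one plausible reading, not the paper's documented procedure: the record keys `model` and `answer_correct`, the `per_group` quota, and the `balanced_sample` helper are all hypothetical.

```python
import random
from collections import defaultdict


def balanced_sample(records, per_group, seed=0):
    """Draw up to `per_group` traces for every (model, answer_correct) pair.

    `records` is a list of dicts with the hypothetical keys "model" and
    "answer_correct"; the paper's actual sampling scheme may differ.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for rec in records:
        groups[(rec["model"], rec["answer_correct"])].append(rec)

    sample = []
    for key in sorted(groups):  # fixed group order keeps the draw reproducible
        pool = groups[key]
        k = min(per_group, len(pool))
        sample.extend(rng.sample(pool, k))
    return sample


# Toy pool: two made-up model names, ten traces each, alternating correctness.
toy_pool = [
    {"model": name, "answer_correct": i % 2 == 0, "id": f"{name}-{i}"}
    for name in ("slm_a", "slm_b")
    for i in range(10)
]
picked = balanced_sample(toy_pool, per_group=2)
```

Sampling per group rather than from the pooled traces keeps any single model, or the more common answer outcome, from dominating the annotated set.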

System Modules

SLM Generators

Generate reasoning traces for commonsense questions

Model or implementation: Various (Qwen2.5, Llama-3.2, Phi-4-mini)

Human Annotator

Identify first error step and categorize error type

Model or implementation: Human Experts (PhD level)

Judge/PRM

Predict correctness of trace or location of error

Model or implementation: Various LLMs (GPT-4o, DeepSeek-R1, etc.) or specialized PRMs

Novel Architectural Elements

ReTraceQA Benchmark: A curated dataset of 2,421 reasoning traces with manual step-level error annotations specifically for commonsense domains

Modeling

Base Model: No single base model; the evaluated judge models include GPT-4o, DeepSeek-R1, Qwen2.5-72B, and Llama-3.3-70B

Comparison to Prior Work

vs. ProcessBench: Focuses on commonsense reasoning rather than math; finds step segmentation is naturally cleaner in commonsense traces
vs. Math-Shepherd: Evaluates on commonsense tasks where math-specific training may not transfer
vs. MR-Ben: Specifically targets the discrepancy between final answer accuracy and process validity in SLMs

Limitations

Dataset generation limited to 7 specific SLMs; might not cover all reasoning styles
Human annotation is resource-intensive, limiting dataset size to ~2.4k examples
Focus is strictly on finding the *first* error; subsequent errors are ignored

Reproducibility

Code availability: not provided. The paper does not give an explicit link to the dataset or code in its abstract or introduction. The benchmark methodology and prompts are described in the appendices.

📊 Experiments & Results

Evaluation Setup

Judges predict trace correctness (binary) and the index of the first error step; predictions are compared against human gold labels.

Benchmarks:

ReTraceQA (Custom) (Reasoning Trace Verification) [New]

Metrics:

F1 score (harmonic mean of the accuracy of detecting correct traces and the accuracy of localizing the first error in flawed traces)
Accuracy (binary correctness of trace)
Error Recall (identifying flawed traces)
Statistical methodology: Inter-annotator agreement measured via Fleiss's kappa (84.4%)
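
The three judge metrics listed above can be computed from paired gold and predicted first-error indices. This is a minimal sketch under stated assumptions: `-1` marks a correct trace, localization requires an exact index match, and error recall counts a flawed trace as found whenever any step is flagged. The function names are illustrative, and the paper's exact definitions may differ.

```python
def _mean(flags):
    """Average of a list of booleans, 0.0 for an empty list."""
    return sum(flags) / len(flags) if flags else 0.0


def process_f1(gold, pred):
    """Harmonic mean of accuracy on correct traces and on flawed traces."""
    acc_correct = _mean([p == -1 for g, p in zip(gold, pred) if g == -1])
    acc_error = _mean([p == g for g, p in zip(gold, pred) if g != -1])
    if acc_correct + acc_error == 0:
        return 0.0
    return 2 * acc_correct * acc_error / (acc_correct + acc_error)


def binary_accuracy(gold, pred):
    """Share of traces whose correct/flawed verdict matches the gold verdict."""
    return _mean([(p == -1) == (g == -1) for g, p in zip(gold, pred)])


def error_recall(gold, pred):
    """Share of flawed traces that the judge flags as flawed at any step."""
    return _mean([p != -1 for g, p in zip(gold, pred) if g != -1])


# Toy labels for illustration only: gold vs. judge first-error indices.
gold = [-1, -1, 2, 0, 1, -1]
pred = [-1, 1, 2, 0, 3, -1]
```

In the toy data the fifth trace is flagged as flawed but at the wrong step, so it counts toward binary accuracy and error recall while lowering the F1 score.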

Key Results

Analysis of the annotated ReTraceQA dataset shows a significant portion of 'correct' answers come from flawed reasoning.

Benchmark	Metric	Baseline	This Paper	Δ
ReTraceQA (Average across subsets)	Process Error Rate (Correct Final Answer)	0	17.9	+17.9

Impact of using reasoning-aware evaluation (Judge) vs standard answer-only evaluation on SLM performance.

Benchmark	Metric	Baseline	This Paper	Δ
Qwen2.5-7B-Instruct on CSQA	Performance Score	64.3	45.7	-18.6
Llama-3.1-8B-Instruct on CSQA	Performance Score	69.1	50.5	-18.6
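
The gap between the two scoring schemes in the table can be illustrated with a small calculation. The sketch below assumes that reasoning-aware scoring credits an instance only when the final answer is correct and no step is flagged as erroneous; the record keys and function names are hypothetical, and the toy numbers are not from the paper.

```python
def answer_only_accuracy(records):
    """Percentage of instances whose final answer is correct."""
    return 100.0 * sum(r["answer_correct"] for r in records) / len(records)


def reasoning_aware_accuracy(records):
    """Percentage of instances with a correct answer and an error-free trace."""
    hits = sum(r["answer_correct"] and r["first_error"] == -1 for r in records)
    return 100.0 * hits / len(records)


# Toy data: 10 instances, 7 correct answers, 2 of which rest on a flawed trace.
toy_records = (
    [{"answer_correct": True, "first_error": -1}] * 5
    + [{"answer_correct": True, "first_error": 1}] * 2
    + [{"answer_correct": False, "first_error": 0}] * 3
)

baseline = answer_only_accuracy(toy_records)
adjusted = reasoning_aware_accuracy(toy_records)
delta = adjusted - baseline
```

Here the two 'right answer, wrong reasoning' instances account for the entire drop, which is the same effect the Δ column captures for the real models.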

Main Takeaways

Standard answer-only metrics overestimate SLM capabilities by up to 25% because models frequently guess right for the wrong reasons.
Hallucinations are the dominant error mode in commonsense reasoning (41.9%–62.5% of errors), suggesting SLMs struggle more with factual grounding than pure logic.
Strong LLMs (like GPT-4o) act as decent judges for binary correctness but struggle with precise error localization in commonsense traces compared to math domains.
Math-trained PRMs transfer poorly to commonsense tasks, highlighting the need for domain-specific process supervision.

📚 Prerequisite Knowledge

Prerequisites

Chain-of-Thought (CoT) prompting
Process Reward Models (PRMs) vs. Outcome Reward Models (ORMs)
LLM-as-a-judge evaluation paradigms

Key Terms

SLMs: Small Language Models—models with roughly 10 billion parameters or fewer

Reasoning Trace: The step-by-step text explanation generated by a model before its final answer

PRM: Process Reward Model—a model trained to score the correctness of individual reasoning steps rather than just the final outcome

CoT: Chain-of-Thought—a prompting strategy encouraging models to generate intermediate reasoning steps

Best-of-N: A decoding strategy where multiple candidate responses are generated, and a reward model selects the best one
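
Best-of-N selection with a step-level scorer can be sketched as follows. The `step_scorer` callable stands in for a PRM and is hypothetical, as is the keyword-based `toy_scorer`; aggregating a trace by its minimum step score is one common choice and is assumed here rather than taken from the paper.

```python
from typing import Callable, List, Sequence


def best_of_n(
    candidates: Sequence[List[str]],
    step_scorer: Callable[[List[str]], List[float]],
) -> int:
    """Return the index of the candidate trace with the highest aggregate score.

    Each candidate is a non-empty list of steps; a trace is scored by its
    weakest step (assumed aggregation). Ties go to the earliest candidate.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    scores = [min(step_scorer(steps)) for steps in candidates]
    return max(range(len(candidates)), key=scores.__getitem__)


def toy_scorer(steps: List[str]) -> List[float]:
    """Stand-in for a PRM: penalizes one known-bad phrase, nothing more."""
    return [0.1 if "not found" in s else 0.9 for s in steps]


# Two toy candidates that reach the same answer through different reasoning.
toy_candidates = [
    ["Wolves are not found in arctic regions.", "The answer is A."],
    ["Polar bears live in arctic regions.", "The answer is A."],
]
chosen = best_of_n(toy_candidates, toy_scorer)
```

Because both candidates end in the same answer, only a scorer that inspects individual steps can prefer the second one.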

LLM-as-a-judge: Using a strong language model (like GPT-4) to evaluate the quality or correctness of another model's output

Commonsense Reasoning: Tasks requiring general world knowledge and intuitive physics/social understanding, rather than specialized math or coding skills

Hallucination: Generated content that is factually incorrect, unverifiable, or not grounded in the provided context