Soundness-Aware Level: A Microscopic Signature that Predicts LLM Reasoning Potential

📝 Paper Summary

Mechanistic Interpretability, LLM Reasoning, Model Selection

A model's reasoning potential is determined by its intrinsic, pre-trained ability to distinguish sound logical rules from noise, which can be quantified microscopically via the Soundness-Aware Level (SAL).

Core Problem

Reinforcement Learning with Verifiable Rewards (RLVR) improves reasoning, but its effectiveness varies drastically across different base models, and we lack a systematic way to predict which pre-trained models will become strong reasoners.

Why it matters:

Applying RLVR to the wrong base model wastes massive computational resources if the model lacks the intrinsic potential to reason
Current methods analyze macroscopic behaviors (output text) rather than the internal mechanisms driving reasoning capabilities
Understanding the microscopic determinants of reasoning helps in selecting and designing better base models for the next generation of reasoning systems

Concrete Example: When applying the exact same RLVR pipeline, Qwen-2.5-7B develops strong reasoning capabilities, while Llama-3.1-8B lags behind. Macroscopic analysis of their pre-training text doesn't explain this gap, but microscopic analysis shows Llama treats spurious correlations with the same high confidence as strict mathematical rules.

Key Novelty

Soundness-Aware Level (SAL) via Logic-SAEs

Formalizes internal model computation as 'Horn clauses' (if-then rules) between features extracted by Cross-Layer Sparse Autoencoders (SAEs)
Measures the Jensen-Shannon divergence (JSD) between the model's internal confidence distributions for 'Strict' rules versus 'Noise' rules
Establishes a precise empirical law linking this internal microscopic signature directly to macroscopic post-RLVR error rates

Architecture

The workflow for calculating Soundness-Aware Level (SAL). Steps: (1) Extract features with SAE, (2) Estimate rules via co-occurrence, (3) Judge soundness with LLM, (4) Compute SAL via divergence.

Evaluation Highlights

SAL predicts post-RLVR error rates with high fidelity (R^2 = 0.87) across unseen models from diverse families (Qwen, Mistral, Llama, DeepSeek)
Qwen-2.5-7B achieves a high SAL of ~0.20 (strong separation of sound/unsound rules), while Llama-3.1-8B scores ~0.06 (soundness-agnostic)
The predictive law holds across model scales, with SAL increasing monotonically from 0.5B to 14B parameters within the Qwen family

Breakthrough Assessment

9/10

Establishes a quantitative law connecting microscopic mechanism interpretability (SAE features) directly to downstream macroscopic capability (reasoning potential), a rare and significant bridge in AI science.

⚙️ Technical Details

Problem Definition

Setting: Predicting the post-RLVR error rate (epsilon) of a pre-trained LLM using only its internal representations on unlabeled data

Inputs: Pre-trained LLM weights, unlabeled mathematical corpus

Outputs: Soundness-Aware Level (SAL) score, Predicted post-RLVR Error Rate

Pipeline Flow

Feature Extraction (SAE) -> Rule Discovery (Co-occurrence) -> Soundness Assessment (LLM Judge) -> Metric Calculation (SAL)

System Modules

Cross-Layer SAE

Decompose raw hidden activations from the LLM into interpretable sparse features

Model or implementation: Sparse Autoencoder (C=2^15 features)

Rule Estimator

Identify implicit logical rules (Horn clauses) by calculating transition probabilities between features

Model or implementation: Statistical Estimator (Maximum Likelihood with smoothing)
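
The summary does not spell out the estimator beyond "Maximum Likelihood with smoothing", so the following is only a minimal sketch under stated assumptions: each rule has a single premise feature (Horn clauses may have several), activations are already binarized into sets of active feature ids, and the smoothing is additive (Laplace) on a Bernoulli rate. The function name `estimate_rule_probs` and its parameters are hypothetical.

```python
from collections import Counter

def estimate_rule_probs(samples, smoothing=1.0):
    """Estimate P(conclusion active | premise active) from co-occurrence counts.

    samples: iterable of (premise_ids, conclusion_ids) pairs, each a set of
             active SAE feature ids at an earlier and a later layer.
    Assumption: additive smoothing of a Bernoulli rate, so the denominator
    gains 2 * smoothing (one pseudo-count per outcome).
    """
    premise_counts = Counter()
    pair_counts = Counter()
    for premise_ids, conclusion_ids in samples:
        for p in premise_ids:
            premise_counts[p] += 1
            for c in conclusion_ids:
                pair_counts[(p, c)] += 1

    def prob(premise, conclusion):
        hits = pair_counts[(premise, conclusion)]
        total = premise_counts[premise]
        return (hits + smoothing) / (total + 2 * smoothing)

    return prob

# Toy co-occurrence data (illustrative feature ids, not from the paper).
prob = estimate_rule_probs([({1}, {10}), ({1}, {10, 11}), ({2}, {11})])
print(prob(1, 10), prob(1, 11))
```

With smoothing, a premise that was never observed falls back to 0.5 rather than raising a division error, which keeps rare rules from receiving extreme confidences.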

Soundness Judge

Categorize extracted rules into semantic levels (Strict, Plausible, Noise)

Model or implementation: DeepSeek-R1

SAL Calculator

Compute the divergence between the probability distributions of different rule categories

Model or implementation: Jensen-Shannon Divergence (JSD) Formula
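
As a rough illustration of this step, the sketch below bins the rule probabilities of two categories into histograms and computes their Jensen-Shannon divergence. The binning scheme, the bin count, the base-2 logarithm (which bounds the result in [0, 1]), and the function names are all assumptions; the summary does not state how the paper discretizes the distributions.

```python
import math

def histogram(values, bins=10):
    """Normalized histogram of probabilities in [0, 1]; requires a non-empty list."""
    counts = [0] * bins
    for v in values:
        counts[min(int(v * bins), bins - 1)] += 1
    total = sum(counts)
    return [c / total for c in counts]

def kl(p, q):
    """KL divergence in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence: mean KL of p and q to their mixture m."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def soundness_aware_level(strict_probs, noise_probs, bins=10):
    """Hypothetical SAL: JSD between Strict-rule and Noise-rule histograms."""
    return jsd(histogram(strict_probs, bins), histogram(noise_probs, bins))

# Toy values (illustrative only): well-separated vs. identical confidences.
print(soundness_aware_level([0.95, 0.90, 0.99], [0.05, 0.10, 0.02]))
print(soundness_aware_level([0.5, 0.6, 0.7], [0.5, 0.6, 0.7]))
```

A model that assigns the same confidences to both categories scores 0 under this sketch, which matches the "soundness-agnostic" reading; fully disjoint histograms reach the upper bound of 1.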

Novel Architectural Elements

Formalization of transformer feed-forward steps as probabilistic Horn clauses between SAE features
Use of cross-layer SAEs specifically to map premise-conclusion relationships across layers
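
One way to picture a probabilistic Horn clause between SAE features is as a small record holding the premise features, the conclusion feature, the estimated probability, and the judge's label. The sketch below is illustrative only; the class names, field names, and the grouping helper are hypothetical and not taken from the paper.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum

class Soundness(Enum):
    STRICT = "strict"
    PLAUSIBLE = "plausible"
    NOISE = "noise"

@dataclass(frozen=True)
class HornRule:
    premises: frozenset   # SAE feature ids active at an earlier layer
    conclusion: int       # SAE feature id at a later layer
    prob: float           # estimated P(conclusion | premises)
    label: Soundness      # category assigned by the soundness judge

def group_probs_by_label(rules):
    """Collect rule probabilities per soundness category."""
    groups = defaultdict(list)
    for rule in rules:
        groups[rule.label].append(rule.prob)
    return dict(groups)

# Toy rules with made-up feature ids and probabilities.
rules = [
    HornRule(frozenset({3, 7}), 42, 0.93, Soundness.STRICT),
    HornRule(frozenset({5}), 42, 0.41, Soundness.NOISE),
    HornRule(frozenset({3}), 19, 0.88, Soundness.STRICT),
]
print(group_probs_by_label(rules))
```

Grouping probabilities by label yields the per-category distributions that a divergence measure can then compare.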

Modeling

Base Model: Analyzed variants: Qwen-2.5 (0.5B, 1.5B, 7B, 14B), Mistral-7B-v0.1, Llama-3.1-8B, DeepSeek-Math-7B

Training Method: Analysis of pre-trained models (SAE training only)

Objective Functions:

Purpose: Train SAE to reconstruct hidden states sparsely.

Formally: L = ||x - x_hat||^2 + alpha * ||f||_1 (Reconstruction loss + L1 sparsity penalty)
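
To make the two terms concrete, here is a minimal sketch of the loss for a single hidden state, written in plain Python for readability. It assumes `x_hat` and the feature activations `f` have already been produced by some encoder and decoder (not shown, and not specified in this summary); the default `alpha` mirrors the sparsity penalty listed under Key Hyperparameters.

```python
def sae_loss(x, x_hat, f, alpha=5e-3):
    """Squared reconstruction error plus an L1 penalty on feature activations.

    x:     original hidden state (list of floats)
    x_hat: SAE reconstruction of x (same length as x)
    f:     SAE feature activations (length = number of features)
    """
    reconstruction = sum((xi - xh) ** 2 for xi, xh in zip(x, x_hat))
    sparsity = sum(abs(fi) for fi in f)
    return reconstruction + alpha * sparsity

# Toy vectors (illustrative only): reconstruction term 1.0, L1 norm 2.5.
loss = sae_loss(x=[1.0, 2.0], x_hat=[1.0, 1.0], f=[0.5, 0.0, 2.0])
print(loss)
```

Raising `alpha` trades reconstruction fidelity for fewer active features per input, which is what makes the resulting features easier to interpret individually.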

Training Data:

128K unique mathematical questions from MATH, GSM8K, NuminaMath
Model-generated 'think'-style responses used as the SAE training corpus

Key Hyperparameters:

sae_features: 32768 (2^15)
sae_layers: 8 (evenly spaced)
learning_rate: 2e-4
sparsity_penalty_alpha: 5e-3
optimizer: AdamW

Compute: Not reported in the paper

Comparison to Prior Work

vs. Behavioral Metrics: SAL is microscopic (internal states) rather than macroscopic (outputs), offering a mechanism-based prediction
vs. Causal Intervention: SAL uses probabilistic co-occurrence (scalable) rather than perturbation (slow, struggles with many-to-one logic)
vs. Pre-RL Accuracy: SAL is a zero-label metric (does not require ground truth solutions for the analysis corpus)

Limitations

Dependency on the quality of SAE feature extraction and interpretability
Reliance on an LLM judge (DeepSeek-R1) for soundness labeling, which introduces its own biases
Computational cost of training SAEs for every target model to be analyzed
Study limited to mathematical reasoning tasks

Reproducibility

No replication artifacts mentioned in the paper (code URL not provided in text). SAE training details and hyperparameters are provided. Benchmark dataset sources are listed.

📊 Experiments & Results

Evaluation Setup

Predicting post-RLVR performance on math tasks using pre-training internal statistics

Benchmarks:

MATH500 (Mathematical Reasoning)
GSM8K (Grade School Math)
AIME 2024 (Competition Math)

Metrics:

Post-RLVR Error Rate (epsilon)
Soundness-Aware Level (SAL)
R-squared (Coefficient of Determination)
Statistical methodology: Leave-one-out cross-validation for the empirical law fit
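
The leave-one-out procedure can be sketched as follows. The functional form is an assumption: the summary only says the fit is exponential, so the sketch uses error = a * exp(b * SAL), fitted by ordinary least squares on log(error). The data points are synthetic, generated from a known curve purely to exercise the code; they are not the paper's measurements.

```python
import math

def fit_exponential(sal, err):
    """Least-squares fit of log(err) = log(a) + b * sal; returns (a, b)."""
    n = len(sal)
    log_err = [math.log(e) for e in err]
    mean_x = sum(sal) / n
    mean_y = sum(log_err) / n
    sxx = sum((x - mean_x) ** 2 for x in sal)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(sal, log_err))
    b = sxy / sxx
    a = math.exp(mean_y - b * mean_x)
    return a, b

def loo_r2(sal, err):
    """R^2 of held-out predictions: each model is predicted from a fit on the rest."""
    preds = []
    for i in range(len(sal)):
        a, b = fit_exponential(sal[:i] + sal[i + 1:], err[:i] + err[i + 1:])
        preds.append(a * math.exp(b * sal[i]))
    mean_err = sum(err) / len(err)
    ss_res = sum((e - p) ** 2 for e, p in zip(err, preds))
    ss_tot = sum((e - mean_err) ** 2 for e in err)
    return 1 - ss_res / ss_tot

# Synthetic points on an exact exponential (NOT the paper's data).
synthetic_sal = [0.05, 0.10, 0.15, 0.20]
synthetic_err = [0.9 * math.exp(-5.0 * s) for s in synthetic_sal]
print(fit_exponential(synthetic_sal, synthetic_err))
print(loo_r2(synthetic_sal, synthetic_err))
```

Because every prediction comes from a fit that never saw the held-out model, the resulting R-squared reflects generalization to unseen models and is typically lower than the in-sample fit.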

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
A strong empirical law links the microscopic SAL metric to macroscopic post-RLVR error rates.
Cross-Model Regression	R-squared (Fit)	Not applicable	0.985	Not applicable
Cross-Model Regression	R-squared (Generalization)	Not applicable	0.872	Not applicable
SAL scores reveal significant differences between model families at the same parameter scale (~7B).
Internal Metric	SAL Score	0.058	0.201	+0.143
Internal Metric	SAL Score	0.06	0.11	+0.05
SAL scores scale monotonically with model size within the same family.
Internal Metric	SAL Score	0.06	0.22	+0.16

Experiment Figures

Scatter plot of Post-RLVR Error Rate vs. SAL Score for various models, fitted with an exponential curve

Main Takeaways

High-potential reasoning models are 'soundness-aware': they assign distinct internal probabilities to strict rules vs. noisy correlations.
Low-potential models are 'soundness-agnostic': they collapse probabilities for all rule types into a single distribution, treating noise as fact.
Reasoning potential is an intrinsic property shaped by pre-training and architecture, measurable before any RLVR fine-tuning.
SAL outperforms standard behavioral metrics and even pre-RL benchmark accuracy as a predictor of post-RLVR success.

📚 Prerequisite Knowledge

Prerequisites

Sparse Autoencoders (SAEs) for feature extraction
Logic Programming (Horn Clauses)
Reinforcement Learning with Verifiable Rewards (RLVR)
Probability Theory (Jensen-Shannon Divergence)

Key Terms

SAL: Soundness-Aware Level—a metric measuring how well a model's internal probability distributions distinguish between sound and unsound logic rules

RLVR: Reinforcement Learning with Verifiable Rewards—a training method where models are optimized using objective feedback (e.g., correct/incorrect math answers)

SAE: Sparse Autoencoder—a neural network trained to decompose an LLM's dense hidden states into a sparse set of interpretable features

Horn Clause: A logical rule of the form 'If A and B, then C', used here to represent internal reasoning steps between features

JSD: Jensen-Shannon Divergence, a symmetric, bounded measure of how much two probability distributions differ (zero when they are identical)

LLM Judge: Using a high-capability LLM (DeepSeek-R1) to annotate the semantic quality of extracted rules based on feature descriptions

Strict Rule: A logic rule representing necessary truths (e.g., mathematical theorems)

Plausible Rule: A logic rule representing strong heuristics that are usually but not universally true

Noise Rule: A logic rule representing spurious correlations or nonsensical connections