
Demystifying the Roles of LLM Layers in Retrieval, Knowledge, and Reasoning

X Song, K Wang, PX Li, L Yin, S Liu
arXiv, October 2025
Reasoning Benchmark QA RAG

📝 Paper Summary

LLM Interpretability Model Compression Mechanistic Interpretability
LLM depth utilization is highly heterogeneous: likelihood metrics and knowledge tasks rely on shallow layers, while generation metrics and complex reasoning require middle and deep layers, especially in distilled models.
Core Problem
Recent claims that deep layers in LLMs are redundant rely on narrow likelihood-based evaluations, failing to capture the critical role deep layers play in generation coherence and complex reasoning.
Why it matters:
  • Aggressive pruning based on flawed metrics (like log-likelihood) may destroy a model's ability to reason or maintain long-range coherence
  • Understanding where capabilities like retrieval vs. reasoning reside is crucial for efficient model compression and distillation
  • Current benchmarks often overlook the fragility of deep layers in specific tasks like math or multi-step reasoning
Concrete Example: When evaluating a layer-pruned model using standard multiple-choice accuracy (log-likelihood), performance appears stable. However, when the same pruned model is asked to generate a full chain-of-thought solution for a math problem (GSM8K), it fails catastrophically because the reasoning logic residing in deep layers was removed.
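The metric gap in this example can be made concrete with a toy sketch (our illustration, not the paper's code, and the numbers are hypothetical): multiple-choice evaluation only asks the model to *rank* fixed answer strings by log-probability, while generation evaluation requires it to *produce* the answer itself.

```python
import math

def loglik_score(token_logprobs: dict, options: list) -> str:
    """Multiple-choice scoring: pick the option with the highest
    summed token log-probability (ranking only, no generation)."""
    def total(opt):
        # assign a small floor probability to tokens the model never scored
        return sum(token_logprobs.get(tok, math.log(1e-6)) for tok in opt.split())
    return max(options, key=total)

def generation_score(generated: str, reference: str) -> bool:
    """Free-form scoring: exact match on the model's own generated answer."""
    return generated.strip() == reference.strip()

# Hypothetical pruned-model outputs: the ranking between options is preserved,
# so likelihood-based accuracy looks fine...
logprobs = {"(A)": -2.0, "(B)": -0.5}
choice = loglik_score(logprobs, ["(A)", "(B)"])   # still picks "(B)"

# ...but the degraded model can no longer generate a coherent answer string.
ok = generation_score("(A", "(B)")                # exact match fails
```

Both scores come from the same model state; only the evaluation protocol differs, which is exactly why the two metrics can disagree about how much a pruned layer mattered.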
Key Novelty
Task- and Metric-Aware Depth Analysis
  • Demonstrates that 'layer importance' is an artifact of the evaluation metric: likelihood metrics hide deep-layer degradation that generation metrics reveal
  • Identifies a functional split in depth: shallow layers handle knowledge/retrieval, while middle/deep layers handle reasoning and coherence
  • Shows that distillation redistributes reasoning capabilities, making them more robust and spread across middle layers rather than just deep ones
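The analyses above all rest on single-layer ablation. A minimal sketch of that mechanic (assumed from the summary, with a decoder-only LLM abstracted as a stack of layer functions on toy scalar inputs):

```python
def make_model(layers):
    """Compose a stack of layer functions into one forward pass,
    mimicking a decoder-only transformer's sequential residual stream."""
    def forward(x):
        for layer in layers:
            x = layer(x)
        return x
    return forward

def prune(layers, drop_idx):
    """Return a copy of the layer stack with one layer ablated."""
    return [layer for i, layer in enumerate(layers) if i != drop_idx]

# Toy 4-layer "model" operating on scalars instead of hidden states.
layers = [lambda x: x + 1, lambda x: x * 2, lambda x: x + 3, lambda x: x * 2]
full = make_model(layers)                    # full(1) == 14
shallow_pruned = make_model(prune(layers, 0))  # drop the first layer
```

In a real setup the same idea amounts to deleting one decoder block from the model's layer list before re-running evaluation; the point of the sketch is that "layer importance" is simply the change in downstream score after this deletion.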
Evaluation Highlights
  • Likelihood-based evaluation underestimates pruning impact: on MMLU, log-likelihood accuracy degrades only when early layers are pruned, while generation accuracy degrades at every depth
  • Retrieval depends on shallow layers: pruning layers 1-2 drops KV Retrieval accuracy by up to 0.8, while deep layers contribute almost nothing
  • Reasoning is fragile in deep layers: pruning specific deep layers (e.g., layer 35 in Qwen) drops GSM8K accuracy by roughly 60 points (delta -0.6)
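The per-layer deltas reported above come from a sweep of this form (our paraphrase of the assumed protocol; the accuracies below are hypothetical placeholders echoing the reported pattern, not the paper's data):

```python
def ablation_sweep(evaluate, n_layers):
    """Prune each layer in turn, re-evaluate, and report the accuracy
    delta of each single-layer ablation relative to the full model."""
    baseline = evaluate(pruned_layer=None)  # unpruned reference score
    return {i: evaluate(pruned_layer=i) - baseline for i in range(n_layers)}

# Hypothetical retrieval-task accuracies for a toy 4-layer model:
# shallow layers carry the capability, deep layers barely matter.
toy_acc = {None: 0.9, 0: 0.1, 1: 0.2, 2: 0.88, 3: 0.89}
deltas = ablation_sweep(lambda pruned_layer: toy_acc[pruned_layer], 4)
# deltas[0] is about -0.8, matching the shallow-layer sensitivity pattern
```

Running the same sweep under a likelihood metric and a generation metric, per task, is what exposes the metric dependence the paper argues for.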
Breakthrough Assessment
7/10
Provides a crucial correction to the 'deep layers are useless' narrative by proving metric dependence. Useful practical insights for pruning, though the core technique (layer pruning) is standard.