
Demystifying the Roles of LLM Layers in Retrieval, Knowledge, and Reasoning

X Song, K Wang, PX Li, L Yin, S Liu
arXiv, October 2025
Reasoning Benchmark QA RAG

📝 Paper Summary

LLM Interpretability Model Compression Mechanistic Interpretability
LLM depth utilization is highly heterogeneous: likelihood metrics and knowledge tasks rely on shallow layers, while generation metrics and complex reasoning require middle and deep layers, especially in distilled models.
Core Problem
Recent claims that deep layers in LLMs are redundant rely on narrow likelihood-based evaluations, failing to capture the critical role deep layers play in generation coherence and complex reasoning.
Why it matters:
  • Aggressive pruning based on flawed metrics (like log-likelihood) may destroy a model's ability to reason or maintain long-range coherence
  • Understanding where capabilities like retrieval vs. reasoning reside is crucial for efficient model compression and distillation
  • Current benchmarks often overlook the fragility of deep layers in specific tasks like math or multi-step reasoning
Concrete Example: When evaluating a layer-pruned model using standard multiple-choice accuracy (log-likelihood), performance appears stable. However, when the same pruned model is asked to generate a full chain-of-thought solution for a math problem (GSM8K), it fails catastrophically because the reasoning logic residing in deep layers was removed.
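The metric gap in this example can be made concrete with a toy sketch (our illustration, not the paper's code, and the numbers are hypothetical): multiple-choice evaluation only asks the model to *rank* fixed answer strings by log-probability, while generation evaluation requires it to *produce* the answer itself.

```python
import math

def loglik_score(token_logprobs: dict, options: list) -> str:
    """Multiple-choice scoring: pick the option with the highest
    summed token log-probability (ranking only, no generation)."""
    def total(opt):
        # assign a small floor probability to tokens the model never scored
        return sum(token_logprobs.get(tok, math.log(1e-6)) for tok in opt.split())
    return max(options, key=total)

def generation_score(generated: str, reference: str) -> bool:
    """Free-form scoring: exact match on the model's own generated answer."""
    return generated.strip() == reference.strip()

# Hypothetical pruned-model outputs: the ranking between options is preserved,
# so likelihood-based accuracy looks fine...
logprobs = {"(A)": -2.0, "(B)": -0.5}
choice = loglik_score(logprobs, ["(A)", "(B)"])   # still picks "(B)"

# ...but the degraded model can no longer generate a coherent answer string.
ok = generation_score("(A", "(B)")                # exact match fails
```

Both scores come from the same model state; only the evaluation protocol differs, which is exactly why the two metrics can disagree about how much a pruned layer mattered.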
Key Novelty
Task- and Metric-Aware Depth Analysis
  • Demonstrates that 'layer importance' is an artifact of the evaluation metric: likelihood metrics hide deep-layer degradation that generation metrics reveal
  • Identifies a functional split in depth: shallow layers handle knowledge/retrieval, while middle/deep layers handle reasoning and coherence
  • Shows that distillation redistributes reasoning capabilities, making them more robust and spread across middle layers rather than just deep ones
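The analyses above all rest on single-layer ablation. A minimal sketch of that mechanic (assumed from the summary, with a decoder-only LLM abstracted as a stack of layer functions on toy scalar inputs):

```python
def make_model(layers):
    """Compose a stack of layer functions into one forward pass,
    mimicking a decoder-only transformer's sequential residual stream."""
    def forward(x):
        for layer in layers:
            x = layer(x)
        return x
    return forward

def prune(layers, drop_idx):
    """Return a copy of the layer stack with one layer ablated."""
    return [layer for i, layer in enumerate(layers) if i != drop_idx]

# Toy 4-layer "model" operating on scalars instead of hidden states.
layers = [lambda x: x + 1, lambda x: x * 2, lambda x: x + 3, lambda x: x * 2]
full = make_model(layers)                    # full(1) == 14
shallow_pruned = make_model(prune(layers, 0))  # drop the first layer
```

In a real setup the same idea amounts to deleting one decoder block from the model's layer list before re-running evaluation; the point of the sketch is that "layer importance" is simply the change in downstream score after this deletion.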
Evaluation Highlights
  • Likelihood-based evaluation underestimates pruning impact: on MMLU, log-likelihood accuracy degrades only when early layers are pruned, while generation accuracy degrades at every depth
  • Retrieval depends on shallow layers: pruning layers 1-2 drops KV Retrieval accuracy by up to 0.8, while deep layers contribute almost nothing
  • Reasoning is fragile in deep layers: pruning specific deep layers (e.g., layer 35 in Qwen) drops GSM8K accuracy by roughly 60 points (delta -0.6)
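The per-layer deltas reported above come from a sweep of this form (our paraphrase of the assumed protocol; the accuracies below are hypothetical placeholders echoing the reported pattern, not the paper's data):

```python
def ablation_sweep(evaluate, n_layers):
    """Prune each layer in turn, re-evaluate, and report the accuracy
    delta of each single-layer ablation relative to the full model."""
    baseline = evaluate(pruned_layer=None)  # unpruned reference score
    return {i: evaluate(pruned_layer=i) - baseline for i in range(n_layers)}

# Hypothetical retrieval-task accuracies for a toy 4-layer model:
# shallow layers carry the capability, deep layers barely matter.
toy_acc = {None: 0.9, 0: 0.1, 1: 0.2, 2: 0.88, 3: 0.89}
deltas = ablation_sweep(lambda pruned_layer: toy_acc[pruned_layer], 4)
# deltas[0] is about -0.8, matching the shallow-layer sensitivity pattern
```

Running the same sweep under a likelihood metric and a generation metric, per task, is what exposes the metric dependence the paper argues for.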
Breakthrough Assessment
7/10
Provides a crucial correction to the 'deep layers are useless' narrative by proving metric dependence. Useful practical insights for pruning, though the core technique (layer pruning) is standard.