Entropy-Based Block Pruning for Efficient Large Language Models

📝 Paper Summary

Model Compression, Large Language Model Inference Efficiency

EntroDrop prunes redundant transformer blocks by measuring the entropy increase in hidden states and removing blocks that contribute little new information. It uses this information-theoretic signal in place of geometric similarity metrics such as cosine similarity.

Core Problem

Large Language Models are computationally expensive to deploy. Existing block-pruning methods try to reduce that cost, but they rely on cosine similarity, which measures geometric alignment rather than actual information richness.

Why it matters:

LLMs scale to billions of parameters, creating massive storage and compute demands for deployment
Existing metrics like cosine similarity may misidentify redundant blocks, leading to suboptimal pruning decisions that degrade model accuracy
Attention blocks in particular are highly redundant but computationally expensive

Concrete Example: In Llama-3.1-8B, entropy rises gradually from layer 3 through layer 32. A cosine-based method might keep a layer because its output is geometrically distinct from its input, even if the entropy change across that layer is minimal, meaning the layer adds little new information.

Key Novelty

Entropy-Based Block Pruning (EntroDrop)

Analyzes information flow via entropy dynamics: early layers compress information (entropy decreases), while later layers enrich it (entropy increases)
Uses entropy increase as a direct proxy for 'information contribution'; layers with minimal entropy increase in the enrichment stage are deemed redundant and pruned
Replaces the de facto cosine similarity metric with entropy estimation (bucket-based or KNN) for more reliable redundancy detection

Architecture

The EntroDrop framework pipeline. It illustrates the process of feeding calibration data, estimating entropy per block, ranking blocks by entropy increase, and pruning the least informative ones.

Evaluation Highlights

Removing 12 attention layers (37.5% of total) in Llama-3.1-8B retains >95% of original performance across multiple benchmarks
Outperforms cosine similarity-based baselines (ShortGPT, LaCo, LLMDrop) on MMLU and reasoning tasks while reducing inference latency
Inference speed increases linearly with pruning; removing 12 layers provides significant speedup with minimal accuracy loss

Breakthrough Assessment

7/10

Offers a theoretically grounded shift from geometric (cosine) to information-theoretic (entropy) pruning metrics. Strong empirical results on modern LLMs, though the method is a refinement of existing block pruning rather than a new architecture.

⚙️ Technical Details

Problem Definition

Setting: Post-training structured pruning of Transformer blocks without fine-tuning

Inputs: Pre-trained LLM and a small calibration dataset D

Outputs: Pruned model with subset of original blocks
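
To make the output concrete, the following is a minimal sketch of what "a subset of the original blocks" can mean for a residual decoder. It assumes a toy pre-norm layer in which attention and MLP each sit inside a residual connection, so removing a sub-block leaves the residual stream unchanged at that point. The names `Layer`, `forward`, and `prune_attention` are hypothetical, and the stand-in sub-blocks are not real attention or MLP computations.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Set

Vector = List[float]
SubBlock = Callable[[Vector], Vector]


@dataclass
class Layer:
    """Toy decoder layer: x + attn(x), then x + mlp(x)."""
    attn: Optional[SubBlock]  # None means the attention block was pruned
    mlp: Optional[SubBlock]   # None means the MLP block was pruned


def add(a: Vector, b: Vector) -> Vector:
    return [x + y for x, y in zip(a, b)]


def forward(layers: List[Layer], x: Vector) -> Vector:
    for layer in layers:
        if layer.attn is not None:
            x = add(x, layer.attn(x))  # residual around attention
        if layer.mlp is not None:
            x = add(x, layer.mlp(x))   # residual around MLP
    return x


def prune_attention(layers: List[Layer], drop: Set[int]) -> List[Layer]:
    """Return a copy with attention removed from the listed layer indices."""
    return [
        Layer(attn=None if i in drop else layer.attn, mlp=layer.mlp)
        for i, layer in enumerate(layers)
    ]


# Stand-in sub-blocks, chosen only so the arithmetic is easy to follow.
def halve(x: Vector) -> Vector:
    return [0.5 * v for v in x]


def ones(x: Vector) -> Vector:
    return [1.0 for _ in x]


model = [Layer(halve, ones), Layer(halve, ones)]
pruned = prune_attention(model, drop={1})
```

No weights are modified in this sketch; the pruned model simply skips the selected sub-blocks, which is consistent with the no-fine-tuning setting described above.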

Pipeline Flow

Calibration Pass: Run data through model to collect hidden states
Entropy Estimation: Calculate entropy of hidden states at each block
Stage Identification: Detect transition from entropy decrease (compression) to increase (enrichment)
Ranking & Pruning: Rank blocks in enrichment stage by entropy increase; remove bottom-K
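
The last two steps can be sketched as follows, given a per-block entropy trace that has already been estimated. This is an illustration under stated assumptions, not the authors' implementation: the turning point is taken to be the global entropy minimum, `entropy[0]` is the entropy of the input to block 0, `entropy[i + 1]` is the entropy after block `i`, and the function names and the example trace are made up.

```python
from typing import List


def find_turning_point(entropy: List[float]) -> int:
    """Index of the minimum entropy, taken here as the end of compression."""
    return min(range(len(entropy)), key=entropy.__getitem__)


def select_blocks_to_prune(entropy: List[float], k: int) -> List[int]:
    """Pick the k enrichment-stage blocks with the smallest entropy increase."""
    start = find_turning_point(entropy)
    n_blocks = len(entropy) - 1
    # Entropy increase contributed by each block in the enrichment stage.
    increases = {i: entropy[i + 1] - entropy[i] for i in range(start, n_blocks)}
    ranked = sorted(increases, key=increases.get)  # smallest increase first
    return sorted(ranked[:k])


# Hypothetical trace with a check-mark shape: 7 blocks, 8 entropy readings.
entropy_trace = [5.0, 4.0, 3.0, 3.5, 3.6, 4.4, 4.5, 5.5]
to_prune = select_blocks_to_prune(entropy_trace, k=2)
```

In this trace, blocks 0 and 1 fall in the compression stage and are never candidates, while blocks 3 and 5 each raise entropy by only about 0.1 and are selected.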

System Modules

Entropy Estimator

Estimate entropy of hidden state activations to quantify information content

Model or implementation: Bucket-based or KNN-based estimator
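
A bucket-based estimator can be sketched as a histogram entropy. The version below is an assumption about the general technique rather than the paper's exact procedure: it uses equal-width buckets per hidden dimension and averages the per-dimension entropies. The function names and the default of 16 buckets are illustrative.

```python
import math
from typing import Sequence


def bucket_entropy(values: Sequence[float], n_buckets: int = 16) -> float:
    """Shannon entropy (nats) of an equal-width histogram over `values`."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return 0.0  # all mass falls in a single bucket
    width = (hi - lo) / n_buckets
    counts = [0] * n_buckets
    for v in values:
        idx = min(int((v - lo) / width), n_buckets - 1)  # clamp the maximum
        counts[idx] += 1
    n = len(values)
    return -sum((c / n) * math.log(c / n) for c in counts if c > 0)


def hidden_state_entropy(
    states: Sequence[Sequence[float]], n_buckets: int = 16
) -> float:
    """Average per-dimension bucket entropy over a batch of hidden states."""
    dims = list(zip(*states))  # transpose: one tuple of values per dimension
    return sum(bucket_entropy(d, n_buckets) for d in dims) / len(dims)
```

One design caveat: this sketch rebuilds the bucket edges from each input's own range, so it is insensitive to a pure rescaling of the activations. Comparing entropy across blocks may call for bucket edges that are shared between blocks.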

Pruning Selector

Select blocks for removal based on minimal entropy increase

Model or implementation: Ranking logic

Novel Architectural Elements

Use of entropy dynamics (specifically the transition from decrease to increase) to define pruning eligibility zones

Modeling

Base Model: Llama-3.1-8B and Mistral-7B-v0.3

Compute: Inference-only method. Experiments run on a single A100 GPU with 40 GB of memory.

Comparison to Prior Work

vs. ShortGPT/LaCo: Uses entropy increase instead of cosine similarity; focuses on information richness rather than geometric alignment
vs. LLMDrop: Both focus on attention blocks, but EntroDrop uses an entropy metric, which correlates better with performance preservation
vs. SliceGPT [not cited in paper]: SliceGPT prunes rows/columns (structured) via matrix decomposition, whereas EntroDrop removes entire blocks
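
For contrast, the cosine-style redundancy signal can be sketched as the mean cosine similarity between a block's input and output hidden states, where a value near 1 marks the block as a removal candidate. This is a generic illustration; the exact scoring rules of ShortGPT, LaCo, and LLMDrop differ in their details and are not reproduced here, and the function names are hypothetical.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def block_similarity(
    inputs: Sequence[Sequence[float]], outputs: Sequence[Sequence[float]]
) -> float:
    """Mean cosine similarity between each token's state before and after a block."""
    sims = [cosine_similarity(x, y) for x, y in zip(inputs, outputs)]
    return sum(sims) / len(sims)
```

Because the score depends only on the angle between input and output, a block that rotates its input scores as highly non-redundant regardless of how the overall distribution of hidden states changes.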

Limitations

Renyi entropy estimation performs poorly, so the estimation method (bucket-based or KNN) must be chosen carefully
Relies on calibration data, though shown to be relatively robust to domain shifts
Pruning is static; does not adapt dynamically per sample during inference

Reproducibility

Code: https://github.com/SalesforceAIResearch/EntroDrop

Calibration datasets are standard (C4, WikiText, etc.). Hyperparameters for entropy estimation (number of bins, K) are analyzed in the paper.

📊 Experiments & Results

Evaluation Setup

Zero-shot evaluation on reasoning and QA benchmarks after pruning

Benchmarks:

MMLU (General Knowledge)
HellaSwag (Commonsense Reasoning)
ARC-C (Scientific QA)
WinoGrande (Commonsense Reasoning)
PIQA (Physical Commonsense)

Metrics:

Accuracy
Statistical methodology: Not explicitly reported in the paper

Key Results

Comparative performance on Llama-3.1-8B when pruning 20% of layers (approx. 6-7 layers). EntroDrop consistently outperforms similarity-based baselines.

Benchmark	Metric	Baseline	This Paper	Δ
MMLU	Accuracy	58.2	64.3	+6.1
MMLU	Accuracy	63.7	64.3	+0.6
ARC-C	Accuracy	46.2	50.1	+3.9

Comparative performance on Mistral-7B-v0.3 when pruning 20% of layers.

Benchmark	Metric	Baseline	This Paper	Δ
MMLU	Accuracy	56.8	59.2	+2.4

Experiment Figures

Entropy dynamics across layers for Llama-3.1-8B and Mistral-7B. It shows a 'check-mark' shape: entropy drops in early layers (1-3) and rises in later layers.

Trade-off between Inference Time, Accuracy, and Number of Pruned Layers.

Main Takeaways

Information flow in LLMs has two stages: initial entropy decrease (compression) followed by entropy increase (enrichment).
Entropy-based pruning identifies redundant layers more effectively than cosine similarity, preserving higher accuracy at the same sparsity.
The method is robust to the choice of calibration dataset; calibrating on medical or legal text yields pruning masks similar to those obtained with general text.
Bucket-based and KNN entropy estimators work well, while Renyi entropy is unstable/ineffective for this task.
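
A KNN entropy estimator can be sketched with the standard Kozachenko-Leonenko formula, which turns the distance from each point to its k-th nearest neighbor into a local density estimate. Whether EntroDrop uses this exact estimator is not stated here, so treat it as an assumption. The brute-force neighbor search, the helper `digamma_int`, and the Gaussian test data are illustrative only.

```python
import math
import random
from typing import Sequence


def digamma_int(n: int) -> float:
    """Digamma at a positive integer: psi(n) = -gamma + sum_{j=1}^{n-1} 1/j."""
    euler_gamma = 0.5772156649015329
    return -euler_gamma + sum(1.0 / j for j in range(1, n))


def knn_entropy(points: Sequence[Sequence[float]], k: int = 3) -> float:
    """Kozachenko-Leonenko differential entropy estimate in nats.

    Brute-force O(n^2) search; assumes no duplicate points (log of zero).
    """
    n, d = len(points), len(points[0])
    # Log-volume of the d-dimensional unit ball under the Euclidean norm.
    log_unit_ball = (d / 2) * math.log(math.pi) - math.lgamma(d / 2 + 1)
    log_radius_sum = 0.0
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        log_radius_sum += math.log(dists[k - 1])  # distance to k-th neighbor
    return digamma_int(n) - digamma_int(k) + log_unit_ball + d * log_radius_sum / n


random.seed(0)
narrow = [[random.gauss(0.0, 1.0)] for _ in range(300)]
wide = [[random.gauss(0.0, 2.0)] for _ in range(300)]
h_narrow = knn_entropy(narrow)
h_wide = knn_entropy(wide)
```

Doubling the standard deviation of a Gaussian raises its differential entropy by log 2, so `h_wide` should come out larger than `h_narrow`. Unlike the min-max histogram approach, this estimate is sensitive to the scale of the data.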

📚 Prerequisite Knowledge

Prerequisites

Transformer architecture (Attention vs. MLP blocks)
Information theory (Entropy)
Model pruning/compression techniques

Key Terms

Entropy: A measure of uncertainty or information content in a probability distribution; here used to quantify information richness of hidden states

Cosine Similarity: A metric measuring the cosine of the angle between two vectors, commonly used to assess geometric similarity between layer inputs and outputs

KNN: K-Nearest Neighbors, a non-parametric method used here to estimate entropy from the local density of data points

Renyi Entropy: A generalization of Shannon entropy; found to be less effective than Shannon entropy for this specific pruning task

Block-wise Pruning: Removing entire computational units (like a whole Attention layer) rather than individual weights

MMLU: Massive Multitask Language Understanding, a benchmark covering diverse knowledge domains
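
For reference, the entropy terms above can be written out as follows. These are the standard textbook definitions for a discrete distribution $p$. The per-block notation $\Delta H_\ell$ and $X_\ell$ (the hidden states after block $\ell$) is introduced here for illustration and is not taken from the paper.

```latex
% Shannon entropy of a discrete distribution p
H(X) = -\sum_{x} p(x) \log p(x)

% Renyi entropy of order \alpha (\alpha > 0, \alpha \neq 1);
% it recovers Shannon entropy in the limit \alpha \to 1
H_{\alpha}(X) = \frac{1}{1 - \alpha} \log \sum_{x} p(x)^{\alpha}

% Entropy increase attributed to block \ell (assumed notation)
\Delta H_{\ell} = H(X_{\ell}) - H(X_{\ell - 1})
```

Under this notation, the compression stage corresponds to blocks with $\Delta H_\ell < 0$ and the enrichment stage to blocks with $\Delta H_\ell > 0$. Pruning targets the enrichment-stage blocks with the smallest $\Delta H_\ell$.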