Quantization Hurts Reasoning? An Empirical Study on Quantized Reasoning Models

📝 Paper Summary

LLM Quantization, Reasoning Models

A systematic study finds that 8-bit quantization (W8A8KV8) is near-lossless for reasoning models and 4-bit weight-only quantization is generally safe, whereas 4-bit weight-activation quantization and 3-bit weights cause significant degradation, especially on harder tasks and smaller models, with sensitivity varying across model families.

Core Problem

Reasoning models like DeepSeek-R1 improve performance via long chain-of-thought processes but suffer from high inference overheads. Standard quantization methods used for non-reasoning LLMs may degrade these delicate reasoning chains.

Why it matters:

Inference costs for reasoning models are prohibitively high due to extended output lengths (often 100x longer than standard LLMs)
Quantization errors might accumulate over long chain-of-thought sequences, causing the model to deviate from correct logical paths
Existing quantization research focuses on general LLMs, leaving the specific sensitivity of reasoning-specialized models under-explored

Concrete Example: A DeepSeek-R1-Distill-Qwen-1.5B model drops over 10% in accuracy on AIME-120 when using 4-bit weight-activation quantization, whereas non-reasoning tasks often tolerate similar compression with less loss.

Key Novelty

First systematic empirical study of quantized reasoning models

Evaluates impact of Weight, KV Cache, and Activation quantization across varied bit-widths on specialized reasoning models (DeepSeek-R1 distillations, QwQ)
Identifies task difficulty as a key predictor of quantization failure (harder math benchmarks such as AIME-120 suffer up to 4x more degradation than simpler ones such as GSM8K)
Finds that, unlike standard LLMs, quantized reasoning models do not produce longer outputs; they simply become less accurate

Architecture

Overview of the study's scope: evaluating Quantization Methods (Weight, Activation, KV Cache) on Reasoning Models (Distilled, RL-based) across various Reasoning Benchmarks.

Evaluation Highlights

W8A8KV8 quantization is near-lossless (<1% drop) across all models and tasks, even for 1.5B models.
Harder tasks like AIME-120 suffer up to 4x greater degradation from quantization than simpler tasks like GSM8K.
DeepSeek-R1-Distill-Qwen-32B loses only 0.4% accuracy with 4-bit weights but more than 3% with 3-bit weights.
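
The configuration labels used throughout (W8A8KV8, W4A4KV4) pack three bit-widths into one string: weights, activations, and KV cache. The sketch below shows one way to read them. The `QuantConfig` type and `parse_quant_config` helper are hypothetical names introduced for illustration, and treating a missing `A` or `KV` part as "left in 16-bit" is an assumption, not a convention taken from the paper.

```python
import re
from typing import NamedTuple, Optional


class QuantConfig(NamedTuple):
    weight_bits: int
    act_bits: Optional[int]  # None: activations assumed to stay in 16-bit
    kv_bits: Optional[int]   # None: KV cache assumed to stay in 16-bit


# "W" is required; the "A" and "KV" parts are optional in this sketch.
_LABEL = re.compile(r"^W(\d+)(?:A(\d+))?(?:KV(\d+))?$")


def parse_quant_config(label: str) -> QuantConfig:
    """Split a label such as 'W4A4KV4' into its three bit-widths."""
    match = _LABEL.match(label)
    if match is None:
        raise ValueError(f"unrecognized quantization label: {label!r}")
    w, a, kv = match.groups()
    return QuantConfig(int(w), int(a) if a else None, int(kv) if kv else None)


print(parse_quant_config("W8A8KV8"))
print(parse_quant_config("W4A4KV4"))
```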

Breakthrough Assessment

7/10

Provides critical empirical guidance for deploying efficient reasoning models. While not proposing a new algorithm, the comprehensive benchmarking of existing methods on this new model class is highly valuable for practitioners.

⚙️ Technical Details

Problem Definition

Setting: Post-training quantization of reasoning-enhanced Large Language Models

Inputs: High-precision BF16 weights W and activations X of a pre-trained reasoning model

Outputs: Quantized model with lower-precision integers (e.g., INT4, INT8)
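
To make the input/output mapping concrete, here is a minimal sketch of symmetric uniform round-to-nearest quantization, the basic operation that the studied algorithms refine. It is not any specific method from the paper; the function names and the example weights are made up for illustration, and a single per-tensor scale is assumed.

```python
def quantize_symmetric(values, bits):
    """Map floats to signed integer codes in [-qmax, qmax] with one shared scale."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from integer codes."""
    return [c * scale for c in codes]


def max_abs_error(values, bits):
    """Largest absolute reconstruction error after a quantize/dequantize round trip."""
    codes, scale = quantize_symmetric(values, bits)
    restored = dequantize(codes, scale)
    return max(abs(v - r) for v, r in zip(values, restored))


# Illustrative weights only, not taken from any model.
weights = [0.02, -0.51, 0.33, 1.20, -0.07, 0.88]
print("INT8 max error:", max_abs_error(weights, 8))
print("INT4 max error:", max_abs_error(weights, 4))
```

With round-to-nearest, the error per value is bounded by half the scale, so each bit removed roughly doubles the worst-case error. This is the baseline cost that calibration-based methods try to reduce.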

Pipeline Flow

Input Reasoning Model (BF16)
Quantization Algorithm Application (Weight / Activation / KV)
Inference on Reasoning Benchmarks

System Modules

Weight Quantization

Reduce precision of linear layer weights

Model or implementation: Algorithms: GPTQ, AWQ

KV Cache Quantization

Reduce precision of stored Key-Value pairs during generation

Model or implementation: Algorithms: QuaRot, KVQuant*
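
A back-of-the-envelope estimate shows why KV cache precision matters for long chain-of-thought outputs. The sketch below uses the standard size formula (two tensors per layer, one for keys and one for values). The model dimensions are hypothetical placeholders rather than the configuration of any model in the study, and the overhead of storing scales and zero-points is ignored.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bits, batch_size=1):
    """Approximate KV cache size in bytes, ignoring quantization metadata."""
    # Keys and values are both cached, hence the factor of 2.
    elements = 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size
    return elements * bits / 8


# Hypothetical dimensions chosen only to illustrate the arithmetic.
dims = dict(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=32768)

for bits in (16, 8, 4):
    gib = kv_cache_bytes(bits=bits, **dims) / 2**30
    print(f"KV{bits}: {gib:.1f} GiB per sequence")
```

The size scales linearly with both sequence length and bit-width, so halving the bit-width frees the same memory as halving the reasoning trace.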

Activation Quantization

Reduce precision of dynamic activations during matrix multiplication

Model or implementation: Algorithms: SmoothQuant, FlatQuant, MXFP4
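
As an illustration of how activation outliers can be tamed before quantization, the sketch below applies the per-channel scale migration described for SmoothQuant: each input channel j is divided by s_j = max|X_j|^alpha / max|W_j|^(1 - alpha) on the activation side and multiplied by s_j on the weight side, leaving the layer output unchanged. This is a simplified pure-Python rendering under stated assumptions (tiny made-up matrices, non-zero channel maxima, alpha = 0.5), not the reference implementation.

```python
def matmul(A, B):
    """Plain nested-list matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def smooth_scales(X, W, alpha=0.5):
    """Per-input-channel scales; X is tokens x in_features, W is in_features x out_features."""
    scales = []
    for j in range(len(W)):
        act_max = max(abs(row[j]) for row in X)  # assumed non-zero
        w_max = max(abs(w) for w in W[j])        # assumed non-zero
        scales.append(act_max ** alpha / w_max ** (1 - alpha))
    return scales


def apply_smoothing(X, W, scales):
    """Divide activation channels by the scales and multiply the matching weight rows."""
    X_s = [[x / s for x, s in zip(row, scales)] for row in X]
    W_s = [[w * s for w in row] for row, s in zip(W, scales)]
    return X_s, W_s


# Made-up data: channel 1 of the activations carries a large outlier.
X = [[0.1, 20.0, -0.3], [-0.2, -18.0, 0.4]]
W = [[0.5, -0.4], [0.3, 0.2], [-0.6, 0.7]]

scales = smooth_scales(X, W)
X_s, W_s = apply_smoothing(X, W, scales)

print("max |X| before:", max(abs(v) for row in X for v in row))
print("max |X| after: ", max(abs(v) for row in X_s for v in row))
```

Because the outlier magnitude is partly moved into the weights, the activation range that a low-bit quantizer has to cover shrinks while `matmul(X_s, W_s)` matches `matmul(X, W)` up to floating-point error.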

Modeling

Base Model: DeepSeek-R1-Distill-Qwen (1.5B, 7B, 14B, 32B), DeepSeek-R1-Distill-LLaMA (8B, 70B), QwQ-32B, Qwen3-8B

Training Method: Evaluation only (Post-Training Quantization applied to pre-trained models)

Adaptation: Quantization only

Trainable Parameters: None (Quantization calibration only)

Compute: Not reported in the paper

Comparison to Prior Work

vs. Standard LLM Quantization studies: This paper focuses specifically on reasoning models with long CoT, finding that they are more sensitive to quantization on hard tasks.
vs. Kurtić et al. (2025): Concurrent work on reasoning model quantization [not cited in paper].

Limitations

Evaluation is limited to DeepSeek-R1 distillations and Qwen-family models (QwQ, Qwen3); closed models such as o1 are not tested.
Does not propose a new quantization algorithm specifically designed for reasoning models.
Focuses on uniform quantization; mixed-precision strategies are not deeply explored.

Reproducibility

Code: https://github.com/ruikangliu/Quantized-Reasoning-Models

All quantized models and evaluation code are open-sourced at https://github.com/ruikangliu/Quantized-Reasoning-Models. Experiments use Lighteval with a vLLM backend.

📊 Experiments & Results

Evaluation Setup

Zero-shot reasoning evaluation using Lighteval and vLLM

Benchmarks:

AIME-120 (Hard Mathematical Reasoning)
MATH-500 (General Mathematical Reasoning)
GSM8K (Basic Arithmetic Reasoning)
GPQA-Diamond (Scientific Reasoning)
LiveCodeBench (Code Generation)

Metrics:

Accuracy (Pass@1)
Statistical methodology: Experiments are repeated with three different seeds to reduce run-to-run variance
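
Repeating each run with three seeds allows a mean and a spread to be reported per configuration. The sketch below shows that aggregation with Python's standard `statistics` module. The scores are hypothetical placeholders, not results from the paper, and `summarize_runs` is a name introduced here for illustration.

```python
from statistics import mean, stdev


def summarize_runs(scores):
    """Return (mean, sample standard deviation) over per-seed accuracies."""
    return mean(scores), stdev(scores)


# Placeholder accuracies for three seeds; NOT values reported in the paper.
hypothetical_scores = [50.0, 52.0, 51.0]

avg, spread = summarize_runs(hypothetical_scores)
print(f"accuracy = {avg:.1f} +/- {spread:.1f} over {len(hypothetical_scores)} seeds")
```

A quantization-induced drop smaller than the seed-to-seed spread is hard to distinguish from noise, which is worth keeping in mind when reading sub-1% differences.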

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Weight-only quantization results show that 4-bit is generally safe, but 3-bit causes sharp degradation, especially in smaller models.
Average (Math Benchmarks)	Accuracy	47.7	45.6	-2.1
Average (Math Benchmarks)	Accuracy	74.8	74.4	-0.4
Average (Math Benchmarks)	Accuracy	47.7	40.6	-7.1
Weight-Activation-KV quantization (W4A4KV4) is highly destructive for smaller models but manageable for larger ones using advanced algorithms like FlatQuant.
Average (Math Benchmarks)	Accuracy	74.8	71.9	-2.9
Average (Math Benchmarks)	Accuracy	74.8	63.0	-11.8
Task difficulty analysis confirms harder benchmarks suffer significantly more from quantization.
AIME-120	Accuracy Drop	0.0	-3.9	-3.9
GSM8K	Accuracy Drop	0.0	-0.0	0.0
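
The Δ column is simply the quantized score minus the baseline. The sketch below recomputes it from the Baseline and This Paper values of the math-benchmark rows above and adds a relative drop, which makes rows with different baselines easier to compare. The helper names are illustrative, and the relative-drop view is an addition here, not a metric from the paper.

```python
# (baseline, quantized) pairs copied from the "Average (Math Benchmarks)" rows.
rows = [
    (47.7, 45.6),
    (74.8, 74.4),
    (47.7, 40.6),
    (74.8, 71.9),
    (74.8, 63.0),
]


def delta(baseline, quantized):
    """Absolute change in accuracy points, rounded to the table's precision."""
    return round(quantized - baseline, 1)


def relative_drop_pct(baseline, quantized):
    """Accuracy lost as a percentage of the baseline score."""
    return round(100 * (baseline - quantized) / baseline, 1)


deltas = [delta(b, q) for b, q in rows]
print(deltas)

for b, q in rows:
    print(f"{b} -> {q}: {relative_drop_pct(b, q)}% of baseline lost")
```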

Main Takeaways

W8A8KV8 is safe for reasoning models, while W4A4KV4 carries significant risks, especially for smaller models (<7B).
AWQ is recommended for weight-only quantization; QuaRot is generally best for KV cache (except on small Qwen models with outlier biases); FlatQuant dominates for 4-bit weight-activation quantization.
Quantization does not increase reasoning output length (token count), but simply degrades the logical correctness of the chain-of-thought.
Reasoning models trained with RL (e.g., QwQ) and Distillation (e.g., DeepSeek-R1-Distill) show different sensitivities to quantization, even when based on the same architecture.

📚 Prerequisite Knowledge

Prerequisites

Understanding of LLM inference (Weights, Activations, KV Cache)
Basic quantization concepts (Uniform quantization, symmetric vs. asymmetric)
Familiarity with reasoning benchmarks (GSM8K, MATH, AIME)

Key Terms

CoT: Chain-of-Thought—a reasoning process where the model generates intermediate steps before the final answer

KV Cache: Key-Value Cache, which stores the attention keys and values of earlier tokens so they are not recomputed at each step of autoregressive generation

W8A8: Quantization configuration with 8-bit Weights and 8-bit Activations

AWQ: Activation-aware Weight Quantization—a method that protects salient weights based on activation magnitude

GPTQ: Generative Pre-trained Transformer Quantization—a layer-wise quantization method using second-order information

SmoothQuant: A method that migrates quantization difficulty from activations to weights by smoothing activation outliers

FlatQuant: A quantization method that learns affine transformations to flatten weight and activation distributions before quantizing, targeting low-bit weight-activation settings

QuaRot: A quantization method using rotation matrices to suppress outliers in weights and activations

AIME-120: A difficult math benchmark consisting of 120 problems from the American Invitational Mathematics Examination

RL-based reasoning: Models that learn reasoning via Reinforcement Learning (e.g., QwQ) rather than just supervised distillation
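
The rotation idea behind the QuaRot entry above can be illustrated with a small orthogonal matrix. Multiplying activations by Q and weights by Q transposed leaves the layer output unchanged, while a single large outlier is spread across all channels. The sketch uses a normalized 4x4 Hadamard matrix and made-up data; it is a simplified illustration of the principle, not the QuaRot implementation.

```python
def matmul(A, B):
    """Plain nested-list matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(M):
    return [list(col) for col in zip(*M)]


# Normalized 4x4 Hadamard matrix: orthogonal, so Q times its transpose is the identity.
Q = [[0.5 * v for v in row] for row in (
    [1, 1, 1, 1],
    [1, -1, 1, -1],
    [1, 1, -1, -1],
    [1, -1, -1, 1],
)]

# Made-up activation row with one outlier, and a made-up 4x2 weight matrix.
x = [[0.1, 16.0, -0.2, 0.3]]
W = [[0.2, -0.1], [0.05, 0.3], [-0.4, 0.1], [0.25, -0.2]]

x_rot = matmul(x, Q)             # rotated activations
W_rot = matmul(transpose(Q), W)  # counter-rotated weights

print("max |x| before rotation:", max(abs(v) for v in x[0]))
print("max |x| after rotation: ", max(abs(v) for v in x_rot[0]))
print("original output:", matmul(x, W)[0])
print("rotated output: ", matmul(x_rot, W_rot)[0])
```

After rotation the largest activation magnitude is roughly halved in this example, so a uniform quantizer wastes fewer levels on a single extreme value.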