
How Instruction and Reasoning Data shape Post-Training: Data Quality through the Lens of Layer-wise Gradients

Ming Li, Yanhong Li, Ziyue Li, Tianyi Zhou
University of Maryland, University of Chicago
arXiv.org (2025)
Tags: Reasoning · Benchmark · RL

📝 Paper Summary

Tags: LLM Post-Training Analysis · Data Quality Evaluation · Training Dynamics
This paper reveals that high-quality instruction data and reasoning data induce gradients with lower magnitudes (nuclear norm) but higher directional diversity (effective rank) during fine-tuning, unifying disparate data quality metrics through spectral analysis of layer-wise gradients.
Core Problem
While data quality is known to be crucial for LLM post-training, the underlying mechanism of how different data qualities and types (e.g., instruction vs. reasoning) affect training dynamics and gradients remains largely unexplored.
Why it matters:
  • Current data selection metrics (like IFD or Reward) are treated as black-box preprocessing steps without understanding their impact on model optimization
  • There is a lack of systematic comparison between the learning dynamics of general instruction-following data and complex reasoning data
  • Understanding gradient behaviors could lead to more stable and efficient data synthesis and selection strategies
Concrete Example: When training on simple data, gradients may have a large magnitude (high nuclear norm) but be concentrated in only a few directions (low effective rank). In contrast, the paper finds that complex reasoning data (like s1.1) induces gradients with smaller magnitudes but much higher effective ranks, implying the model updates its parameters in a more diverse and structurally rich way.
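This contrast can be illustrated with a toy sketch (synthetic matrices, not the paper's actual gradients): a "simple data" gradient that is large but rank-1, versus a "reasoning data" gradient that is small but spread across many directions.

```python
import numpy as np

np.random.seed(0)

# Toy stand-ins for layer-wise gradients (illustrative only):
# "simple data": large magnitude concentrated in a single direction (rank 1).
simple = 5.0 * np.outer(np.random.randn(16), np.random.randn(16))
# "reasoning data": smaller magnitude spread over many directions (full rank).
reasoning = 0.2 * np.random.randn(16, 16)

for name, g in [("simple", simple), ("reasoning", reasoning)]:
    s = np.linalg.svd(g, compute_uv=False)  # singular values of the gradient
    print(f"{name:9s} nuclear norm = {s.sum():6.2f}  "
          f"rank = {np.linalg.matrix_rank(g)}")
```

The rank-1 matrix has the larger nuclear norm (sum of singular values) but only one update direction, matching the low-quality-data regime the paper describes.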
Key Novelty
Spectral Unification of Data Quality Metrics
  • Applies Singular Value Decomposition (SVD) to layer-wise gradients to define metrics like Nuclear Norm (magnitude) and Effective Rank (diversity)
  • Demonstrates that disparate data quality metrics (IFD, InsTag, Difficulty, Reward) all map to consistent spectral properties: high quality corresponds to low nuclear norm and high effective rank
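The two spectral metrics can be sketched directly from an SVD of a layer's gradient matrix. This is a minimal reconstruction of the standard definitions (nuclear norm as the sum of singular values; effective rank as the exponential of the entropy of the normalized spectrum), not the paper's released code:

```python
import numpy as np

def gradient_spectral_metrics(grad: np.ndarray) -> tuple[float, float]:
    """Return (nuclear norm, effective rank) of a gradient matrix.

    Nuclear norm (sum of singular values) measures gradient magnitude.
    Effective rank, exp of the entropy of the normalized singular-value
    distribution, measures how many directions the gradient spans.
    """
    s = np.linalg.svd(grad, compute_uv=False)
    nuclear_norm = float(s.sum())
    p = s / s.sum()                            # spectrum as a distribution
    entropy = float(-(p * np.log(p + 1e-12)).sum())
    effective_rank = float(np.exp(entropy))
    return nuclear_norm, effective_rank

# Rank-1 matrix: magnitude concentrated in one direction.
rank1 = 10.0 * np.outer(np.ones(8), np.ones(8)) / 8.0
# Scaled identity: smaller magnitude spread evenly over all 8 directions.
diverse = 0.1 * np.eye(8)

print(gradient_spectral_metrics(rank1))    # nuclear norm 10, effective rank ≈ 1
print(gradient_spectral_metrics(diverse))  # nuclear norm 0.8, effective rank ≈ 8
```

Under this reading, "high quality" data is data whose first output looks like the second line: a modest nuclear norm paired with an effective rank close to the full dimension.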
Evaluation Highlights
  • Analyzed gradients across 4 diverse model families (Qwen2.5, Llama3.1, Llama3.2, Gemma2) ranging from 1.5B to 14B parameters
  • Identified that reasoning data (s1.1) achieves substantially higher effective ranks than standard instruction data, correlating reasoning complexity with gradient diversity
  • Established that effective rank is a more robust indicator of data quality than gradient magnitude (nuclear norm), distinguishing subtle differences in complex tasks
Breakthrough Assessment
7/10
Provides a novel theoretical lens (spectral analysis) to explain empirical data quality metrics. While it doesn't propose a new model, it offers significant insights into *why* certain data works better, unifying disjoint metrics.