Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University,
Shanghai Innovation Institute, Institute of Artificial Intelligence (TeleAI), China Telecom,
University of Science and Technology of China (USTC)
arXiv
(2026)
Reasoning, Pretraining
📝 Paper Summary
Speculative Decoding, Efficient Large Language Model Inference, Parameter-Efficient Fine-Tuning (PEFT)
EDA efficiently adapts speculative draft models to fine-tuned target models by decoupling shared and private parameters and training on self-generated data filtered by representation shifts.
Core Problem
Speculative decoding fails when target models are fine-tuned because the draft model's output distribution no longer aligns with the altered target distribution, causing high rejection rates.
Why it matters:
Retraining a full draft model for every specific fine-tuned target model (e.g., Math, Code, Medicine) is prohibitively expensive and inefficient.
Standard speculative decoding loses its speedup benefits on domain-adapted models if the draft model isn't updated, bottlenecking deployment in latency-sensitive applications.
Existing methods assume a fixed target model, lacking mechanisms to efficiently transfer draft models across changing target distributions.
Concrete Example: A draft model trained for a base model (Qwen2.5-7B) works well with that base model, but when the target is fine-tuned for math (Qwen2.5-Math-7B), the draft model's average acceptance length drops significantly (e.g., from high alignment to near zero) because its predictions no longer match the target's math-specific token distribution.
Key Novelty
Efficient Draft Adaptation (EDA) Framework
Decouples the draft model into a frozen 'shared expert' (general knowledge) and a lightweight trainable 'private expert' (domain specific) to handle distribution shifts efficiently.
Regenerates training data using the fine-tuned target model itself (self-generation) to ensure the draft model learns to predict exactly what the target model would generate.
Selects training samples based on Mahalanobis distance in hidden states, prioritizing data where the target model deviates most from the general distribution.
Architecture
Overview of the EDA framework, illustrating the shared-private draft architecture and the data selection pipeline.
Evaluation Highlights
Achieves an average acceptance length of 4.79 when adapting a Qwen2.5-7B draft model to Qwen2.5-Math-7B, outperforming the baseline adaptation method (4.37).
Reduces training costs to just 60.8% of full draft model retraining while maintaining superior speculative performance.
Demonstrates effective cross-target transfer, restoring speculative speedups on fine-tuned models without the overhead of training monolithic draft models from scratch.
Breakthrough Assessment
7/10
A practical, resource-efficient solution for the growing problem of serving fine-tuned LLMs. Smartly combines architectural decoupling with data selection, though primarily an engineering optimization.
⚙️ Technical Details
Problem Definition
Setting: Aligning a draft model distribution P_theta_d to a shifted target model distribution P_theta_t (e.g., post-SFT) under a compute budget.
Inputs: Prefix sequence x_<t, fine-tuned target model theta_t.
Outputs: Draft model parameters theta_d optimized to maximize acceptance rate of tokens x_t by theta_t.
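For context, the link between distribution alignment and acceptance can be written out under the standard speculative sampling rule. This is the generic rule from the speculative decoding literature, not a formula taken from this summary; the paper's exact training objective is not stated here. The symbols reuse P_theta_d and P_theta_t from the setting above, and alpha is a name introduced only for this sketch.

```latex
% A drafted token x \sim P_{\theta_d}(\cdot \mid x_{<t}) is accepted with probability
\Pr[\text{accept } x \mid x_{<t}]
  = \min\!\left(1,\; \frac{P_{\theta_t}(x \mid x_{<t})}{P_{\theta_d}(x \mid x_{<t})}\right)

% Averaging over the draft's own samples gives the per-position acceptance rate
\alpha(x_{<t})
  = \sum_{x} \min\!\bigl(P_{\theta_d}(x \mid x_{<t}),\, P_{\theta_t}(x \mid x_{<t})\bigr)
  = 1 - \tfrac{1}{2} \sum_{x} \bigl| P_{\theta_d}(x \mid x_{<t}) - P_{\theta_t}(x \mid x_{<t}) \bigr|
```

Read this way, maximizing acceptance amounts to shrinking the total variation distance between the draft and the fine-tuned target at each prefix, which is why a draft aligned to the pre-fine-tuning target degrades once the target distribution shifts.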
Pipeline Flow
Data Self-Generation (Target Model creates training set)
Sample Selection (Filter data based on representation shift)
Draft Model Training (Update Private Expert only)
Inference (Speculative Decoding with Shared+Private Experts)
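The inference stage above can be illustrated with a minimal draft/verify loop. This is a sketch under simplifying assumptions, not the paper's implementation: verification is greedy exact-match, models are hypothetical callables that map a token prefix to the next token, and the toy draft and target below are invented purely to make the example runnable.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical interface: a "model" returns its greedy next token for a prefix.
NextToken = Callable[[Sequence[int]], int]


def speculative_step(prefix: Sequence[int], draft_next: NextToken,
                     target_next: NextToken, k: int) -> List[int]:
    """One draft/verify round; returns the tokens appended in this round."""
    # 1) The draft proposes k tokens autoregressively.
    proposal: List[int] = []
    for _ in range(k):
        proposal.append(draft_next(list(prefix) + proposal))

    # 2) The target checks each position (a real system batches this
    #    into a single forward pass of the target model).
    accepted: List[int] = []
    for tok in proposal:
        expected = target_next(list(prefix) + accepted)
        if tok != expected:
            accepted.append(expected)  # replace the first mismatch and stop
            return accepted
        accepted.append(tok)
    accepted.append(target_next(list(prefix) + accepted))  # bonus token
    return accepted


def generate(prefix: Sequence[int], draft_next: NextToken,
             target_next: NextToken, k: int,
             max_new: int) -> Tuple[List[int], List[int]]:
    """Run rounds until at least max_new tokens are produced."""
    out, rounds = list(prefix), []
    while len(out) - len(prefix) < max_new:
        new = speculative_step(out, draft_next, target_next, k)
        out.extend(new)
        rounds.append(len(new))
    return out, rounds


# Toy models (assumptions for illustration only).
def toy_target(p: Sequence[int]) -> int:
    return len(p) % 5


def toy_draft(p: Sequence[int]) -> int:
    # Agrees with the target except at every fourth position.
    return 9 if len(p) % 4 == 3 else len(p) % 5


tokens, rounds = generate([0], toy_draft, toy_target, k=4, max_new=8)
avg_tokens_per_round = sum(rounds) / len(rounds)
```

The output always matches what the target alone would generate; the draft only affects how many tokens each round yields. The per-round average computed here counts the target's correction or bonus token, which is one common convention for reporting acceptance length; whether the paper uses the same convention is not stated in this summary.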
System Modules
Shared-Private Gated Module
Replaces standard FFN in Transformer block; dynamically routes input to shared or private experts.
Model or implementation: Two-expert MoE layer (Shared MLP + Private MLP + Gating)
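A minimal NumPy sketch of such a two-expert layer follows. Several details are assumptions made for illustration, since the summary does not specify them: the gate is a softmax over two logits producing a soft mixture, the experts are two-layer ReLU MLPs, and the gate is treated as trainable alongside the private expert. The class and attribute names are hypothetical.

```python
import numpy as np


class SharedPrivateFFN:
    """Two-expert gated feed-forward block (forward pass only)."""

    def __init__(self, d_model: int, d_hidden: int, rng: np.random.Generator):
        def init_mlp():
            return {
                "w_in": rng.normal(0.0, 0.02, (d_model, d_hidden)),
                "w_out": rng.normal(0.0, 0.02, (d_hidden, d_model)),
            }

        self.shared = init_mlp()    # frozen: general knowledge
        self.private = init_mlp()   # trainable: domain-specific shift
        # Zero-initialized gate gives an even 50/50 mixture at the start.
        self.w_gate = np.zeros((d_model, 2))

    @staticmethod
    def _mlp(params, x):
        return np.maximum(x @ params["w_in"], 0.0) @ params["w_out"]

    def __call__(self, x: np.ndarray) -> np.ndarray:
        logits = x @ self.w_gate                            # (..., 2)
        logits = logits - logits.max(axis=-1, keepdims=True)
        gate = np.exp(logits)
        gate = gate / gate.sum(axis=-1, keepdims=True)      # softmax weights
        return (gate[..., :1] * self._mlp(self.shared, x)
                + gate[..., 1:] * self._mlp(self.private, x))

    def trainable_parameters(self):
        # Only the private expert and (by assumption) the gate are updated.
        return [self.private["w_in"], self.private["w_out"], self.w_gate]


rng = np.random.default_rng(0)
ffn = SharedPrivateFFN(d_model=8, d_hidden=16, rng=rng)
x = rng.normal(size=(2, 3, 8))   # (batch, sequence, d_model)
y = ffn(x)
```

Keeping the shared expert's weights out of `trainable_parameters()` is what would let one shared component be reused across several fine-tuned targets, with only a small private expert stored per domain.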
Data Selector
Selects training samples where target model behavior deviates from general distribution.
Model or implementation: Statistical filter (PCA + Mahalanobis distance)
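One way such a filter could work is sketched below in NumPy. The structure (PCA fitted on a general reference set, per-token Mahalanobis scores, a quantile to aggregate tokens into a sample score, then top-ranked selection) follows the description in this summary, but the function names, the number of PCA components, the quantile level rho, and the keep fraction are all placeholder assumptions, since the paper's values are not reported here.

```python
import numpy as np


def fit_reference(h_general: np.ndarray, n_components: int):
    """Fit PCA and Gaussian statistics on general-domain hidden states."""
    mean = h_general.mean(axis=0)
    centered = h_general - mean
    # PCA via SVD: rows of vt are the principal directions.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    proj = vt[:n_components].T                          # (d, n_components)
    z = centered @ proj
    cov = np.atleast_2d(np.cov(z, rowvar=False))
    cov = cov + 1e-6 * np.eye(n_components)             # numerical stability
    return mean, proj, np.linalg.inv(cov)


def mahalanobis_scores(h: np.ndarray, ref) -> np.ndarray:
    """Per-token distance from the reference distribution in PCA space."""
    mean, proj, cov_inv = ref
    z = (h - mean) @ proj
    return np.sqrt(np.einsum("ij,jk,ik->i", z, cov_inv, z))


def select_samples(samples_hidden, ref, rho: float, keep_fraction: float):
    """Rank samples by the rho-quantile of their token scores; keep the top."""
    scores = np.array([np.quantile(mahalanobis_scores(h, ref), rho)
                       for h in samples_hidden])
    n_keep = max(1, int(round(keep_fraction * len(scores))))
    return np.argsort(-scores)[:n_keep], scores


# Synthetic demo: sample 1 deviates from the reference, samples 0 and 2 do not.
rng = np.random.default_rng(0)
ref = fit_reference(rng.normal(size=(500, 16)), n_components=8)
samples = [
    rng.normal(size=(20, 16)),
    3.0 * rng.normal(size=(20, 16)) + 2.0,
    rng.normal(size=(20, 16)),
]
selected, scores = select_samples(samples, ref, rho=0.9, keep_fraction=1 / 3)
```

In this toy setup the selector keeps sample 1, whose hidden states are both shifted and rescaled relative to the reference. In practice the hidden states would come from the fine-tuned target model run on the self-generated data, with the reference statistics computed on D_general.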
Novel Architectural Elements
Replacement of standard FFN with a Shared-Private Gated Module specifically for draft model adaptation.
Target model completions (D_self) generated via autoregressive sampling
Filtered subset selected via Mahalanobis score ranking
Key Hyperparameters:
quantile_level_rho: Not reported in the paper
pca_dimensions: Not reported in the paper
Compute: 60.8% of the training cost of full retraining (relative cost reported)
Comparison to Prior Work
vs. Full Retraining: EDA updates fewer parameters and uses selected data.
vs. Standard PEFT (LoRA): EDA applies adaptation specifically to the draft model structure (Shared/Private decomposition) rather than the target model backbone.
vs. Monolithic Draft Models: EDA decouples general vs. specific capabilities, allowing reuse of the shared component.
Limitations
Depends on the availability of a base draft model aligned with the pre-fine-tuning target model.
Requires a self-generation step, which incurs inference cost on the target model before training.
Effectiveness of Mahalanobis selection depends on the quality of the general reference dataset D_general.
Code available at https://github.com/Lyn-Lucy/Efficient-Draft-Adaptation. The hyperparameters for PCA and quantile aggregation (rho) are defined conceptually, but their exact values are not given in the text. Base models are from the standard Qwen2.5 family.
📊 Experiments & Results
Evaluation Setup
Speculative decoding on fine-tuned tasks (Math, etc.)
Benchmarks:
Qwen2.5-Math-7B (Mathematical Reasoning)
Metrics:
Average Acceptance Length (tau)
Training Cost (relative to full retraining)
Statistical methodology: Not explicitly reported in the paper
Key Results
Benchmark | Metric | Baseline | This Paper | Δ
Qwen2.5-Math-7B | Average Acceptance Length | 4.37 | 4.79 | +0.42
Qwen2.5-Math-7B | Training Cost (%) | 100.0 | 60.8 | -39.2
Experiment Figures
Comparison of average acceptance lengths when using a base draft model on a base target vs. a fine-tuned target.
Main Takeaways
EDA successfully restores and improves average acceptance length for fine-tuned target models compared to naive reuse of base draft models.
The shared-private architecture allows for parameter-efficient adaptation, requiring updates to only a fraction of the parameters.
Self-generation and data selection are critical: training on target-generated data aligns objectives, and selecting high-deviation samples improves data efficiency.
📚 Prerequisite Knowledge
Prerequisites
Speculative Decoding (draft/verify paradigm)
Mixture of Experts (MoE) architectures
Parameter-Efficient Fine-Tuning (PEFT)
Principal Component Analysis (PCA)
Mahalanobis Distance
Key Terms
SFT: Supervised Fine-Tuning—adapting a pre-trained model to a specific task using labeled data.
Average acceptance length: The expected number of draft-generated tokens accepted by the target model per verification step (one forward pass of the target model).
Mahalanobis score: A statistical measure used here to quantify how much a token's hidden representation deviates from a general reference distribution.
Self-generation: Using the target model to generate its own training data (completions), ensuring the ground truth for the draft model matches the target's behavior.
Shared/Private Experts: A decomposed architecture where one MLP (Shared) is frozen and retains general knowledge, while another (Private) is trained to learn domain shifts.
EAGLE: A prior speculative decoding method that uses an extra decoder layer (or lightweight head) on top of the target model's features.