โ† Back to Paper List

ConFu: Contemplate the Future for Better Speculative Sampling

Zongyue Qin, Raghavv Goel, Mukul Gagrani, Risheek Garrepalli, Mingu Lee, Yizhou Sun
University of California, Los Angeles, Qualcomm AI Research
arXiv (2026)
Reasoning Benchmark

๐Ÿ“ Paper Summary

Speculative Decoding · Efficient LLM Inference
ConFu improves speculative decoding by augmenting draft models with dynamic 'contemplate tokens' that capture the target model's future reasoning trajectory, reducing error accumulation during generation.
Core Problem
Existing draft models (like EAGLE) condition only on the current prefix, so small prediction errors accumulate step by step and the draft distribution drifts away from the target model's distribution over time.
Why it matters:
  • Drifting draft distributions cause high rejection rates during verification, negating the speedup benefits of speculative decoding.
  • Current methods miss the 'semantic trajectory' or high-level plan of the target model, focusing only on immediate next-token prediction.
Concrete Example: In EAGLE, the draft model might initially match the target's hidden states, but after several steps, small errors compound, causing the draft to generate text that diverges from the target's intended meaning (e.g., drifting off-topic), which is then rejected by the target model.
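The rejection mechanism referenced above is the standard speculative-sampling verification rule: each drafted token is accepted with probability min(1, p/q), and on rejection a replacement is drawn from the normalized residual distribution so the output still matches the target exactly. A minimal generic sketch (not ConFu's or EAGLE's actual code; distributions here are plain Python lists over a toy vocabulary):

```python
import random

def speculative_accept(p_target, p_draft, token, rng=random):
    # Accept the drafted token with probability min(1, p/q) --
    # the standard speculative-sampling verification rule.
    p, q = p_target[token], p_draft[token]
    return rng.random() < min(1.0, p / q)

def residual_sample(p_target, p_draft, rng=random):
    # On rejection, resample from the normalized residual max(0, p - q),
    # which keeps the overall output distribution equal to the target's.
    residual = [max(0.0, p - q) for p, q in zip(p_target, p_draft)]
    z = sum(residual)
    r, acc = rng.random() * z, 0.0
    for tok, w in enumerate(residual):
        acc += w
        if r < acc:
            return tok
    return len(residual) - 1
```

The drift problem is visible directly in min(1, p/q): as the draft distribution q diverges from the target p on the tokens the draft actually picks, acceptance probabilities fall and more draft work is thrown away.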
Key Novelty
Future-Aware Drafting via Dynamic Contemplate Tokens
  • Introduces 'contemplate tokens' (special tokens processed in parallel) that extract the target model's 'thought' or future plan.
  • Uses a Mixture-of-Experts (MoE) mechanism to make these contemplate tokens dynamic, selecting specific expert embeddings based on the current context.
  • Feeds this future representation into the draft model as an auxiliary input to guide generation along the target's planned trajectory.
Evaluation Highlights
  • Improves token acceptance rates and generation speed by 8-11% over EAGLE-3 (state-of-the-art) on Llama-3 3B/8B models.
  • Achieves consistent speedups across diverse tasks including coding, math, and summarization on SpecBench.
  • Demonstrates robustness across different sampling temperatures and computation budgets compared to baselines.
Breakthrough Assessment
7/10
Solid advancement in speculative decoding by integrating latent reasoning concepts. It meaningfully improves upon the SOTA (EAGLE-3), though it is an enhancement of existing architectures rather than a complete paradigm shift.