โ† Back to Paper List

ConFu: Contemplate the Future for Better Speculative Sampling

Zongyue Qin, Raghavv Goel, Mukul Gagrani, Risheek Garrepalli, Mingu Lee, Yizhou Sun
University of California, Los Angeles, Qualcomm AI Research
arXiv (2026)
Reasoning Benchmark

๐Ÿ“ Paper Summary

Speculative Decoding · Efficient LLM Inference
ConFu improves speculative decoding by augmenting draft models with dynamic 'contemplate tokens' that capture the target model's future reasoning trajectory, reducing error accumulation during generation.
Core Problem
Existing draft models (like EAGLE) condition only on the current prefix, so small prediction errors accumulate step by step and the draft distribution drifts away from the target model's distribution over time.
Why it matters:
  • Drifting draft distributions cause high rejection rates during verification, negating the speedup benefits of speculative decoding.
  • Current methods miss the 'semantic trajectory' or high-level plan of the target model, focusing only on immediate next-token prediction.
Concrete Example: In EAGLE, the draft model might initially match the target's hidden states, but after several steps, small errors compound, causing the draft to generate text that diverges from the target's intended meaning (e.g., drifting off-topic), which is then rejected by the target model.
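The rejection mechanism referenced above is the standard speculative-sampling verification rule: each drafted token is accepted with probability min(1, p/q), and on rejection a replacement is drawn from the normalized residual distribution so the output still matches the target exactly. A minimal generic sketch (not ConFu's or EAGLE's actual code; distributions here are plain Python lists over a toy vocabulary):

```python
import random

def speculative_accept(p_target, p_draft, token, rng=random):
    # Accept the drafted token with probability min(1, p/q) --
    # the standard speculative-sampling verification rule.
    p, q = p_target[token], p_draft[token]
    return rng.random() < min(1.0, p / q)

def residual_sample(p_target, p_draft, rng=random):
    # On rejection, resample from the normalized residual max(0, p - q),
    # which keeps the overall output distribution equal to the target's.
    residual = [max(0.0, p - q) for p, q in zip(p_target, p_draft)]
    z = sum(residual)
    r, acc = rng.random() * z, 0.0
    for tok, w in enumerate(residual):
        acc += w
        if r < acc:
            return tok
    return len(residual) - 1
```

The drift problem is visible directly in min(1, p/q): as the draft distribution q diverges from the target p on the tokens the draft actually picks, acceptance probabilities fall and more draft work is thrown away.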
Key Novelty
Future-Aware Drafting via Dynamic Contemplate Tokens
  • Introduces 'contemplate tokens' (special tokens processed in parallel) that extract the target model's 'thought' or future plan.
  • Uses a Mixture-of-Experts (MoE) mechanism to make these contemplate tokens dynamic, selecting specific expert embeddings based on the current context.
  • Feeds this future representation into the draft model as an auxiliary input to guide generation along the target's planned trajectory.
Evaluation Highlights
  • Improves token acceptance rates and generation speed by 8-11% over EAGLE-3 (state-of-the-art) on Llama-3 3B/8B models.
  • Achieves consistent speedups across diverse tasks including coding, math, and summarization on SpecBench.
  • Demonstrates robustness across different sampling temperatures and computation budgets compared to baselines.
Breakthrough Assessment
7/10
Solid advancement in speculative decoding by integrating latent reasoning concepts. It meaningfully improves upon the SOTA (EAGLE-3), though it is an enhancement of existing architectures rather than a complete paradigm shift.