
Insertion Language Models: Sequence Generation with Arbitrary-Position Insertions

D Patel, A Sahoo, A Amballa, T Naseem, TGJ Rudner…
arXiv, May 2025
Pretraining Reasoning Benchmark

📝 Paper Summary

Non-autoregressive generation · Sequence modeling · Diffusion models for text
Insertion Language Models (ILMs) generate sequences by inserting tokens at arbitrary positions using a denoising objective that predicts missing tokens and their locations simultaneously, enabling flexible, out-of-order generation without fixed-length constraints.
Core Problem
Autoregressive models (ARMs) struggle with planning and constraints due to rigid left-to-right generation, while Masked Diffusion Models (MDMs) suffer from incoherence due to simultaneous unmasking and cannot handle arbitrary-length infilling because mask counts are fixed.
Why it matters:
  • Rigid left-to-right generation fails on tasks requiring lookahead or complex constraint satisfaction (e.g., Zebra puzzles)
  • MDMs' reliance on a fixed number of mask tokens makes them unsuitable for tasks where the length of the missing text is unknown (e.g., 'The conference, was postponed' vs 'The conference, originally planned for March, was postponed')
  • Simultaneous unmasking in MDMs breaks token dependencies, producing incoherent outputs like 'The chef added sugar to the dessert to make it healthier'
Concrete Example: In the sentence 'The chef added <mask1> to the dessert to make it <mask2>', an MDM might simultaneously predict 'sugar' and 'healthier', creating a contradiction. An ILM inserts tokens sequentially (e.g., 'added sugar... to make it sweeter'), preserving coherence.
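A minimal sketch of this sequential-insertion idea (the model, `score_fn`, and the scripted stand-in below are illustrative assumptions, not the paper's architecture): at each step a single joint choice of gap position and token is made, and the token is inserted into the growing sequence, so every later insertion conditions on all earlier ones.

```python
def generate_ilm(score_fn, max_steps=32, stop_token="<eos>"):
    """ILM-style decoding sketch: score_fn jointly picks a (gap, token)
    pair, where gap g means 'insert between seq[g-1] and seq[g]'.
    Length is not fixed in advance; generation stops when the model
    emits a stop signal."""
    seq = []
    for _ in range(max_steps):
        gap, token = score_fn(seq)   # joint choice of position and token
        if token == stop_token:
            break
        seq.insert(gap, token)
    return seq

# Hypothetical stand-in for a trained model: builds a sentence out of order.
steps = iter([(0, "chef"), (0, "The"), (2, "sugar"), (2, "added"), (0, "<eos>")])
def scripted_model(seq):
    return next(steps)

print(generate_ilm(scripted_model))  # ['The', 'chef', 'added', 'sugar']
```

Because the chosen gap index is part of the prediction, the model is free to realize "chef" before "The" and still produce a left-to-right-coherent sentence, which is exactly the flexibility ARMs lack.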
Key Novelty
Insertion Language Models (ILMs)
  • Models the generation process as inserting tokens into a sequence one by one, rather than appending to the end (ARMs) or replacing fixed masks (MDMs)
  • Learns a joint distribution over both the vocabulary item to insert AND the position to insert it, allowing the model to dynamically determine sequence length
  • Uses a biased denoising objective that trains on normalized counts of dropped tokens between visible tokens, avoiding high-variance Monte Carlo estimates
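The last bullet can be made concrete with a toy construction of one training target (an assumed form for illustration, not the paper's exact recipe; in practice the dropped indices would be sampled randomly rather than passed in): drop some tokens from a full sequence, keep the visible subsequence, and record the normalized count of dropped tokens in each gap between visible tokens, including both ends.

```python
def drop_and_count(tokens, dropped):
    """Build one illustrative ILM training pair.

    tokens  -- the full token sequence
    dropped -- indices of tokens to drop (sampled randomly in practice)
    Returns the visible subsequence and, for each gap between visible
    tokens (plus the two ends), the dropped-token count normalized by
    the total number of dropped tokens.
    """
    visible, gap_counts = [], [0]
    for i, tok in enumerate(tokens):
        if i in dropped:
            gap_counts[-1] += 1      # one more dropped token in the current gap
        else:
            visible.append(tok)
            gap_counts.append(0)     # open a new gap after this visible token
    total = sum(gap_counts)
    targets = [c / total for c in gap_counts] if total else gap_counts
    return visible, targets

tokens = ["The", "chef", "added", "sugar", "to", "the", "dessert"]
visible, targets = drop_and_count(tokens, dropped={2, 3})
print(visible)   # ['The', 'chef', 'to', 'the', 'dessert']
print(targets)   # [0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
```

Supervising on these deterministic per-gap counts gives the model a direct signal about *where* and *how much* is missing, which is the stated alternative to high-variance Monte Carlo estimates.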
Architecture
Figure 1: Conceptual comparison of ARM, MDM, and ILM generation processes.
Evaluation Highlights
  • Achieves 90% sequence accuracy on Zebra Puzzles, significantly outperforming MDMs (55%) and ARMs (40%) on this constraint satisfaction task
  • Maintains 100% accuracy on 'Starhard' variable-length path planning graphs, whereas MDMs drop to 21% accuracy due to fixed-length constraints
  • Outperforms MDMs on unconditional text generation (LM1B dataset) with lower perplexity (3.92 vs 4.05) while matching ARMs in linguistic balance
Breakthrough Assessment
8/10
Strongly addresses the limitations of both ARMs (rigid generation order) and MDMs (fixed length, incoherence) with a novel insertion-based formulation. Empirical results on planning and infilling are compelling.