
SuperBPE: Space Travel for Language Models

A Liu, J Hayase, V Hofmann, S Oh, NA Smith, Y Choi
arXiv, March 2025
Pretraining Benchmark

📝 Paper Summary

Tags: LLM · Tokenization · Efficient Language Modeling
SuperBPE modifies the standard BPE algorithm to learn tokens that bridge whitespace, creating 'superwords' that improve encoding efficiency and downstream model performance compared to standard subword tokenization.
Core Problem
Standard subword tokenization (BPE) assumes tokens must be contained within word boundaries, but whitespace is an unreliable delimiter of meaning, preventing models from efficiently representing common multi-word expressions.
Why it matters:
  • Standard BPE hits diminishing returns as vocabulary size grows, adding rare subwords instead of useful multi-word units
  • Encoding text with more tokens than necessary increases computational costs for both training and inference
  • Limiting tokens to single words ignores linguistic reality where multi-word expressions (e.g., 'by the way') function as single semantic units
Concrete Example: In standard BPE, the phrase 'search engine' is split into two tokens ['search', ' engine']. A SuperBPE tokenizer can merge this frequent sequence into a single token 'search engine', reducing the sequence length and treating the concept as a single unit.
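To make the contrast concrete, here is a toy sketch (not the paper's implementation): a greedy longest-match encoder over a hand-written vocabulary, where adding the superword 'search engine' collapses the phrase into one token. The `encode` helper and both vocabularies are illustrative assumptions; real BPE inference replays learned merge rules rather than matching against a set.

```python
def encode(text, vocab):
    # Greedy longest-match segmentation against a fixed vocabulary.
    # (A simplification of real BPE inference, which replays merges;
    # it is enough to show the sequence-length effect.)
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            # Fall back to a single character when nothing matches.
            if text[i:j] in vocab or j == i + 1:
                tokens.append(text[i:j])
                i = j
                break
    return tokens

bpe_vocab = {"search", " engine"}            # word-bounded tokens only
super_vocab = bpe_vocab | {"search engine"}  # plus one superword

print(encode("search engine", bpe_vocab))    # ['search', ' engine']
print(encode("search engine", super_vocab))  # ['search engine']
```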
Key Novelty
Two-stage Pre-tokenization Curriculum for BPE (SuperBPE)
  • Stage 1: Run standard BPE with whitespace pre-tokenization enabled to learn basic subword units up to a transition point (e.g., 80k tokens)
  • Stage 2: Disable whitespace pre-tokenization and continue BPE training, allowing the algorithm to merge existing subwords across whitespace boundaries into 'superwords'
  • This curriculum ensures the model learns robust subwords first (avoiding suboptimal merges) before optimizing for multi-word efficiency
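The two-stage curriculum can be sketched in miniature (an illustrative toy, not the authors' code): stage 1 learns merges only over whitespace-split words; stage 2 re-segments the raw text, spaces included, and continues merging, so new pairs may bridge whitespace. All names here (`train_superbpe`, `t`, `total`) are hypothetical, and a real trainer would use byte-level alphabets and far larger merge budgets.

```python
from collections import Counter

def merge(seq, pair):
    # Replace each adjacent occurrence of `pair` in `seq` with one symbol.
    a, b = pair
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == (a, b):
            out.append(a + b); i += 2
        else:
            out.append(seq[i]); i += 1
    return tuple(out)

def best_pair(seqs):
    # Most frequent adjacent pair over frequency-weighted sequences.
    counts = Counter()
    for seq, freq in seqs:
        for p in zip(seq, seq[1:]):
            counts[p] += freq
    return max(counts, key=counts.get) if counts else None

def train_superbpe(text, t, total):
    """Toy two-stage BPE: `t` in-word merges (stage 1), then up to
    `total - t` merges over the raw text, which may cross whitespace."""
    merges = []
    # Stage 1: pre-tokenize on whitespace, so pairs never span spaces.
    seqs = [(tuple(w), f) for w, f in Counter(text.split()).items()]
    while len(merges) < t and (p := best_pair(seqs)) is not None:
        seqs = [(merge(s, f2 := p) and merge(s, p), f) for s, f in seqs] if False else [(merge(s, p), f) for s, f in seqs]
        merges.append(p)
    # Stage 2: re-segment the full text (spaces included) with the
    # stage-1 merges, then keep merging -- pairs may now bridge spaces.
    seq = tuple(text)
    for p in merges:
        seq = merge(seq, p)
    seqs = [(seq, 1)]
    while len(merges) < total and (p := best_pair(seqs)) is not None:
        seqs = [(merge(s, p), f) for s, f in seqs]
        merges.append(p)
    return merges, seqs[0][0]
```

On a corpus where a phrase recurs, stage 2 quickly promotes it to a single superword: with 10 stage-1 merges and a budget of 12, the repeated phrase 'search engine' becomes one token.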
Evaluation Highlights
  • +4.0% average improvement over BPE baseline across 30 downstream tasks for an 8B model trained from scratch
  • Encodes text with up to 33% fewer tokens than BPE (at 200k vocabulary size), reducing inference compute by 27%
  • +8.2% improvement on MMLU compared to the BPE baseline (8B scale)
Breakthrough Assessment
8/10
Simple, local modification to tokenization that yields significant efficiency and performance gains without architectural changes. Challenges the long-held subword dogma in LLMs.