Efficient numeracy in language models through single-token number embeddings

📝 Paper Summary

LLM Numeracy Tokenization Strategies

BitTokens encodes numbers into single tokens using their binary IEEE 754 floating-point representation, enabling language models to learn arithmetic algorithms efficiently without extensive reasoning chains.

Core Problem

Current LLMs struggle with basic calculations unless they use excessive reasoning tokens or external tools, and existing single-token encodings (like xVal or FoNE) fail to support efficient learning of arithmetic operations like multiplication.

Why it matters:

Scientific and engineering tasks require processing large amounts of numerical data where external tool use introduces latency and complexity
Reasoning chains are inefficient, requiring thousands of tokens for single calculations (e.g., 5k-30k tokens), limiting the complexity of solvable problems within context windows
Existing sinusoidal encodings make multiplication computationally intensive and prone to errors because they require non-local decoding and re-encoding

Concrete Example: Frontier models like Llama-3.1-405B require thousands of reasoning tokens to solve basic multiplication or division. Meanwhile, sinusoidal encodings (FoNE) fail to learn multiplication because the operation requires a complex convolution in the frequency domain, whereas adding two numbers reduces to a simple component-wise multiplication of their encodings.
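The contrast between addition and multiplication under a sinusoidal encoding can be illustrated with a toy example. The sketch below is not FoNE's implementation: the `encode` helper and the `FREQS` values are hypothetical, chosen only to show that encodings of `a` and `b` combine component-wise into the encoding of `a + b`, while no such shortcut produces the encoding of `a * b`.

```python
import cmath

# Hypothetical toy frequencies; FoNE's actual periods are not reproduced here.
FREQS = [1.0, 0.1, 0.01]

def encode(x: float) -> list[complex]:
    """Toy sinusoidal encoding: one unit-circle phasor per frequency."""
    return [cmath.exp(1j * w * x) for w in FREQS]

def add_in_code_space(ea: list[complex], eb: list[complex]) -> list[complex]:
    """Encoding of a + b: multiply the phasors component-wise."""
    return [p * q for p, q in zip(ea, eb)]

a, b = 3.0, 4.5
combined = add_in_code_space(encode(a), encode(b))

# The component-wise product matches encode(a + b) up to rounding error.
max_add_err = max(abs(c - d) for c, d in zip(combined, encode(a + b)))

# The same product is far from encode(a * b); multiplication has no
# comparable component-wise rule in this representation.
mul_err = max(abs(c - d) for c, d in zip(combined, encode(a * b)))
```

Here `max_add_err` is at floating-point noise level, while `mul_err` is not.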

Key Novelty

BitTokens (IEEE 754 Binary Encoding)

Encodes numbers as a sequence of bits (sign, exponent, significand) added to a learnable [NUM] token, rather than splitting numbers into digit sequences
Leverages the hierarchical structure of IEEE 754 (logarithmic exponent, linear significand), which aligns with the efficient binary arithmetic algorithms already used in hardware
Enables arithmetic operations to be learned via bit-wise logic gates (XOR, AND) rather than complex frequency convolutions required by sinusoidal methods
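To see why the IEEE 754 layout suits multiplication, the following sketch decomposes two float64 values and checks how the fields of their product relate to the fields of the operands. It is an illustration of the hardware-style algorithm for normal numbers, not the computation the trained model performs; the `fields` helper is a name introduced here.

```python
import struct

BIAS = 1023  # float64 exponent bias

def fields(x: float) -> tuple[int, int, int]:
    """Split a normal float64 into (sign, biased exponent, 52-bit fraction)."""
    (raw,) = struct.unpack(">Q", struct.pack(">d", x))
    return raw >> 63, (raw >> 52) & 0x7FF, raw & ((1 << 52) - 1)

a, b = -6.5, 3.25
sa, ea, fa = fields(a)
sb, eb, fb = fields(b)
sp, ep, fp = fields(a * b)

# Sign of the product is the XOR of the operand signs.
sign_ok = sp == (sa ^ sb)

# Unbiased exponents add; normalizing the significand carries 0 or 1.
carry = (ep - BIAS) - ((ea - BIAS) + (eb - BIAS))

# Significands (with the implicit leading 1) multiply as integers.
# Exact here because the product fits in 53 bits; in general it is rounded.
sig_product = (((1 << 52) | fa) * ((1 << 52) | fb)) >> (52 + carry)
sig_ok = sig_product == ((1 << 52) | fp)
```

For these operands the significand product is at least 2, so `carry` is 1 and the exponent of the result is one more than the sum of the operand exponents.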

Architecture

Conceptual flow of BitToken construction and integration

Evaluation Highlights

Achieves near-perfect performance on addition, multiplication, and division tasks with a small nanoGPT-2 model, significantly outperforming xVal and FoNE
BitTokens outperforms standard single-digit and subword tokenizers on single-step calculation tasks while using only one token per number
Demonstrates that frontier LLMs (without BitTokens) require 5,000 to 30,000 reasoning tokens to solve a single calculation effectively

Breakthrough Assessment

8/10

Proposes a fundamental shift in how numbers are represented in LLMs, solving the 'arithmetic gap' for single-token representations where previous methods like xVal and FoNE failed on multiplication.

⚙️ Technical Details

Problem Definition

Setting: Sequence modeling where inputs mix text and numerical data requiring arithmetic manipulation

Inputs: Text sequences containing numbers n in the range (-10^15, 10^15), represented at double precision (float64)

Outputs: Predicted text sequence and numerical results encoded as single tokens

Pipeline Flow

Input Processing (Text tokenization + Number detection)
Number Encoding (BitTokens generation)
Model Processing (Transformer layers)
Output Decoding (Text head + Number head)
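The first pipeline step can be sketched as a pre-tokenization pass that pulls numeric literals out of the text. This is a minimal illustration under stated assumptions: the regular expression, the `split_numbers` helper, and the literal `[NUM]` placeholder string are all hypothetical, and the paper's actual number detection may differ (for example in how it treats signs or thousands separators).

```python
import re

# Assumed pattern: optional sign, digits, optional fraction, optional exponent.
NUM_PATTERN = re.compile(r"[-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?")

def split_numbers(text: str) -> tuple[str, list[float]]:
    """Replace each numeric literal with a [NUM] placeholder.

    Returns the masked text and the parsed values in order of appearance.
    """
    values: list[float] = []

    def _swap(match: re.Match) -> str:
        values.append(float(match.group()))
        return "[NUM]"

    return NUM_PATTERN.sub(_swap, text), values

masked, values = split_numbers("Compute 12.5 times -3 please")
# masked == "Compute [NUM] times [NUM] please"
# values == [12.5, -3.0]
```

The masked text would then go through the ordinary text tokenizer, while each value in `values` is handed to the number encoder.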

System Modules

BitToken Encoder

Converts raw numbers into a 64-dimensional binary vector based on IEEE 754 format (sign, exponent, significand)

Model or implementation: Deterministic Algorithm
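A minimal sketch of the deterministic conversion, using Python's standard `struct` module to reinterpret a float64 as 64 bits. The helper names and the most-significant-bit-first ordering are assumptions made for illustration; the paper's encoder may order or post-process the bits differently before adding them to the [NUM] embedding.

```python
import struct

def float_to_bits(x: float) -> list[int]:
    """Return the 64 IEEE 754 bits of x, most significant bit first."""
    (raw,) = struct.unpack(">Q", struct.pack(">d", x))
    return [(raw >> (63 - i)) & 1 for i in range(64)]

def bits_to_float(bits: list[int]) -> float:
    """Inverse of float_to_bits."""
    raw = 0
    for bit in bits:
        raw = (raw << 1) | bit
    return struct.unpack(">d", struct.pack(">Q", raw))[0]

bits = float_to_bits(-2.5)
sign = bits[0]            # 1 bit
exponent = bits[1:12]     # 11 bits, biased by 1023
significand = bits[12:]   # 52 fraction bits (implicit leading 1 not stored)

# -2.5 = -1.25 * 2**1, so the sign bit is 1, the biased exponent is 1024,
# and the fraction starts with the bits 0, 1 (0.25).
roundtrip = bits_to_float(bits)
```

Because the mapping is a lossless reinterpretation, `bits_to_float(float_to_bits(x))` returns `x` exactly for any finite float64.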

Language Model

Processes the sequence of text and number embeddings

Model or implementation: nanoGPT-2

Number Head

Predicts the binary vector for the next number when a [NUM] token is predicted

Model or implementation: Linear layer + Sigmoid
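The decoding side can be sketched in the same spirit. The example below assumes the linear layer has already produced 64 logits, one per bit, and that bits are recovered by thresholding the sigmoid output at 0.5; both the threshold and the `decode_number` helper are assumptions for illustration rather than details confirmed by the paper.

```python
import math
import struct

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def decode_number(logits: list[float]) -> float:
    """Threshold 64 per-bit probabilities at 0.5 and reinterpret as float64."""
    if len(logits) != 64:
        raise ValueError("expected one logit per float64 bit")
    raw = 0
    for z in logits:
        raw = (raw << 1) | (1 if sigmoid(z) >= 0.5 else 0)
    return struct.unpack(">d", struct.pack(">Q", raw))[0]

# Hypothetical logits that confidently spell out the bit pattern of 1.0.
target_bits = format(struct.unpack(">Q", struct.pack(">d", 1.0))[0], "064b")
logits = [8.0 if b == "1" else -8.0 for b in target_bits]
value = decode_number(logits)  # 1.0
```

All 64 bits are read out in a single step, which is what lets the model emit a full-precision number with one token.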

Novel Architectural Elements

Hybrid tokenization path where numbers bypass the standard vocabulary and are injected as structured bit-vectors added to a learnable base token
Specialized 'Number Head' that decodes 64+ bits in parallel rather than predicting tokens from a vocabulary

Modeling

Base Model: nanoGPT-2

Training Method: Supervised learning with curriculum learning

Objective Functions:

Purpose: Train the model to correctly predict the bits of the number.

Formally: Bit-wise Binary Cross Entropy (BCE) loss with equal weighting for all bits.

Purpose: Standard language modeling for text.

Formally: Cross-entropy loss on text tokens.
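The bit-wise objective can be written out as a short sketch. It assumes the loss is averaged over bits with equal weight, as stated above; the `bitwise_bce` name, the clamping constant, and the choice of mean rather than sum are assumptions. How this term is combined with the text cross-entropy is not specified in this summary and is not shown.

```python
import math

def bitwise_bce(probs: list[float], target_bits: list[int],
                eps: float = 1e-12) -> float:
    """Mean binary cross-entropy over bits, every bit weighted equally."""
    total = 0.0
    for p, t in zip(probs, target_bits):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(target_bits)

# Toy 3-bit example; a real target would have 64 bits.
loss_good = bitwise_bce([0.9, 0.1, 0.8], [1, 0, 1])
loss_bad = bitwise_bce([0.1, 0.9, 0.2], [1, 0, 1])
```

Predictions close to the target bits give a small loss (`loss_good`), and confidently wrong bits are penalized heavily (`loss_bad`), regardless of whether the bit sits in the sign, exponent, or significand.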

Adaptation: Full training from scratch

Training Data:

30M unique training samples for numeracy tasks
FineWeb 10B dataset for text capabilities

Key Hyperparameters:

number_sampling_range: (-10^15, 10^15)
precision: double-precision (float64)
curriculum_strategy: Dynamic task balancing based on difficulty metrics

Compute: Not reported in the paper

Comparison to Prior Work

vs. xVal: BitTokens supports high precision and large range (-10^15 to 10^15) without restrictive scaling
vs. FoNE: BitTokens enables learning of multiplication and division via bitwise logic, whereas FoNE requires complex de-convolution
vs. Digit/Subword: BitTokens uses exactly one token per number regardless of magnitude or precision, offering far higher efficiency

Limitations

BitTokens struggles with multi-step operations like Mean and Standard Deviation compared to multi-token reasoning methods, which allow iterative refinement.
The method requires a dedicated number head and a modified loss function, so it needs architecture changes that standard tokenizers do not.
Requires numbers to be parsed/identified before tokenization to apply the special encoding.

Reproducibility

Code: https://github.com/AnonymousAuthor553/BitTokens

The paper details the exact IEEE 754 conversion process and the loss function (BCE). Hyperparameters for nanoGPT-2 are referenced as standard.

📊 Experiments & Results

Evaluation Setup

Evaluation of numeracy on 9 tasks ranging from comparison to complex arithmetic using both frontier LLMs and trained nanoGPT models.

Benchmarks:

Custom Numeracy Benchmark (Arithmetic and Comparison) [New]

Metrics:

log-sMAPE (Logarithmic Symmetric Mean Absolute Percentage Error)
Exact Match Accuracy
Statistical methodology: Not explicitly reported in the paper
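For orientation, plain sMAPE can be computed as in the sketch below. This is only the common textbook form with range [0, 2], where 0 is a perfect prediction. The logarithmic variant used in the paper is not defined in this summary, and the reported scores appear to be rescaled so that higher is better, which this sketch does not attempt to reproduce.

```python
def smape(preds: list[float], targets: list[float]) -> float:
    """Plain sMAPE in [0, 2]; 0 means every prediction is exact."""
    total = 0.0
    for p, t in zip(preds, targets):
        denom = (abs(p) + abs(t)) / 2.0
        # Treat the 0-vs-0 case as zero error instead of dividing by zero.
        total += 0.0 if denom == 0.0 else abs(p - t) / denom
    return total / len(targets)

exact = smape([10.0, -4.0], [10.0, -4.0])  # 0.0
off = smape([110.0], [100.0])              # 10 / 105
```

Because the error is divided by the average magnitude of prediction and target, the metric stays comparable across the wide numeric range used in the benchmark.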

Key Results

Performance of small nanoGPT-2 models on basic arithmetic tasks using different tokenization strategies.
Benchmark	Metric	Baseline	This Paper	Δ
Custom Numeracy Benchmark	log-sMAPE	0.1	1.0	+0.9
Custom Numeracy Benchmark	log-sMAPE	0.05	1.0	+0.95
Custom Numeracy Benchmark	log-sMAPE	0.98	1.0	+0.02

Frontier LLM analysis showing the dependency on reasoning tokens.
Benchmark	Metric	Baseline	This Paper	Δ
Custom Numeracy Benchmark	log-sMAPE	0.60	0.60	0.00

Experiment Figures

Radar charts comparing log-sMAPE performance of BitTokens vs. baselines (xVal, FoNE, Single Digit, Subword) across 9 tasks.

Scatter plot of log-sMAPE vs. Reasoning Tokens for frontier models.

Main Takeaways

Frontier LLMs rely heavily on thousands of reasoning tokens to solve basic calculations; without them, they fail at multiplication and division.
BitTokens enables small models to learn addition, multiplication, and division near-perfectly using only a single token per number, which previous single-token methods (xVal, FoNE) could not do.
Sinusoidal embeddings (FoNE) are mathematically ill-suited for multiplication in neural networks due to the complexity of convolution in the frequency domain.
While BitTokens excels at single-step arithmetic, multi-token strategies (standard tokenizers) still have an advantage in multi-step problems (like Mean/Std Dev) because they can perform iterative computation over multiple forward passes.

📚 Prerequisite Knowledge

Prerequisites

Understanding of LLM tokenization (BPE, digit splitting)
IEEE 754 floating-point standard (sign, exponent, mantissa/significand)
Basic arithmetic logic (carry propagation, bitwise operations)
Fourier/Sinusoidal embeddings

Key Terms

IEEE 754: The technical standard for floating-point arithmetic, representing numbers using three parts: a sign bit, an exponent, and a significand (fraction)

Significand: The part of a floating-point number that contains the significant digits (also called the mantissa)

xVal: A method that represents numbers by scaling a single learnable token embedding by the number's value

FoNE: Fourier Number Embedding—a method encoding numbers using sinusoidal functions (sines and cosines) of different frequencies

sMAPE: Symmetric Mean Absolute Percentage Error—a metric for measuring prediction accuracy relative to the magnitude of the target

nanoGPT: A small-scale implementation of the GPT architecture used for efficient experimentation

Reasoning Chain: The process where an LLM generates intermediate text steps (like 'Chain of Thought') to solve a problem, rather than outputting the answer immediately

RMSNorm: Root Mean Square Normalization—a technique used in neural networks to stabilize training by normalizing the input vector

BCE Loss: Binary Cross Entropy Loss—a loss function used here to train the model to predict each bit of the floating-point representation correctly