Scalable Language Models with Posterior Inference of Latent Thought Vectors

📝 Paper Summary

Language Modeling · Latent Variable Models · Reasoning/Chain-of-Thought

Latent Thought Models introduce explicit latent vectors that guide token generation, enabling a dual-rate learning process where inference-time computation improves performance without increasing model parameters.

Core Problem

Traditional Large Language Models (LLMs) rely on massive model sizes and data scaling for performance, but data availability is becoming a bottleneck and standard auto-regressive models lack explicit internal reasoning states.

Why it matters:

Scaling laws require exponentially more data/compute for marginal gains, hitting data scarcity walls
Current LLMs lack a separation between 'fast' episodic learning (inference-time adaptation) and 'slow' schematic learning (weight updates)
Inference-time compute is an under-utilized dimension for scaling performance compared to just adding more parameters

Concrete Example: In standard GPT training, the model must predict each next token directly from the preceding context. In LTMs, the model first performs 'fast learning' (optimization steps) to find the best latent vector z for the current sequence, essentially 'thinking' before generating. This allows a small 76M-parameter model to match the perplexity of the 774M-parameter GPT-2 Large.

Key Novelty

Latent Thought Models (LTMs) with Dual-Rate Optimization

Introduces 'latent thought vectors' (z) that act as abstract, structured representations of a sequence, conditioning the generation of every token
Uses a dual-rate optimization: 'fast learning' (inference-time optimization of z per sequence) and 'slow learning' (standard gradient updates for global model weights)
Treats inference steps as a scaling dimension: performing more optimization steps on z during inference improves results without retraining the main model
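The dual-rate idea can be illustrated on a toy problem. The sketch below is not the paper's Transformer model: it assumes a scalar linear-Gaussian model, p(z) = N(0, 1) and p(x|z) = N(w·z, 1), chosen only because its ELBO and gradients have closed forms. The function names (`elbo`, `fast_inference`, `slow_step`) and the slow learning rate are hypothetical; the 16 fast steps and the fast rate of 0.3 mirror the hyperparameters listed in this summary.

```python
import math
import random

def elbo(x, w, mu, log_sigma):
    """Closed-form ELBO for the toy model p(z)=N(0,1), p(x|z)=N(w*z, 1),
    with variational posterior q(z|x)=N(mu, sigma^2)."""
    var = math.exp(2.0 * log_sigma)
    recon = -0.5 * ((x - w * mu) ** 2 + w * w * var) - 0.5 * math.log(2.0 * math.pi)
    kl = 0.5 * (mu * mu + var - 1.0) - log_sigma
    return recon - kl

def fast_inference(x, w, steps=16, lr=0.3):
    """Fast learning: gradient ascent on the ELBO over the local (mu, log_sigma)."""
    mu, log_sigma = 0.0, 0.0  # initialize q at the prior N(0, 1)
    for _ in range(steps):
        var = math.exp(2.0 * log_sigma)
        grad_mu = w * (x - w * mu) - mu
        grad_log_sigma = 1.0 - (w * w + 1.0) * var
        mu += lr * grad_mu
        log_sigma += lr * grad_log_sigma
    return mu, log_sigma

def slow_step(xs, w, lr=0.1):
    """Slow learning: one gradient step on the shared weight w, using the
    freshly optimized local posterior of every example."""
    grad_w = 0.0
    for x in xs:
        mu, log_sigma = fast_inference(x, w)
        var = math.exp(2.0 * log_sigma)
        grad_w += x * mu - w * (mu * mu + var)
    return w + lr * grad_w / len(xs)

# Synthetic data from the toy model with a true weight of 1.5 (an assumption).
rng = random.Random(0)
xs = [1.5 * rng.gauss(0.0, 1.0) + rng.gauss(0.0, 1.0) for _ in range(200)]

w_fit = 0.5
for _ in range(200):
    w_fit = slow_step(xs, w_fit)

# For this toy model the maximum-likelihood weight has a closed form.
w_mle = math.sqrt(sum(x * x for x in xs) / len(xs) - 1.0)
print(f"fitted w = {w_fit:.3f}, closed-form MLE = {w_mle:.3f}")
```

Calling `fast_inference` with a larger `steps` value at test time, while leaving `w` untouched, is the toy analogue of spending more inference-time compute without retraining.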

Architecture

Probabilistic graphical model and architecture of LTM. Shows latent thought vectors z controlling the generation of token sequence x.

Evaluation Highlights

LTM-Large (76M parameters) achieves 3.05 validation perplexity on OpenWebText, outperforming GPT-2 Large (774M parameters), which has ~10x as many parameters
Zero-shot language modeling perplexity reduced by 91.7% compared to state-of-the-art results at GPT-2 scale
Demonstrates emergent few-shot in-context arithmetic reasoning in small models (e.g., LTM-Small), a capability usually reserved for much larger LLMs

Breakthrough Assessment

8/10

Proposes a fundamental architectural shift from pure autoregression to latent-guided generation with a practical inference-time compute scaling law. Strong empirical results on efficiency make it a significant contribution.

⚙️ Technical Details

Problem Definition

Setting: Generative modeling of token sequences conditioned on latent variables

Inputs: Sequence of ground tokens x = (x^(0), ..., x^(N))

Outputs: Probability distribution over tokens p(x|z) and latent thought vectors z

Pipeline Flow

Latent Initialization (Sample z from prior or approximate posterior)
Fast Inference (Optimize the variational posterior over z with gradient steps that maximize the ELBO)
Thought-Guided Generation (Transformer decoder generates x conditioned on optimized z)

System Modules

Prior Model (Latent Space)

Defines the distribution of latent vectors before seeing data

Model or implementation: Isotropic Gaussian N(0, I)

Variational Posterior (Latent Space)

Approximates the true posterior p(z|x) for a specific sequence x

Model or implementation: Gaussian N(μ, σ²)

Transformer Decoder

Generates tokens autoregressively while attending to latent vectors

Model or implementation: Transformer Decoder with Cross-Attention
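As a rough illustration of how token states can attend to latent vectors, the following sketch implements single-head scaled dot-product cross-attention in plain Python. It makes simplifying assumptions that are not from the paper: there are no learned query/key/value projections, and the latent vectors serve as both keys and values. The names `softmax` and `cross_attend` are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(token_states, latents):
    """Each token state acts as a query over the latent vectors.
    Assumption: latents are used directly as keys and values."""
    d = len(latents[0])
    outputs = []
    for q in token_states:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in latents]
        weights = softmax(scores)
        outputs.append([sum(w * v[j] for w, v in zip(weights, latents)) for j in range(d)])
    return outputs

# Two one-hot latent vectors and two token states, each aligned with one latent.
latents = [[1.0, 0.0], [0.0, 1.0]]
token_states = [[10.0, 0.0], [0.0, 10.0]]
attended = cross_attend(token_states, latents)
print(attended)
```

Each output row is a convex combination of the latent vectors, so a token state aligned with one latent mostly copies that latent's content.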

Novel Architectural Elements

Layered Thought Vectors: Distinct sets of latent vectors attend to different layers of the decoder
Inference-as-Optimization: The 'forward pass' involves an inner loop of gradient descent steps to optimize z before generation
Short-context forcing: Intentionally small context window (k=256) forces the model to rely on latent z for long-range dependency information

Modeling

Base Model: Custom Transformer Decoder (GPT-2 scale)

Training Method: Classical Variational Bayes with Dual-Rate Optimization

Objective Functions:

Purpose: Maximize the Evidence Lower Bound (ELBO).

Formally: E_q(z|x)[log p_β(x|z)] - KL(q(z|x) || p(z))
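With the isotropic Gaussian prior N(0, I) and a Gaussian posterior N(μ, σ²) as listed under System Modules, the KL term has a standard closed form. The version below assumes the posterior covariance is diagonal and writes d for the latent dimension; both are notational assumptions rather than details taken from the paper.

```latex
\mathcal{L}(x)
  = \mathbb{E}_{q(z \mid x)}\big[\log p_\beta(x \mid z)\big]
  - \mathrm{KL}\big(q(z \mid x) \,\|\, p(z)\big),
\qquad
\mathrm{KL}\big(\mathcal{N}(\mu, \operatorname{diag}(\sigma^2)) \,\|\, \mathcal{N}(0, I)\big)
  = \frac{1}{2} \sum_{j=1}^{d} \big(\mu_j^2 + \sigma_j^2 - 1 - \log \sigma_j^2\big)
```

Only the expectation term then needs a sample-based estimate; the KL term can be computed exactly from μ and σ.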

Training Data:

OpenWebText dataset

Key Hyperparameters:

inference_steps_T_fast: 16
fast_learning_rate: 0.3
global_learning_rate: 0.0004
context_window_k: 256
latent_prior: Gaussian N(0, I)
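For readers who want these settings in one place, here is a minimal configuration sketch. The class name `LTMConfig` is hypothetical and the field names simply copy the labels above; only values listed in this summary are included.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LTMConfig:
    """Hypothetical container for the hyperparameters listed in this summary."""
    inference_steps_T_fast: int = 16
    fast_learning_rate: float = 0.3
    global_learning_rate: float = 0.0004
    context_window_k: int = 256
    latent_prior: str = "N(0, I)"

cfg = LTMConfig()
# Ratio between the per-sequence (fast) and global (slow) learning rates.
rate_ratio = cfg.fast_learning_rate / cfg.global_learning_rate
print(f"fast/slow learning-rate ratio: {rate_ratio:.0f}")
```

The two learning rates differ by a factor of about 750, which is the numerical sense in which the optimization is dual-rate.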

Compute: Not reported in the paper

Comparison to Prior Work

vs. GPT-2: LTM uses latent vectors z and requires inference-time optimization; GPT-2 is purely autoregressive
vs. Diffusion Models: LTM uses a structured latent space with variational inference; diffusion models use iterative denoising in token/embedding space
vs. standard VAE: LTM uses per-instance optimization (classical VB) instead of an amortized encoder network to avoid posterior collapse

Limitations

Requires iterative gradient updates during inference, increasing latency compared to standard AR models
Currently uses a simple isotropic Gaussian prior, which may limit expressivity compared to learned priors
Analysis limited to GPT-2 scales; scaling to very large models (70B+) remains unverified
The short context window (256) makes the model rely heavily on the compression capacity of z

Reproducibility

Code: https://deqiankong.github.io/blogs/ltm

The project page is available at https://deqiankong.github.io/blogs/ltm (the paper states only that a 'project page is available' and gives no direct GitHub link, so code may be linked there or forthcoming).

📊 Experiments & Results

Evaluation Setup

Pretraining on OpenWebText and evaluation on perplexity, zero-shot modeling, and generation tasks

Benchmarks:

OpenWebText (Language Modeling, Perplexity)
LAMBADA (Zero-shot completion)
WikiText-103 (Language Modeling)

Metrics:

Perplexity (PPL)
MAUVE
Gen PPL (Generative Perplexity)
Entropy
Statistical methodology: Not explicitly reported in the paper
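Perplexity is the exponential of the average negative log-likelihood per token. The helper below shows the standard definition using natural logarithms; the function name `perplexity` is hypothetical, and how the paper evaluates log-likelihood under the latent model is not specified in this summary.

```python
import math

def perplexity(token_log_probs):
    """exp of the mean negative log-likelihood (natural log) per token."""
    nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll)

# A model that assigns probability 1/4 to every token has perplexity 4.
uniform_ppl = perplexity([math.log(0.25)] * 10)
print(uniform_ppl)
```

Lower values mean the model is, on average, less surprised by each token.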

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Pretraining efficiency comparisons showing LTMs match larger GPT-2 models with fewer parameters.
OpenWebText	Perplexity	11.5	10.95	-0.55
OpenWebText	Perplexity	11.5	3.05	-8.45
Zero-shot language modeling performance comparisons.
OpenWebText (Zero-shot)	Perplexity Reduction	Not reported in the paper	Not reported in the paper	Not reported in the paper
Text generation quality measured by MAUVE score.
OpenWebText (Generation)	MAUVE	0.89	0.93	+0.04
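The Δ column and the "~10x" parameter ratio quoted earlier can be re-derived from the reported numbers. This is only an arithmetic check on values already in this summary; the variable names are illustrative.

```python
# (benchmark, metric, baseline, this paper), copied from the table above.
rows = [
    ("OpenWebText", "Perplexity", 11.5, 10.95),
    ("OpenWebText", "Perplexity", 11.5, 3.05),
    ("OpenWebText (Generation)", "MAUVE", 0.89, 0.93),
]

deltas = [round(ours - baseline, 2) for _, _, baseline, ours in rows]
print(deltas)

# GPT-2 Large (774M parameters) relative to LTM-Large (76M parameters).
param_ratio = 774 / 76
print(f"parameter ratio: {param_ratio:.1f}x")
```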

Experiment Figures

Scaling behavior of LTM performance (Perplexity) against Inference Steps.

Trade-off between Compute Efficiency and Sample Efficiency.

Main Takeaways

Inference Steps Scaling: Performance improves monotonically with the number of inference steps (optimization iterations for z), offering a trade-off between compute and accuracy.
Latent Size Scaling: Increasing the number/dimension of latent vectors improves performance, acting as another scaling axis.
Parameter Efficiency: LTMs achieve comparable or better perplexity than standard Transformers with significantly fewer parameters (e.g., 5-6% of GPT-2 Large parameters).
Compute Trade-off: LTMs introduce a trade-off between training compute (trFLOPs) and inference compute; spending more compute at inference time allows for smaller models.

📚 Prerequisite Knowledge

Prerequisites

Variational Bayes / Variational Inference
Transformer Architecture (Decoder-only)
Autoregressive Language Modeling
Langevin Dynamics (for context)

Key Terms

Latent Thought Vectors: Continuous vector representations (z) that exist in a latent space and guide the generation of the visible token sequence

Dual-rate optimization: A training scheme alternating between optimizing local parameters (latent vectors specific to a sequence) and global parameters (model weights shared across all data)

Inference-time compute: Computational effort spent during the generation phase (specifically optimizing latent vectors) to improve output quality, distinct from training compute

ELBO: Evidence Lower Bound, a tractable lower bound on the log marginal likelihood that variational inference maximizes in place of the intractable true likelihood

Posterior collapse: A failure mode in VAEs where the model ignores the latent variable z and generates based solely on the autoregressive decoder

trFLOPs/tok: Training floating-point operations per token, a metric for training compute cost normalized per token

Cross-attention: Attention mechanism where the model attends to the latent thought vectors (keys/values) using the text sequence as queries

Langevin dynamics: An iterative method for sampling from a probability distribution using gradients and noise injection
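A generic sketch of unadjusted Langevin dynamics follows, included only to make the definition concrete. It is not the paper's inference procedure: the target distribution N(2, 1), the step size, and the function name `langevin_sample` are all assumptions chosen for illustration.

```python
import math
import random

def langevin_sample(grad_log_p, z0, step, n_steps, rng):
    """Unadjusted Langevin dynamics: a gradient step on log p plus Gaussian noise."""
    z = z0
    samples = []
    for _ in range(n_steps):
        z = z + step * grad_log_p(z) + math.sqrt(2.0 * step) * rng.gauss(0.0, 1.0)
        samples.append(z)
    return samples

# Target: N(2, 1), whose score is d/dz log p(z) = -(z - 2).
rng = random.Random(0)
samples = langevin_sample(lambda z: -(z - 2.0), z0=0.0, step=0.05, n_steps=20000, rng=rng)

kept = samples[1000:]  # discard burn-in
mean = sum(kept) / len(kept)
var = sum((s - mean) ** 2 for s in kept) / len(kept)
print(f"sample mean = {mean:.2f}, sample variance = {var:.2f}")
```

The sample mean and variance should land near 2 and 1, with a small bias from the finite step size.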

MAUVE: A metric for evaluating text generation quality by comparing the distribution of generated text to human text