Continuous Chain of Thought Enables Parallel Exploration and Reasoning

📝 Paper Summary

Chain of Thought (CoT), Latent Reasoning, Search and Planning

CoT2 replaces discrete tokens with continuous vectors to enable tracking multiple reasoning paths in parallel within a single trace, constrained by embedding dimension rather than vocabulary size.

Core Problem

Standard Chain-of-Thought relies on discrete token sampling, which forces early commitment to a single reasoning path and limits information flow to at most log2(v) bits per step, where v is the vocabulary size.

Why it matters:

Discrete sampling causes 'snowballing errors' where early mistakes derail the entire reasoning chain, preventing exploration of alternatives without expensive multi-path decoding
The information bottleneck of discrete tokens wastes the high-dimensional capacity of modern embeddings (d >> log v), limiting how much context or state can be passed forward

Concrete Example: In the Minimum Non-Negative Sum (MNNS) task, a discrete model must commit to a specific + or - sign for a number at each step. If it picks the wrong sign early, it cannot recover. CoT2 maintains a superposition of both possibilities in a continuous vector until the final decision.
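
To make the MNNS example concrete, here is a minimal brute-force sketch in Python. The function names (`mnns`, `partial_sums_by_step`) are illustrative and not taken from the paper. The second function lists every partial sum reachable at each step, which is the set of states a continuous token would have to keep in superposition.

```python
from itertools import product

def mnns(numbers):
    """Return (best_sum, signs): the smallest non-negative signed sum and its signs."""
    best = None
    for signs in product((1, -1), repeat=len(numbers)):
        total = sum(s * x for s, x in zip(signs, numbers))
        # Keep only non-negative totals and remember the smallest one.
        if total >= 0 and (best is None or total < best[0]):
            best = (total, signs)
    return best

def partial_sums_by_step(numbers):
    """For each step, the sorted list of partial sums reachable by any sign choice."""
    states = {0}
    history = []
    for x in numbers:
        # Every current state branches into a "+" and a "-" successor.
        states = {s + x for s in states} | {s - x for s in states}
        history.append(sorted(states))
    return history

best_sum, best_signs = mnns([3, 5, 4])       # (2, (1, -1, 1)), i.e. 3 - 5 + 4
reachable = partial_sums_by_step([3, 5, 4])  # up to 2, 4, then 8 states
```

A discrete trace follows exactly one entry of each list in `reachable`; committing to 8 at the second step, for instance, rules out the optimal answer 2.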

Key Novelty

Chain of Thought with Continuous Tokens (CoT2)

Instead of sampling one token, the model outputs a weighted sum (convex combination) of token embeddings, effectively superposing multiple potential thoughts
Introduces 'budget-constrained supervision' (CSFT) where the model is trained to match the distribution of the top-B best reasoning trajectories, enabling parallel exploration
Proposes Multi-Token Sampling (MTS) for inference, which averages K discrete tokens to approximate the ideal continuous state, bridging deterministic and generative reasoning
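
The first and third points can be illustrated with a small NumPy sketch. It assumes a random embedding table `E` and that MTS draws its K tokens independently with replacement; both are simplifications for illustration, not details confirmed by the summary.

```python
import numpy as np

def continuous_token(probs, E):
    """Convex combination of embedding rows: probs (v,), E (v, d) -> (d,)."""
    return probs @ E

def multi_token_sample(probs, E, K, rng):
    """Sample K token ids from probs (assumed i.i.d.) and average their embeddings."""
    ids = rng.choice(len(probs), size=K, p=probs)
    return E[ids].mean(axis=0)

rng = np.random.default_rng(0)
v, d = 6, 4                      # toy vocabulary size and embedding dimension
E = rng.normal(size=(v, d))      # hypothetical embedding table

# A distribution that splits its mass between two candidate "thoughts".
probs = np.array([0.5, 0.5, 0.0, 0.0, 0.0, 0.0])

z_full = continuous_token(probs, E)                     # exact superposition
z_k1 = multi_token_sample(probs, E, K=1, rng=rng)       # ordinary discrete sample
z_k256 = multi_token_sample(probs, E, K=256, rng=rng)   # close to z_full
```

With K = 1 the result is a single token embedding, which is ordinary discrete sampling; as K grows, the average approaches the full convex combination.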

Architecture

Illustration of the supervision strategy comparing Discrete CoT (single path) vs. CoT2 (superposition of multiple paths).

Evaluation Highlights

CoT2 with the full trajectory budget (B = |T|) reaches near 100% accuracy on MNNS (Minimum Non-Negative Sum), significantly outperforming discrete CoT, which struggles with early commitments
Pass@1 performance of CoT2 matches or exceeds Pass@k of discrete baselines, effectively emulating parallel search within a single forward pass
Policy optimization (RL) with CoT2 further improves accuracy over supervised baselines on logical reasoning tasks like ProntoQA and ProsQA

Breakthrough Assessment

8/10

Offers a theoretically grounded and empirically effective method to overcome the discrete branching bottleneck of LLMs, enabling 'single-trace' parallel search.

⚙️ Technical Details

Problem Definition

Setting: Next-token prediction where intermediate 'thought' steps are continuous vectors z_t in R^d, ending in a final discrete answer z_m

Inputs: Input context X in R^{n x d}

Outputs: Sequence of m tokens, where z_{1:m-1} are continuous and z_m is discrete

Pipeline Flow

Encoder (processes context X)
Continuous Decoder (autoregressively generates m-1 continuous tokens z_t)
Discrete Head (outputs final discrete answer z_m)

System Modules

Continuous Decoder

Generate intermediate reasoning states as continuous vectors

Model or implementation: Transformer (GPT-2 based in experiments)

Discrete Head

Produce final human-readable answer

Model or implementation: Linear projection + Softmax

Novel Architectural Elements

Input injection of continuous vectors: The model feeds its own softmax-weighted output vectors back as input for the next step, rather than sampling discrete IDs
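
The feedback loop can be sketched as follows. This is a toy illustration only: `toy_logits` is a hypothetical stand-in for the transformer (mean pooling plus a random projection), and tying the output projection to the embedding table `E` is an assumption made to keep the example short.

```python
import numpy as np

def softmax(x):
    x = x - x.max()              # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum()

def toy_logits(context, E, W):
    """Hypothetical stand-in for the transformer: pool context, project to vocab logits."""
    h = np.tanh(context.mean(axis=0) @ W)
    return h @ E.T

def rollout(X, E, W, m):
    """Generate m-1 continuous tokens, then one discrete answer id."""
    context = X.copy()
    alphas = []
    for _ in range(m - 1):
        alpha = softmax(toy_logits(context, E, W))
        z = alpha @ E                        # continuous token: no sampling step
        context = np.vstack([context, z])    # fed back as the next input vector
        alphas.append(alpha)
    answer = int(np.argmax(toy_logits(context, E, W)))  # final discrete token z_m
    return alphas, context, answer

rng = np.random.default_rng(0)
v, d, n, m = 5, 3, 2, 4
E = rng.normal(size=(v, d))      # hypothetical embedding table
W = rng.normal(size=(d, d))      # hypothetical model weights
X = rng.normal(size=(n, d))      # input context X in R^{n x d}
alphas, context, answer = rollout(X, E, W, m)
```

The only difference from a discrete decoder is the line computing `z`: a discrete model would sample an id from `alpha` and look up one row of `E` instead.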

Modeling

Base Model: GPT-2 architecture (trained from scratch)

Training Method: Continuous Supervised Fine-Tuning (CSFT) followed by GRPO (RL)

Objective Functions:

Purpose: Align the predicted continuous distribution with the target distribution over the top-B trajectories.

Formally: L_CSFT = sum_{t=1}^{m-1} D(alpha_t* || alpha_t), where D is cross-entropy or KL divergence.

Purpose: Ensure the final answer is correct.

Formally: Standard cross-entropy loss on the final discrete token z_m.

Purpose: Optimize the policy using reinforcement learning.

Formally: GRPO objective maximizing the reward of the final answer.
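
As a worked illustration of the first objective, the sketch below computes the per-trace CSFT term for both choices of D. The function names and the toy distributions are assumptions for illustration; the two variants differ only by the entropy of the target, which is constant with respect to the model.

```python
import math
import numpy as np

def csft_cross_entropy(targets, preds, eps=1e-12):
    """Sum over steps t < m of the cross-entropy H(alpha_t*, alpha_t)."""
    t = np.asarray(targets, dtype=float)
    p = np.asarray(preds, dtype=float)
    return float(-(t * np.log(p + eps)).sum())

def csft_kl(targets, preds, eps=1e-12):
    """Sum over steps t < m of KL(alpha_t* || alpha_t); zero-mass targets contribute 0."""
    t = np.asarray(targets, dtype=float)
    p = np.asarray(preds, dtype=float)
    mask = t > 0
    return float((t[mask] * (np.log(t[mask]) - np.log(p[mask] + eps))).sum())

# One continuous step over a 4-token vocabulary.
target = [[0.5, 0.5, 0.0, 0.0]]        # alpha_t*: two states held in superposition
pred = [[0.25, 0.25, 0.25, 0.25]]      # alpha_t: model is still uniform

ce = csft_cross_entropy(target, pred)  # log 4
kl = csft_kl(target, pred)             # log 4 - log 2 = log 2
```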

Adaptation: Full training from scratch

Training Data:

MNNS, ProntoQA, ProsQA datasets
Supervision targets constructed by running search algorithms (e.g., finding all valid paths) and creating weighted distributions (alpha*) over visited states
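
One way such targets could be built for MNNS is sketched below. The ranking rule (non-negative final sums first, then smallest magnitude) and the uniform weighting over the kept trajectories are assumptions for illustration; the summary does not specify either.

```python
from collections import Counter
from itertools import product

def budget_targets(numbers, B):
    """Per-step distributions over partial sums, built from the B best sign trajectories."""
    trajs = []
    for signs in product((1, -1), repeat=len(numbers)):
        partial, total = [], 0
        for s, x in zip(signs, numbers):
            total += s * x
            partial.append(total)
        trajs.append(partial)
    # Assumed ranking: non-negative final sums first, then smaller magnitude.
    trajs.sort(key=lambda p: (p[-1] < 0, abs(p[-1])))
    kept = trajs[:B]
    targets = []
    for t in range(len(numbers)):
        # Empirical (uniformly weighted) distribution over states visited at step t.
        counts = Counter(p[t] for p in kept)
        targets.append({state: c / len(kept) for state, c in counts.items()})
    return targets

one_path = budget_targets([3, 5, 4], B=1)   # discrete-CoT-like supervision
two_paths = budget_targets([3, 5, 4], B=2)  # superposes the two best trajectories
all_paths = budget_targets([3, 5, 4], B=8)  # full budget, B = |T|
```

With B = 1 every target is one-hot, which matches single-path supervision; larger B spreads mass over more states at each step.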

Key Hyperparameters:

embedding_dimension: Varied {16, 24, 32} for analysis
trajectory_budget_B: Varied {1, 4, 8, 16, |T|}
MTS_samples_K: Varied (controls parallelism at inference)

Compute: Not reported in the paper

Comparison to Prior Work

vs. COCONUT: CoT2 uses convex combinations of vocabulary embeddings (interpretable as superposition) rather than raw hidden states; CoT2 uses explicit budget-constrained supervision from multiple traces.
vs. Discrete CoT: CoT2 delays commitment by maintaining continuous state; capable of parallel exploration in one trace.
vs. Self-consistency: CoT2 achieves similar benefits (error reduction) within a single trace by packing multiple paths into the embedding space.

Limitations

Requires task-specific supervision (oracle search trajectories) to construct the continuous targets for CSFT.
Performance is sensitive to embedding dimension; requires sufficient capacity (d > log(v/B)) to pack multiple states.
Currently demonstrated on synthetic/structured reasoning tasks (MNNS, logic graphs); scalability to open-ended natural language generation is discussed but not fully empirically proven.

Reproducibility

No code release is reported. The ProntoQA and ProsQA datasets are existing, standard benchmarks; MNNS is a synthetic task defined in the paper. Architecture hyperparameters (embedding dimensions) are specified.

📊 Experiments & Results

Evaluation Setup

Controlled reasoning tasks requiring multi-step lookahead and branching

Benchmarks:

MNNS (Minimum Non-Negative Sum) (Combinatorial optimization / Search) [New]
ProntoQA (Logical reasoning (graph reachability))
ProsQA (Logical reasoning (conditional reachability))

Metrics:

Accuracy (Exact Match of final answer)
Pass@k
Statistical methodology: Not explicitly reported in the paper
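
For reference, Pass@k is commonly computed with the unbiased estimator below, given n sampled attempts of which c are correct. Whether the paper uses this exact estimator is not stated in the summary, so treat it as the conventional definition rather than the authors' procedure.

```python
from math import comb

def pass_at_k(n, c, k):
    """Probability that at least one of k attempts, drawn without replacement
    from n attempts containing c correct ones, is correct."""
    if k > n:
        raise ValueError("k must not exceed n")
    # comb(n - c, k) is 0 when fewer than k incorrect attempts exist.
    return 1.0 - comb(n - c, k) / comb(n, k)

p1 = pass_at_k(n=10, c=3, k=1)   # 0.3, plain accuracy
p2 = pass_at_k(n=10, c=3, k=2)   # 1 - 21/45
```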

Key Results

MNNS task results demonstrating the superiority of CoT2 over discrete baselines in search-heavy tasks.
Benchmark	Metric	Baseline	This Paper	Δ
MNNS	Accuracy	0.1	0.99	+0.89

Experiment Figures

Performance on MNNS task: (a) Pass@k comparison, (b) Accuracy vs Embedding Dimension, (c) Budget vs Embedding Dimension heatmap.

Convergence speed on MNNS task.

Main Takeaways

Optimal parallelism (supervision budget B) depends on embedding dimension d; a 'sweet spot' exists where d is large enough to support the superposition of B traces.
CoT2 consistently outperforms discrete CoT on tasks requiring search (MNNS, ProntoQA), effectively solving them with a single trace where discrete models fail.
The method acts as a 'single-trace self-consistency', where one continuous forward pass contains information equivalent to aggregating multiple discrete paths.
Reinforcement Learning (GRPO) further enhances CoT2, allowing the model to optimize its internal continuous representations beyond the initial supervised targets.

📚 Prerequisite Knowledge

Prerequisites

Transformer architecture and embedding spaces
Chain of Thought (CoT) prompting
Reinforcement Learning (Policy Optimization)
Probability simplices and convex combinations

Key Terms

CoT2: Chain of Thought with Continuous Tokens—a method where reasoning steps are dense vectors (weighted sums of embeddings) rather than single discrete tokens

MNNS: Minimum Non-Negative Sum. A task that assigns a + or - sign to each number in a list so that the signed sum is as small as possible while remaining non-negative; used as a proxy for search/planning capabilities

CSFT: Continuous Supervised Fine-Tuning—training the model to output a target probability distribution over tokens rather than a single ground-truth token

MTS: Multi-Token Sampling—an inference strategy that samples K discrete tokens and averages their embeddings to form a continuous token, controlling parallelism

superposition: Representing multiple distinct states simultaneously by taking a weighted sum of their embedding vectors

ProntoQA: A logical reasoning benchmark testing whether a target node is reachable from a start node in a graph via deductive steps

ProsQA: A logical reasoning benchmark similar to ProntoQA but asking which of two target nodes is reachable

GRPO: Group Relative Policy Optimization—an RL algorithm that optimizes policies based on relative performance of a group of outputs
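
The "relative performance of a group" idea can be sketched as a reward normalization step. This shows only the advantage computation in its commonly described form (reward minus group mean, divided by group standard deviation); the clipped policy-gradient update and any KL penalty are omitted, and the `eps` value is an assumption.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and std of its own group of rollouts."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Four rollouts for the same prompt; reward 1 means the final answer was correct.
adv = group_relative_advantages([1.0, 0.0, 0.0, 1.0])  # approx [1, -1, -1, 1]
flat = group_relative_advantages([1.0, 1.0, 1.0])      # all zeros: no learning signal
```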

COCONUT: A prior method (cited baseline) that replaces discrete tokens with the last hidden state of the LLM in a curriculum learning fashion

embedding dimension: The size (d) of the vector space used to represent tokens; determines the capacity for packing information in CoT2