MARCOS: Deep Thinking by Markov Chain of Continuous Thoughts

📝 Paper Summary

Chain-of-Thought (CoT) Reasoning, Latent Space Reasoning, Efficient LLM Inference

MarCos decouples reasoning from token generation by modeling thought as a Markov chain of continuous vectors, achieving faster inference and higher accuracy than token-based Chain-of-Thought.

Core Problem

Standard Chain-of-Thought forces models to 'think while speaking' token-by-token, creating an information bottleneck due to discretization and slowing down inference significantly.

Why it matters:

Autoregressive generation of thousands of CoT tokens is computationally expensive and slow.
Discretizing thoughts into tokens at every step causes information loss between reasoning states.
Forcing the model to generate tokens immediately limits the ability to plan or deliberate before outputting an answer.

Concrete Example: In a complex math problem, a standard LLM must output every intermediate calculation step as text tokens. This forces it to commit to a path word-by-word. MarCos instead updates internal continuous vectors (thoughts) multiple times without outputting text, only generating the final answer or requested intermediate steps when ready.

Key Novelty

Markov Chain of Continuous Thoughts (MarCos)

Separates 'thinking' (transitioning continuous latent states) from 'speaking' (decoding states into text), allowing the model to reason without generating tokens.
Models reasoning as a Conditional Hidden Markov Model (cHMM) where thoughts are hidden variables and text is the observable emission.
Introduces a randomness predictor that maps thoughts to a multimodal distribution, enabling step-level control over reasoning diversity rather than token-level sampling.
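The cHMM view above can be written as a factorization. This is a sketch under assumed notation, not the paper's own equation: z_k stands for the k-th continuous thought (called Neu_k elsewhere in this summary), y_k for the text emitted at step k, R_k for the auxiliary random variable, and K for the number of reasoning steps. How z_0 is initialized from the question is left unspecified here.

```latex
p(y_{1:K}, z_{1:K} \mid x)
  = \prod_{k=1}^{K}
    \underbrace{p(z_k \mid z_{k-1}, x)}_{\text{thinking (transition)}}
    \,
    \underbrace{p(y_k \mid z_k)}_{\text{speaking (emission)}},
\qquad
p(z_k \mid z_{k-1}, x)
  = \int p(z_k \mid z_{k-1}, R_k, x)\, p(R_k \mid x, z_{k-1})\, \mathrm{d}R_k
```

Under this reading, all step-level randomness enters through R_k, which is why sampling can be controlled per reasoning step rather than per token.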

Architecture

The MarCos inference pipeline showing the separation of Understanding, Thinking, and Speaking phases.

Evaluation Highlights

+4.74 accuracy points on GSM8K over token-based CoT (Qwen2.5-0.5B backbone), with up to 15.7x faster inference.
+8.66 accuracy points over CoLaR (the strongest prior continuous reasoning baseline) on GSM8K.
Generalizes out of domain, with gains of +4.68 accuracy points on both SVAMP and MultiArith.

Breakthrough Assessment

9/10

First latent reasoning method to match or outperform explicit CoT accuracy while offering massive speedups. The decoupling of thought and language aligns better with human cognition and solves the CoT efficiency bottleneck.

⚙️ Technical Details

Problem Definition

Setting: Multi-step reasoning in which the intermediate reasoning is carried by a sequence of latent variables z, and text y (rationale steps or the final answer) is decoded from those latents

Inputs: Natural language question x of length n

Outputs: Answer or reasoning steps y

Pipeline Flow

Understander (Encodes question)
Thinker (Updates latent neurons iteratively)
Speaker (Decodes neurons to text; optional during intermediate steps)
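The three-stage flow above can be illustrated with a toy loop. This is a minimal sketch, not the authors' implementation: random matrices stand in for the trained Transformer modules, and the names `understand`, `think`, `speak`, `run`, the sizes `D` and `N_NEURONS`, and the tiny `VOCAB` are all hypothetical. The point it shows is only the control flow: the Thinker updates latents several times, and the Speaker is called at the end (or optionally at intermediate steps).

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8          # hypothetical thought dimension
N_NEURONS = 4  # hypothetical number of latent neurons

# Random weights stand in for trained modules (toy assumption).
W_think = rng.normal(scale=0.3, size=(D, D))
VOCAB = ["2", "+", "3", "=", "5"]
W_speak = rng.normal(size=(D, len(VOCAB)))


def understand(question: str) -> np.ndarray:
    """Toy Understander: map the question to initial neurons."""
    seed = sum(ord(c) for c in question)
    return np.random.default_rng(seed).normal(size=(N_NEURONS, D))


def think(neurons: np.ndarray) -> np.ndarray:
    """Toy Thinker: one latent transition; no tokens are produced."""
    return np.tanh(neurons @ W_think)


def speak(neurons: np.ndarray) -> list[str]:
    """Toy Speaker: decode every neuron to a token in one parallel pass."""
    logits = neurons @ W_speak
    return [VOCAB[i] for i in logits.argmax(axis=1)]


def run(question: str, n_steps: int = 3, verbose_steps: bool = False) -> list[str]:
    neurons = understand(question)
    for k in range(n_steps):
        neurons = think(neurons)          # reasoning happens in latent space
        if verbose_steps:                 # speaking is optional mid-reasoning
            print(k, speak(neurons))
    return speak(neurons)                 # decode only when ready


answer = run("What is 2 + 3?")
```

Because the weights are random, the decoded tokens are meaningless; the sketch is only meant to show that the number of Thinker steps is independent of the number of tokens emitted.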

System Modules

Understander

Converts input question into a sequence of feature vectors

Model or implementation: Transformer (Qwen2.5 architecture)

Thinker (Reasoning Core)

Updates continuous thoughts (neurons) via a bi-directional transformer

Model or implementation: Bi-directional Transformer

Randomness Predictor (Reasoning Core)

Predicts the distribution of the auxiliary random variable R_k to guide thought transitions

Model or implementation: 2-layer MLP

Speaker

Translates shallow neurons into discrete text sentences

Model or implementation: Transformer-based decoder

Novel Architectural Elements

Separation of 'Thinking' (latent transitions) and 'Speaking' (emission) modules
Introduction of auxiliary random variable R_k with sparsity constraints to model multimodal thought distributions
Split of latent neurons into 'Deep' (internal reasoning) and 'Shallow' (interface with speaker) groups
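A rough illustration of two of these elements, the step-level random variable R_k and the deep/shallow split, is given below. Everything specific here is an assumption made for illustration: the diagonal Gaussian parameterization of R_k (the summary only says the distribution is multimodal and predicted by a 2-layer MLP), mean-pooling the neurons as the MLP input, storing shallow neurons last, and all sizes and function names.

```python
import numpy as np

rng = np.random.default_rng(0)
N_DEEP, N_SHALLOW, D, R_DIM, HIDDEN = 6, 2, 8, 4, 16  # hypothetical sizes

# Hypothetical weights of a 2-layer MLP randomness predictor.
W1 = rng.normal(scale=0.1, size=(D, HIDDEN))
W2 = rng.normal(scale=0.1, size=(HIDDEN, 2 * R_DIM))  # mean and log-variance


def predict_randomness(neurons: np.ndarray):
    """2-layer MLP on mean-pooled neurons -> (mean, log_var) of R_k."""
    hidden = np.maximum(neurons.mean(axis=0) @ W1, 0.0)  # ReLU
    out = hidden @ W2
    return out[:R_DIM], out[R_DIM:]


def sample_r(mean: np.ndarray, log_var: np.ndarray, temperature: float = 1.0):
    """Draw R_k once per reasoning step; temperature 0 returns the mean."""
    eps = rng.standard_normal(mean.shape)
    return mean + temperature * np.exp(0.5 * log_var) * eps


def split_neurons(neurons: np.ndarray):
    """Return (deep, shallow); assumes shallow neurons are stored last."""
    return neurons[:N_DEEP], neurons[N_DEEP:]


neurons = rng.normal(size=(N_DEEP + N_SHALLOW, D))
mean, log_var = predict_randomness(neurons)
r_k = sample_r(mean, log_var)            # one draw for the whole step
deep, shallow = split_neurons(neurons)   # only `shallow` would go to the Speaker
```

The design point this is meant to convey is that diversity is controlled by a single draw of R_k per step, instead of by sampling each output token.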

Modeling

Base Model: Qwen2.5-0.5B (structure used for Understander, Thinker, Speaker)

Training Method: Two-phase variational training scheme

Objective Functions:

Purpose: Maximize marginal probability of ground truth steps via ELBO.

Formally: L = L_reconstruction + L_KL + L_sparse

Purpose: Reconstruct ground truth text from latent thoughts.

Formally: L_reconstruction = -log P(y_k | Neu_k)

Purpose: Align predicted randomness prior with posterior inferred from ground truth.

Formally: L_KL = KL(q(R_k | y_k) || p(R_k | x, Neu_{k-1}))

Purpose: Encourage interpretable, mono-semantic dimensions in randomness variable.

Formally: L_sparse = lambda * ||R_k||_1
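To make the three terms concrete, here is a small numerical sketch of L = L_reconstruction + L_KL + L_sparse for a single step k. It assumes that both q(R_k | y_k) and p(R_k | x, Neu_{k-1}) are diagonal Gaussians so the KL term has a closed form; the summary does not state the distribution family, so treat that as an illustrative choice. The default `lam=0.01` is a placeholder, since the paper's lambda is not reported.

```python
import numpy as np


def reconstruction_loss(token_log_probs: np.ndarray) -> float:
    """-log P(y_k | Neu_k), from per-token log-probs of the ground-truth step."""
    return -float(np.sum(token_log_probs))


def gaussian_kl(mu_q, log_var_q, mu_p, log_var_p) -> float:
    """KL(q || p) for diagonal Gaussians (assumed family), summed over dims."""
    var_q, var_p = np.exp(log_var_q), np.exp(log_var_p)
    terms = log_var_p - log_var_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0
    return float(0.5 * np.sum(terms))


def sparsity_loss(r_k: np.ndarray, lam: float) -> float:
    """lambda * ||R_k||_1."""
    return float(lam * np.sum(np.abs(r_k)))


def total_loss(token_log_probs, mu_q, log_var_q, mu_p, log_var_p, r_k, lam=0.01):
    """Sum of the three terms; lam is a placeholder value, not from the paper."""
    return (reconstruction_loss(token_log_probs)
            + gaussian_kl(mu_q, log_var_q, mu_p, log_var_p)
            + sparsity_loss(r_k, lam))
```

For example, with identical prior and posterior the KL term is zero, and the loss reduces to the reconstruction term plus the L1 penalty on R_k.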

Training Data:

GSM8K-Aug (385K samples)
Trained on solution-based data (step-by-step solutions)

Key Hyperparameters:

parameter_count: Approx. 2B (0.5B x 4 modules)
sparsity_lambda: Not explicitly reported in the paper

Compute: Up to 15.7x inference speedup compared to autoregressive CoT. Training hardware not reported.

Comparison to Prior Work

vs. Coconut: MarCos models dynamic, multimodal transitions via R_k, whereas Coconut uses static/Gaussian transitions.
vs. CoLaR: MarCos decouples thinking from speaking entirely, whereas CoLaR aligns latent states to token blocks.
vs. CoT-SFT: MarCos performs reasoning in continuous space without generating tokens, offering speedup.
vs. Diffusion LLMs [not cited in paper]: MarCos resolves randomness in one pass via the latent variable R_k, offering a potential acceleration path for diffusion-based generation.

Limitations

Not compatible with existing token-by-token pre-trained models; requires training from scratch.
Currently evaluated on math benchmarks (GSM8K, SVAMP); applicability to general NLP tasks not yet fully explored.
Requires ground truth step-by-step solutions for the variational training scheme.
Limited to 0.5B parameter scale experiments due to computational resources.

Reproducibility

Code: https://anonymous.4open.science/r/MarCos/

The model is trained from scratch. Pre-training on massive-scale data (as for the original Qwen) was not performed due to compute limits.

📊 Experiments & Results

Evaluation Setup

Math reasoning tasks requiring multi-step derivation.

Benchmarks:

GSM8K (Grade school math word problems)
SVAMP (Math word problems with varying wording/semantics)
MultiArith (Multi-step arithmetic reasoning)

Metrics:

Accuracy (Acc)
Test Time (Inference efficiency)
Statistical methodology: Not explicitly reported in the paper

Key Results

Main results demonstrating MarCos superiority over continuous reasoning baselines and token-based CoT on GSM8K.
Benchmark	Metric	Baseline	This Paper	Δ
GSM8K	Accuracy	44.15 (CoLaR)	52.81	+8.66
GSM8K	Accuracy	48.07 (token-based CoT)	52.81	+4.74
GSM8K	Speedup Factor	1.0 (token-based CoT)	15.7	+14.7

Out-of-domain generalization results on SVAMP and MultiArith.
Benchmark	Metric	Baseline	This Paper	Δ
SVAMP	Accuracy	46.25	50.93	+4.68
MultiArith	Accuracy	78.33	83.01	+4.68
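The Δ column can be checked directly from the reported numbers. The short script below recomputes each difference from the table and also derives the relative gain, which helps distinguish absolute accuracy points from relative percentages. The helper names `check` and `relative_gain` are illustrative.

```python
# (benchmark, metric, baseline, this_paper, reported_delta), copied from the table
rows = [
    ("GSM8K", "Accuracy", 44.15, 52.81, 8.66),
    ("GSM8K", "Accuracy", 48.07, 52.81, 4.74),
    ("GSM8K", "Speedup Factor", 1.0, 15.7, 14.7),
    ("SVAMP", "Accuracy", 46.25, 50.93, 4.68),
    ("MultiArith", "Accuracy", 78.33, 83.01, 4.68),
]


def check(rows, tol=1e-6):
    """True for each row whose reported delta equals this_paper - baseline."""
    return [abs((ours - base) - delta) < tol for _, _, base, ours, delta in rows]


def relative_gain(base, ours):
    """Gain relative to the baseline, as a fraction (0.1 means 10%)."""
    return (ours - base) / base


results = check(rows)
gsm8k_vs_cot = relative_gain(48.07, 52.81)
```

All reported deltas are absolute differences. The gain over token-based CoT on GSM8K is 4.74 accuracy points, which corresponds to a relative improvement of roughly 10%.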

Experiment Figures

Conceptual comparison between Traditional CoT and MarCos.

Main Takeaways

MarCos is the first continuous reasoning method to match and surpass explicit token-based CoT in accuracy.
The method achieves massive inference speedups (up to 15.7x) by bypassing autoregressive token generation for intermediate steps.
Disentangling 'thinking' (latent) and 'speaking' (text) allows for effective non-autoregressive decoding in the speaking stage.
Step-level randomness control (via R_k) allows for interpretable steering of reasoning properties like depth and sentence length.

📚 Prerequisite Knowledge

Prerequisites

Hidden Markov Models (HMMs)
Variational Autoencoders (VAEs) and ELBO
Transformer architecture
Chain-of-Thought (CoT) prompting

Key Terms

CoT: Chain-of-Thought—a prompting technique where models generate intermediate reasoning steps before the final answer.

cHMM: Conditional Hidden Markov Model—a statistical model where the system transitions between hidden states (thoughts) conditioned on an input, and emits observable outputs (text).

ELBO: Evidence Lower Bound, a tractable lower bound on the log marginal likelihood of the data; variational inference maximizes it (equivalently, minimizes its negative as a loss) because the exact posterior over latent variables is intractable.

NAR: Non-Autoregressive—generating multiple tokens or outputs in parallel rather than one by one.

Latent Space: A continuous vector space where the model represents information (thoughts) internally, as opposed to the discrete space of tokens.

Sparsity Loss: A penalty term (L1 norm) encouraging the model to use only a few active dimensions in the latent representation, making the randomness more interpretable.

Teacher-Forcing: A training technique where the model is fed the ground-truth previous step as input, rather than its own generated prediction.