Efficient Reasoning with Hidden Thinking

📝 Paper Summary

Multimodal Large Language Models (MLLMs), Efficient Reasoning, Chain-of-Thought (CoT)

Heima accelerates MLLM reasoning by condensing verbose Chain-of-Thought steps into single hidden 'thinking tokens' while preserving accuracy and allowing optional textual reconstruction.

Core Problem

Chain-of-Thought reasoning improves performance but requires generating verbose intermediate text, significantly increasing inference costs and latency for large models.

Why it matters:

Massive parameter counts make generating long CoT sequences computationally expensive
Real-time applications require faster reasoning without sacrificing the accuracy benefits of CoT
Existing compression methods often degrade performance on complex reasoning tasks

Concrete Example: A standard CoT might generate 100+ tokens detailing 'Summary', 'Caption', and 'Reasoning' steps before answering. Heima condenses each of these steps into a single 'thinking token' (<Thinking_of_Summary>, etc.), 3 tokens in total, and skips generating the text.

Key Novelty

Latent Space Chain-of-Thought (Heima)

Encodes entire reasoning steps (like a summary or caption) into a single special 'thinking token' whose last hidden state captures the semantic content
Uses a progressive training strategy to gradually replace text steps with thinking tokens, ensuring the model learns compact latent representations
Decouples reasoning from generation: an Encoder performs fast latent reasoning, while a separate Decoder can translate those latent states back into text for interpretability
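The progressive replacement of text steps by thinking tokens can be sketched as a schedule of training targets. This is a minimal illustration, not the authors' code: the left-to-right replacement order, the space-joined target string, the helper names, and the token name <Thinking_of_Caption> (inferred from the naming pattern of the other tokens) are all assumptions.

```python
from typing import List

# Stage names follow the three steps named in the summary.
# <Thinking_of_Caption> is an assumed name, inferred from the pattern.
STAGES = ["Summary", "Caption", "Reasoning"]


def thinking_token(stage: str) -> str:
    """Special token standing in for one whole CoT stage."""
    return f"<Thinking_of_{stage}>"


def progressive_targets(cot_texts: List[str], answer: str) -> List[str]:
    """Return one training target per schedule step k = 0..K.

    At step k the first k CoT stages are replaced by their thinking
    tokens (assumed left-to-right order); later stages stay as text.
    """
    if len(cot_texts) != len(STAGES):
        raise ValueError("need one CoT text per stage")
    targets = []
    for k in range(len(STAGES) + 1):
        parts = [
            thinking_token(stage) if i < k else text
            for i, (stage, text) in enumerate(zip(STAGES, cot_texts))
        ]
        targets.append(" ".join(parts + [answer]))
    return targets
```

With three stages the schedule has four steps: step 0 is the original verbose CoT, and the last step contains only thinking tokens followed by the answer.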

Architecture

Overview of the Heima framework, comparing standard CoT with Heima. The Heima Encoder generates 'Thinking Tokens' (e.g., <Thinking_of_Summary>) instead of text and proceeds directly to the answer. The Heima Decoder takes the hidden states of these tokens and reconstructs the text.

Evaluation Highlights

Reduces generated tokens to as little as 6% of the original CoT volume while maintaining accuracy
Achieves comparable or better zero-shot accuracy on reasoning benchmarks compared to standard verbose MLLMs
Successfully reconstructs interpretable reasoning chains from hidden tokens using the Heima Decoder, verifying the semantic richness of the compressed states

Breakthrough Assessment

8/10

Significant efficiency gain (94% token reduction) without accuracy loss is a strong result. The ability to decode the latent state back into text addresses the black-box nature of latent reasoning.

⚙️ Technical Details

Problem Definition

Setting: Multimodal reasoning where visual input X_v and query X_q are processed to generate an answer Y_a, utilizing intermediate reasoning steps

Inputs: Image X_v, textual query X_q

Outputs: Final answer Y_a (directly from Heima Encoder) or reconstructed CoT trajectory (from Heima Decoder)

Pipeline Flow

Heima Encoder (Image + Query -> Thinking Tokens -> Answer)
Heima Decoder (Thinking Token Hidden State + Prompt -> Reconstructed Text CoT)

System Modules

Heima Encoder

Perform efficient reasoning by generating thinking tokens instead of text, then producing the final answer

Model or implementation: Fine-tuned MLLM

Heima Decoder

Reconstruct the verbose textual reasoning process from the hidden states of thinking tokens for interpretability

Model or implementation: Standard LLM (text-only)

Novel Architectural Elements

Replacement of variable-length textual reasoning chains with single 'thinking tokens' in the output sequence
Injection of Encoder's hidden states directly into the Decoder's embedding layer to bridge latent reasoning and textual reconstruction
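The hidden-state injection described above can be illustrated with plain Python lists standing in for tensors. This is only a sketch under stated assumptions: the function name and arguments are hypothetical, and it assumes the Encoder hidden state already has the same width as the Decoder's token embeddings (a real system might need a projection, which is omitted).

```python
from typing import List

Vector = List[float]


def build_decoder_inputs(
    token_embeddings: List[Vector],
    placeholder_index: int,
    thinking_hidden_state: Vector,
) -> List[Vector]:
    """Swap the placeholder token's embedding for the Encoder hidden state.

    Assumes the hidden state has the Decoder's embedding width; any
    projection between the two models is left out of this sketch.
    """
    width = len(token_embeddings[0])
    if len(thinking_hidden_state) != width:
        raise ValueError("hidden state width must match embedding width")
    # Copy every vector so the caller's embeddings are left untouched.
    inputs = [list(vec) for vec in token_embeddings]
    inputs[placeholder_index] = list(thinking_hidden_state)
    return inputs
```

All other positions (prompt and query tokens) keep their ordinary embeddings, so the Decoder sees a normal input sequence with one position carrying the latent reasoning state.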

Modeling

Base Model: MLLM (Encoder) and LLM (Decoder) - specific architectures not named in text snippet

Training Method: Supervised Fine-Tuning with Progressive Encoding

Objective Functions:

Purpose: Train the model to generate the correct answer and thinking tokens.

Formally: Standard next-token prediction loss on the modified dataset where text CoTs are replaced by thinking tokens.
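For reference, a standard next-token prediction loss of this kind is usually written as below. The symbols X_v and X_q come from the problem definition; the model parameters \theta, the target tokens y_t, and the length T are notation assumed here, where y_1, ..., y_T is the target sequence of thinking tokens followed by the answer tokens.

```latex
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_{\theta}\left(y_t \mid X_v, X_q, y_{<t}\right)
```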

Adaptation: Full fine-tuning (implied)

Training Data:

Original dataset: {Image, Query, [CoT_1, CoT_2...], Answer}
Heima dataset: {Image, Query, [<Thinking_1>, <Thinking_2>...], Answer}
Decoder dataset: {Hidden_State, Prompt, Query, Original_CoT_Text}
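The three record layouts above can be sketched as simple dictionary transformations. The field names, function names, and the choice of one Decoder record per CoT step are illustrative assumptions; the hidden states would come from the trained Encoder and are passed in here as opaque lists.

```python
from typing import Any, Dict, List


def to_heima_sample(sample: Dict[str, Any], stage_names: List[str]) -> Dict[str, Any]:
    """Encoder record: every textual CoT step becomes one thinking token."""
    if len(sample["cot"]) != len(stage_names):
        raise ValueError("need one stage name per CoT step")
    return {
        "image": sample["image"],
        "query": sample["query"],
        "cot": [f"<Thinking_of_{name}>" for name in stage_names],
        "answer": sample["answer"],
    }


def to_decoder_samples(
    sample: Dict[str, Any],
    hidden_states: List[List[float]],
    prompts: List[str],
) -> List[Dict[str, Any]]:
    """Decoder records: one per CoT step, pairing a hidden state with its text."""
    if not (len(sample["cot"]) == len(hidden_states) == len(prompts)):
        raise ValueError("need one hidden state and one prompt per CoT step")
    return [
        {
            "hidden_state": state,
            "prompt": prompt,
            "query": sample["query"],
            "target_text": text,
        }
        for state, prompt, text in zip(hidden_states, prompts, sample["cot"])
    ]
```

Note that the Decoder records carry no image field, matching the layout listed above.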

Compute: Heima Encoder is faster and more memory-efficient because it generates far fewer tokens

Comparison to Prior Work

vs. Standard CoT: Generates far fewer tokens (thinking tokens only), as little as 6% of the original CoT volume
vs. Latent CoT (Hao et al.): Applies to MLLMs (vs. GPT-2) and aims for generalizability rather than task-specific overfitting
vs. Contextual Compression: Focuses on compressing the *reasoning process* itself into latent states, not just the input context

Limitations

Requires training a separate Decoder for each reasoning stage if interpretability is needed
Decoder does not see the image directly, relying solely on the hidden state's capacity to retain visual information
Progressive training strategy adds complexity to the fine-tuning process

Reproducibility

No specific code URL or repository provided in the text. The method relies on standard MLLM architectures and custom data formatting.

📊 Experiments & Results

Evaluation Setup

Zero-shot reasoning benchmarks on MLLMs

Benchmarks:

Not specifically named in snippet (Multimodal Reasoning)

Metrics:

Accuracy (Zero-shot)
Number of generated tokens (Efficiency)
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Efficiency results showing massive reduction in token usage.
Reasoning Benchmarks (aggregated)	Token Count	100% (relative)	6% (relative)	-94%
Performance results indicating accuracy is maintained or improved.
Reasoning Benchmarks (aggregated)	Zero-shot Accuracy	Standard Performance	Comparable/Superior	Positive or Neutral
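The relationship between the relative token count and the Δ column is simple arithmetic, sketched below. The function name and the example counts (100 baseline tokens, 6 compressed tokens) are illustrative stand-ins for the reported 100% and 6% figures, not measured values.

```python
def token_reduction(baseline_tokens: float, compressed_tokens: float) -> dict:
    """Relate a baseline and compressed token count in three equivalent ways."""
    if baseline_tokens <= 0 or compressed_tokens <= 0:
        raise ValueError("token counts must be positive")
    kept = compressed_tokens / baseline_tokens
    return {
        "kept_fraction": kept,            # share of tokens still generated
        "reduction": 1.0 - kept,          # share of tokens removed
        "fold_fewer": 1.0 / kept,         # how many times fewer tokens
    }


stats = token_reduction(100, 6)
```

Keeping 6% of the tokens is a 94% reduction, which corresponds to roughly 16.7 times fewer generated tokens.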

Experiment Figures

The Progressive Encoding strategy. It visualizes the training stages where CoT steps are gradually replaced by thinking tokens one by one.

Input construction for the Heima Decoder. It illustrates how the hidden state H_<CoT> replaces the token embedding in the Decoder's input sequence.

Main Takeaways

Heima cuts the number of generated tokens to as little as 6% of the original CoT volume without sacrificing reasoning accuracy.
The 'Thinking Tokens' successfully encapsulate complex multimodal reasoning information, as evidenced by the Decoder's ability to reconstruct the text.
The approach indicates that explicit verbose text is not strictly necessary for complex reasoning, provided the hidden states are trained to carry the reasoning content.

📚 Prerequisite Knowledge

Prerequisites

Multimodal Large Language Models (MLLMs)
Chain-of-Thought (CoT) reasoning
Transformer architecture (hidden states, tokens)
Next-token prediction

Key Terms

CoT: Chain-of-Thought—a technique where models generate intermediate reasoning steps before the final answer

MLLM: Multimodal Large Language Model—AI models capable of processing both text and visual inputs

Thinking Token: A special token (e.g., <Thinking_of_Reasoning>) introduced by Heima to represent an entire reasoning step in its hidden state, replacing verbose text

Latent Space: The internal vector representation of data within a neural network, as opposed to the explicit textual output

KV Cache: Key-Value Cache, a mechanism that stores the attention keys and values of already-processed tokens so they are not recomputed at each generation step

Heima Encoder: The reasoning model that processes inputs and generates compact thinking tokens followed by the answer

Heima Decoder: A standard LLM trained to take the hidden state of a thinking token and reconstruct the original textual reasoning step

Progressive Encoding: A training strategy where the number of CoT stages compressed into thinking tokens is gradually increased from 0 to K, the total number of stages