iCLP: Large Language Model Reasoning with Implicit Cognition Latent Planning

📝 Paper Summary

Chain of Thought (CoT) Reasoning, Latent Space Planning, Supervised Fine-Tuning

iCLP enables LLMs to reason more accurately by first generating compact, discrete latent plans (learned via a vector-quantized autoencoder) that guide subsequent natural language thought generation.

Core Problem

Explicit textual plans for LLM reasoning are often prone to hallucinations, lack generalization across diverse tasks, and are difficult to generate accurately.

Why it matters:

Current methods relying on explicit plans (like ReAct) struggle because specific textual instructions often fail to generalize to new problems.
Generating detailed explicit plans introduces errors and hallucinations that derail the subsequent reasoning process.
Human cognition uses subconscious, implicit patterns rather than always verbalizing explicit steps, a mechanism current LLMs lack.

Concrete Example: When solving a math problem, an explicit planner might generate a rigid text step like 'Use the Pythagorean theorem' which could be slightly off-context or hallucinated. In contrast, iCLP generates a latent token sequence representing a high-level abstract strategy (e.g., 'geometric decomposition') that flexibly guides the reasoning without committing to brittle text.

Key Novelty

Implicit Cognition Latent Planning (iCLP)

Distills explicit plans from correct reasoning traces, then compresses them into discrete latent codes using a vector-quantized autoencoder (VQ-VAE).
Treats planning as a 'subconscious' process: the LLM predicts these latent plan tokens first, which then condition the generation of the explicit chain-of-thought.
Decouples planning (latent space) from reasoning (language space), allowing the model to learn generalizable, compact reasoning patterns.

Architecture

The complete iCLP pipeline: (1) Explicit Plan Distillation from CoT, (2) Latent Plan Space Learning via VQ-VAE, and (3) Fine-tuning the LLM with Latent Plans.

Evaluation Highlights

Achieves performance competitive with GRPO (reinforcement learning) on MATH and CodeAlpaca using only supervised fine-tuning on small models such as Qwen2.5-7B.
+10% average accuracy improvement over base models on out-of-domain datasets (AIME 2024, MATH-500) via cross-dataset generalization.
Reduces token cost by 10% on average compared to zero-shot CoT prompting while improving accuracy.

Breakthrough Assessment

8/10

Strong conceptual novelty in moving planning to a discrete latent space while keeping reasoning explicit. The method demonstrably improves generalization and efficiency without complex RL, offering a potent alternative to purely text-based planning.

⚙️ Technical Details

Problem Definition

Setting: Step-by-step reasoning where generation is conditioned on latent plans.

Inputs: Natural language question Q.

Outputs: A sequence of latent plans followed by the final reasoning chain and solution.

Pipeline Flow

Plan Distillation (Extract explicit plans from CoT)
Latent Space Learning (Train VQ-VAE to encode plans)
Inference (LLM generates Latent Plans → LLM generates CoT)
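
At inference time the model emits latent plan tokens first and the chain of thought second. The following is a minimal sketch of how a decoded output could be split into those two parts. The `<plan_k>` token format and the function name are assumptions made for illustration; the source does not specify how the codebook tokens are spelled.

```python
import re
from typing import List, Tuple

# Hypothetical spelling of a latent plan token; not taken from the paper.
PLAN_TOKEN = re.compile(r"<plan_(\d+)>")


def split_plan_and_reasoning(generated: str) -> Tuple[List[int], str]:
    """Split a decoded sequence into leading plan indices and the CoT text."""
    indices: List[int] = []
    pos = 0
    while True:
        match = PLAN_TOKEN.match(generated, pos)
        if match is None:
            break  # first non-plan token: the reasoning text starts here
        indices.append(int(match.group(1)))
        pos = match.end()
    return indices, generated[pos:].strip()


# Toy decoded output, invented for this example.
example = "<plan_12><plan_7><plan_33> Step 1: factor the quadratic. Answer: 4"
plan, cot = split_plan_and_reasoning(example)
print(plan)
print(cot)
```

Only the plan tokens at the very start of the sequence are consumed, which mirrors the plan-then-reason ordering described above.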

System Modules

Plan Distiller

Extracts explicit text plans from existing reasoning trajectories.

Model or implementation: Off-the-shelf LLM (e.g., DeepSeek-V3)

Plan Encoder (VQ-VAE)

Compresses explicit text plans into discrete latent codes.

Model or implementation: Encoder-only Transformer + VQ Codebook
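
The quantization step of a VQ codebook can be illustrated with a nearest-neighbour lookup. This is a minimal pure-Python sketch of the general VQ-VAE technique, not the paper's implementation: the 2-dimensional vectors, the 3-entry codebook, and the function names are all made up for the example.

```python
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(vectors: Sequence[Sequence[float]],
             codebook: Sequence[Sequence[float]]) -> List[int]:
    """Map each encoder output to the index of its nearest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(v, codebook[k]))
        for v in vectors
    ]


# Toy values, invented for this example.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]
encoder_outputs = [[0.9, 1.2], [-0.8, 1.7], [0.1, -0.2]]
indices = quantize(encoder_outputs, codebook)
print(indices)  # [1, 2, 0]
```

The resulting integer indices are what make the plan "discrete": each one can later be treated like an ordinary vocabulary token.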

Reasoning LLM

Generates latent plans followed by explicit reasoning steps.

Model or implementation: Qwen2.5-7B (Fine-tuned)

Novel Architectural Elements

Integration of VQ-VAE codebook indices as special tokens in the LLM vocabulary to enable 'latent planning' before text generation.
Hybrid generation pipeline where the model switches between latent space (for planning) and language space (for reasoning) within a single autoregressive generation pass.
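
Treating codebook indices as special tokens amounts to appending one new vocabulary entry per codebook slot. The sketch below shows that bookkeeping with a plain dictionary. The `<plan_k>` naming, the toy vocabulary, and the helper names are assumptions for illustration only.

```python
from typing import Dict, List


def extend_vocab(vocab: Dict[str, int], codebook_size: int) -> Dict[str, int]:
    """Return a copy of vocab with one new token id per codebook entry."""
    extended = dict(vocab)
    next_id = max(extended.values()) + 1 if extended else 0
    for k in range(codebook_size):
        extended[f"<plan_{k}>"] = next_id + k  # hypothetical token spelling
    return extended


def plan_indices_to_token_ids(indices: List[int], vocab: Dict[str, int]) -> List[int]:
    """Translate codebook indices into ids of the extended vocabulary."""
    return [vocab[f"<plan_{k}>"] for k in indices]


# Toy vocabulary, invented for this example.
base_vocab = {"<pad>": 0, "the": 1, "answer": 2}
vocab = extend_vocab(base_vocab, codebook_size=4)
ids = plan_indices_to_token_ids([2, 0, 3], vocab)
print(ids)  # [5, 3, 6]
```

With Hugging Face Transformers, the analogous steps would typically be `tokenizer.add_tokens(...)` followed by `model.resize_token_embeddings(len(tokenizer))`; the source does not state which library the authors use, so that mapping is an assumption.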

Modeling

Base Model: Qwen2.5-7B

Training Method: Supervised Fine-Tuning (SFT) on mixed latent/text data

Objective Functions:

Purpose: Train VQ-VAE to reconstruct plans.

Formally: Loss = Reconstruction_Loss + ||sg[Encoder(p)] - Quantized||^2 (Codebook Loss) + ||sg[Quantized] - Encoder(p)||^2 (Commitment Loss), where sg[·] denotes the stop-gradient operator and p is an explicit plan.

Purpose: Fine-tune LLM to generate latent plans and text.

Formally: Standard Next-Token Prediction (Cross-Entropy) on the sequence [Question, Latent_Plans, CoT].
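
The two objectives above can be written in conventional notation as follows. The symbols are assumptions introduced here for readability: E is the plan encoder, e the selected codebook vector, sg the stop-gradient operator, and x the concatenated training sequence [Question, Latent_Plans, CoT]. No weighting coefficient is shown on the commitment term because the summary does not report one, and whether question tokens are masked out of the second loss is likewise not stated.

```latex
\begin{aligned}
\mathcal{L}_{\mathrm{VQ}}
  &= \mathcal{L}_{\mathrm{rec}}
   + \bigl\lVert \mathrm{sg}[E(p)] - e \bigr\rVert_2^2
   + \bigl\lVert \mathrm{sg}[e] - E(p) \bigr\rVert_2^2 \\
\mathcal{L}_{\mathrm{SFT}}
  &= -\sum_{t=1}^{|x|} \log \pi_\theta\bigl(x_t \mid x_{<t}\bigr)
\end{aligned}
```

The second term of the first line moves the codebook vector toward the encoder output, while the third keeps the encoder output close to its assigned codebook vector.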

Adaptation: Full fine-tuning of LLM + Extension of embedding layer for codebook tokens

Trainable Parameters: All parameters of the LLM; Encoder/Codebook/Decoder for the VQ-VAE phase.

Training Data:

Utilizes MATH and CodeAlpaca datasets.
Synthesizes U reasoning trajectories per question using an LLM to increase diversity.
Replaces explicit plan text with learned codebook indices for the final SFT dataset.
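
The last step above, swapping explicit plan text for codebook indices, could look roughly like this when assembling SFT records. The record layout (`prompt` / `completion`), the `<plan_k>` spelling, and the sample data are hypothetical; the source only says that the explicit plan is replaced by learned indices.

```python
from typing import Dict, List, Tuple


def build_sft_example(question: str, plan_indices: List[int], cot: str) -> Dict[str, str]:
    """Build one training record: question, then latent plan tokens, then CoT."""
    latent_plan = "".join(f"<plan_{k}>" for k in plan_indices)
    return {"prompt": question, "completion": f"{latent_plan}\n{cot}"}


def build_dataset(records: List[Tuple[str, List[int], str]]) -> List[Dict[str, str]]:
    """Convert (question, plan indices, CoT) triples into SFT records.

    A question may appear several times, once per synthesized trajectory.
    """
    return [build_sft_example(q, idx, cot) for q, idx, cot in records]


# Toy data, invented for this example: two trajectories for the same question.
records = [
    ("What is 3 * 4?", [5, 1], "Compute 3 * 4 = 12."),
    ("What is 3 * 4?", [5, 2], "Add 4 three times: 4 + 4 + 4 = 12."),
]
dataset = build_dataset(records)
print(dataset[0]["completion"])
```

Keeping multiple trajectories per question in the dataset reflects the diversity step described above.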

Key Hyperparameters:

codebook_size_K: Not explicitly reported in the paper
memory_tokens_L: Not explicitly reported in the paper
learning_rate: Not explicitly reported in the paper

Compute: Reduces token cost by 10% compared to zero-shot CoT prompting (inference efficiency).

Comparison to Prior Work

vs. ReAct/AutoAct/PlaSma: iCLP uses discrete *latent* plans instead of explicit text plans to improve generalization and reduce hallucination.
vs. LaRS: iCLP uses VQ-VAE for discrete plan quantization allowing standard token-based LLM integration, whereas LaRS retrieves latent rationales.
vs. Coconut: iCLP separates planning (latent) and reasoning (language), whereas Coconut conducts the reasoning itself in latent space.
vs. TokenAssorted [not cited in paper]: TokenAssorted abstracts initial steps, whereas iCLP specifically targets the *planning* phase for abstraction.

Limitations

Dependence on a distillation model (DeepSeek-V3) to generate initial explicit plans.
Requires training a separate VQ-VAE module before fine-tuning the LLM.
Specific hyperparameters (codebook size, memory length) are not detailed in the main text.
Effectiveness relies on the quality of the initial explicit plans distilled from CoTs.

Reproducibility

Code: https://github.com/AgenticFinLab/latent-planning

The paper mentions using DeepSeek-V3 for data distillation and Qwen2.5-7B as the base model. Specific hyperparameters such as learning rate, batch size, and codebook dimensions are not explicitly detailed in the text.

📊 Experiments & Results

Evaluation Setup

Evaluation on mathematical reasoning and code generation tasks.

Benchmarks:

MATH (Mathematical Reasoning)
CodeAlpaca (Code Generation / Instruction Tuning)
AIME 2024 (Mathematical Reasoning, OOD)
MATH-500 (Mathematical Reasoning, OOD)
HumanEval (Code Generation, OOD)
MBPP (Code Generation, OOD)

Metrics:

Accuracy (Pass@1)
Token Cost
Statistical methodology: Not explicitly reported in the paper
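
For reference, the two metrics can be computed as in the sketch below. It assumes exact-match scoring for Pass@1 (code benchmarks normally score by running tests instead) and measures token cost as a relative change in total generated tokens; both choices, and all numbers, are illustrative assumptions rather than the paper's evaluation protocol.

```python
from typing import List, Sequence


def pass_at_1(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of questions whose single sampled answer matches the reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


def relative_token_cost_change(baseline_tokens: List[int],
                               method_tokens: List[int]) -> float:
    """Relative change in total generated tokens versus a baseline."""
    baseline_total = sum(baseline_tokens)
    return (sum(method_tokens) - baseline_total) / baseline_total


# Toy numbers, invented for this example; not results from the paper.
accuracy = pass_at_1(["4", "7", "9", "12"], ["4", "8", "9", "12"])
change = relative_token_cost_change([200, 300, 500], [180, 270, 450])
print(f"Pass@1 = {accuracy:.2f}, token cost change = {change:+.0%}")
```

A negative value from `relative_token_cost_change` indicates that the method generates fewer tokens than the baseline.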

Key Results

iCLP demonstrates strong generalization on out-of-domain mathematical and code datasets compared to base models.
Benchmark	Metric	Baseline	This Paper	Δ
AIME 2024 + MATH-500	Accuracy	Not explicitly reported in the paper	Not explicitly reported in the paper	+10% (average)
HumanEval + MBPP	Accuracy	Not explicitly reported in the paper	Not explicitly reported in the paper	+9% (average)
General Reasoning	Token Cost	Not explicitly reported in the paper	Not explicitly reported in the paper	-10%

Experiment Figures

Visualization of plan embeddings and question similarities using t-SNE and heatmaps.

Main Takeaways

iCLP allows small models (Qwen2.5-7B) to compete with RL-based methods (GRPO) using only Supervised Fine-Tuning.
Latent plans generalize better across tasks than explicit plans, shown by strong zero-shot performance on OOD datasets (AIME, HumanEval).
The method improves efficiency by generating compact latent tokens instead of verbose textual plans.
Visualizations (t-SNE) confirm that latent plans cluster by task similarity, indicating the model learns abstract, reusable reasoning strategies.

📚 Prerequisite Knowledge

Prerequisites

Chain of Thought (CoT) prompting
Vector-Quantized Variational Autoencoders (VQ-VAE)
Transformer architecture (Encoder-Decoder)
Supervised Fine-Tuning (SFT)

Key Terms

CoT: Chain of Thought—a technique where LLMs generate intermediate reasoning steps before the final answer.

Latent Plans (LPs): Compact, discrete vector representations of reasoning instructions that exist in a hidden space rather than as natural language text.

VQ-VAE: Vector-Quantized Variational Autoencoder—a neural network that learns to compress high-dimensional data into discrete codebook indices.

Codebook: A fixed set of learned vector representations used in VQ-VAE to approximate continuous embeddings with discrete tokens.

Explicit Plans: Textual instructions describing what step to take next in a reasoning process (e.g., 'Calculate the derivative').

GRPO: Group Relative Policy Optimization—a reinforcement learning algorithm used as a strong baseline in this paper.