Learn to Think: Bootstrapping LLM Reasoning Capability Through Graph Learning

📝 Paper Summary

LLM Reasoning Frameworks | Graph-based Prompting | Neuro-symbolic Reasoning

L2T models LLM reasoning as an annotatable graph where a Graph Neural Network dynamically selects reasoning strategies and prompt parameters step-by-step without task-specific engineering.

Core Problem

Existing reasoning frameworks (CoT, ToT, GoT) rely on static, task-specific prompts and predefined structures, preventing models from adapting their strategy in real-time or generalizing across diverse tasks without manual design.

Why it matters:

High reliance on task-specific prompts limits generalization; complex reasoning requires precise, handcrafted prompts that fail when tasks change
Current methods (like Tree of Thoughts) cannot dynamically adjust LLM sampling hyperparameters (e.g., temperature) or branching factors during the reasoning process
Fine-tuning for specific reasoning tasks is computationally expensive and infeasible for API-only models

Concrete Example: In a Game of 24 puzzle, a static Chain-of-Thought prompt can keep the model computing along a single linear path even after it reaches a dead end. L2T detects the dead end (Reasoning Stop label), backtracks, and automatically adjusts the temperature and branching factor to explore new arithmetic combinations.
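
To make the notion of a dead end concrete, the sketch below brute-forces whether a set of remaining numbers can still reach 24. It is only an illustration: in L2T the LLM itself labels a thought as a dead end, and the helper names `can_reach` and `is_dead_end` are hypothetical, not taken from the paper or its code.

```python
from fractions import Fraction
from itertools import combinations


def can_reach(nums, target=Fraction(24)):
    """Return True if the numbers can be combined with + - * / to hit target."""
    if len(nums) == 1:
        return nums[0] == target
    for i, j in combinations(range(len(nums)), 2):
        a, b = nums[i], nums[j]
        rest = [n for k, n in enumerate(nums) if k not in (i, j)]
        # Every value obtainable by combining a and b with one operation.
        candidates = {a + b, a - b, b - a, a * b}
        if b != 0:
            candidates.add(a / b)
        if a != 0:
            candidates.add(b / a)
        if any(can_reach(rest + [c], target) for c in candidates):
            return True
    return False


def is_dead_end(remaining):
    """A state is a dead end when no combination of the remaining numbers gives 24."""
    return not can_reach([Fraction(n) for n in remaining])


# Puzzle 4, 9, 10, 13: the step 13 - 9 = 4 leaves [4, 4, 10], which is still solvable,
# while the step 4 + 9 = 13 leaves [13, 10, 13], which is not.
good_state = is_dead_end([4, 4, 10])
bad_state = is_dead_end([13, 10, 13])
```

Exact fractions are used instead of floats so that divisions such as 8 / 3 do not produce rounding errors in the equality check.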

Key Novelty

L2T (Learn to Think)

Models the reasoning process as a dynamic graph where an LLM classifies nodes (thoughts) to decide whether to continue, stop, backtrack, or report a final result
Uses a trainable GNN 'Actor' to observe the reasoning graph's state and output actions that adjust prompt parameters (e.g., branching factor) and LLM hyperparameters (e.g., temperature) in real-time
Bootstraps itself by automatically generating task formats and evaluation criteria from the task description, removing the need for human-designed prompts

Architecture

The L2T framework pipeline showing the interaction between the LLM, the Reasoning Process Graph, and the GNN-based Reinforcement Learning module.

Evaluation Highlights

Achieved 100% success rate on 3x3 Sudoku and 98.46% on 4x4 Sudoku, outperforming Tree of Thoughts (92.31% / 72.31%)
Surpassed Chain-of-Thought (few-shot) by +50.08 points on Game of 24 (80.42% vs 30.34%)
Maintained high performance (98.46% on 4x4 Sudoku) even when task-specific prompts were removed, whereas ToT dropped to 34.62%

Breakthrough Assessment

8/10

Strong conceptual novelty in using GNNs to control LLM inference parameters dynamically. Significant performance gains on logic puzzles, especially in zero-shot/generalization settings.

⚙️ Technical Details

Problem Definition

Setting: Step-by-step logical reasoning where the process is modeled as a directed graph G=(V,E)

Inputs: Task description and initial problem state

Outputs: Final reasoning path and solution (node labeled as class 3)
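
A minimal sketch of this setting as a data structure, assuming each thought has a single parent (a tree-shaped simplification of the directed graph G=(V,E)). The class and field names are hypothetical; the label numbering follows the four categories used by the Node Classifier.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class NodeLabel(IntEnum):
    STOP = 1
    CONTINUE = 2
    FINAL = 3       # the solution node ("class 3")
    BACKTRACK = 4


@dataclass
class ReasoningGraph:
    thoughts: dict = field(default_factory=dict)  # node id -> thought text
    labels: dict = field(default_factory=dict)    # node id -> NodeLabel
    parent: dict = field(default_factory=dict)    # node id -> parent id (None for root)

    def add_thought(self, node_id, text, parent_id=None):
        self.thoughts[node_id] = text
        self.parent[node_id] = parent_id

    def edges(self):
        """Directed edges (parent, child), i.e. the set E."""
        return [(p, c) for c, p in self.parent.items() if p is not None]

    def solution_path(self):
        """Walk back from the first FINAL node to the root; None if no solution yet."""
        finals = [n for n, lab in self.labels.items() if lab == NodeLabel.FINAL]
        if not finals:
            return None
        path, node = [], finals[0]
        while node is not None:
            path.append(node)
            node = self.parent[node]
        return path[::-1]


graph = ReasoningGraph()
graph.add_thought(0, "numbers: 4, 9, 10, 13")
graph.add_thought(1, "13 - 9 = 4", parent_id=0)
graph.add_thought(2, "4 + 9 = 13", parent_id=0)
graph.add_thought(3, "(10 - 4) * 4 = 24", parent_id=1)
graph.labels.update({0: NodeLabel.CONTINUE, 1: NodeLabel.CONTINUE,
                     2: NodeLabel.STOP, 3: NodeLabel.FINAL})
path = graph.solution_path()
```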

Pipeline Flow

Initialization (Generate format/eval criteria)
Node Classification (LLM decides logic flow)
Mode Selection (GNN adjusts parameters)
Thought Generation (LLM produces next step)
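
The four stages can be sketched as a control loop. This is a simplified illustration under stated assumptions, not the authors' implementation: the four callables stand in for the LLM and GNN calls, the function name is hypothetical, and treating Backtrack as "expand from the parent thought" is a guess at the semantics.

```python
STOP, CONTINUE, FINAL, BACKTRACK = 1, 2, 3, 4


def run_reasoning_loop(task, initialize, classify, select_mode, generate, max_steps=10):
    fmt, criteria, root = initialize(task)            # 1. Initialization
    frontier = [(root, None)]                         # (thought, parent thought)
    history = []                                      # stand-in for the graph state
    for _ in range(max_steps):
        if not frontier:
            break
        next_frontier = []
        for thought, parent in frontier:
            history.append(thought)
            label = classify(thought, criteria)       # 2. Node Classification
            if label == FINAL:
                return thought
            if label == STOP:
                continue                              # abandon this branch
            # Assumed backtrack behavior: expand again from the parent, if any.
            source = parent if (label == BACKTRACK and parent is not None) else thought
            mode = select_mode(history)               # 3. Mode Selection
            children = generate(source, fmt, mode)    # 4. Thought Generation
            next_frontier.extend((child, source) for child in children)
        frontier = next_frontier
    return None


# Toy stand-ins: thoughts are integers and the "task" is to count up to 3.
def toy_initialize(task):
    return "integer", "thought equals 3", 0

def toy_classify(thought, criteria):
    return FINAL if thought == 3 else CONTINUE

def toy_select_mode(history):
    return {"branches": 1, "temperature": 0.7}

def toy_generate(source, fmt, mode):
    return [source + 1] * mode["branches"]

result = run_reasoning_loop("count to 3", toy_initialize, toy_classify,
                            toy_select_mode, toy_generate)
```

Passing the stages in as callables keeps the loop independent of any particular LLM API, which mirrors the separation between the frozen LLM and the trainable controller.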

System Modules

Initializer

Generate initial graph G(1), format constraints X_fmt, and evaluation criteria X_eva from task description

Model or implementation: GPT-4o (shared LLM)

Node Classifier (Reasoning Control)

Classify current thought nodes into 4 categories: (1) Stop, (2) Continue, (3) Final Result, (4) Backtrack

Model or implementation: GPT-4o (shared LLM)

Mode Selector (Reasoning Control)

Encode graph state and output action vector 'a' containing prompt parameters (branching) and LLM hyperparameters (temperature)

Model or implementation: GNN (1-layer GCN + 2-layer MLP)
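
A forward pass through a 1-layer GCN followed by a 2-layer MLP can be sketched in plain Python as follows. Everything beyond the layer counts is an assumption made for illustration: the feature and hidden sizes, the mean pooling over nodes, the omission of bias terms, treating edges as undirected, and the squashing of the two outputs into a temperature in (0, 2) and a branching factor in 1..5. A PPO actor would normally output a distribution over actions rather than these deterministic values.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def relu(m):
    return [[max(0.0, v) for v in row] for row in m]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def normalized_adjacency(n, edges):
    """Symmetric normalization D^-1/2 (A + I) D^-1/2 of a standard GCN layer."""
    a = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for u, v in edges:
        a[u][v] = a[v][u] = 1.0
    deg = [sum(row) for row in a]
    return [[a[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]


def init_params(in_dim, hidden_dim, seed=0):
    rng = random.Random(seed)

    def mat(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

    return {"w_gcn": mat(in_dim, hidden_dim),
            "w1": mat(hidden_dim, hidden_dim),
            "w2": mat(hidden_dim, 2)}


def actor_forward(features, edges, params):
    a_hat = normalized_adjacency(len(features), edges)
    h = relu(matmul(matmul(a_hat, features), params["w_gcn"]))   # 1-layer GCN
    pooled = [[sum(col) / len(h) for col in zip(*h)]]            # mean over nodes
    hidden = relu(matmul(pooled, params["w1"]))                  # MLP layer 1
    out = matmul(hidden, params["w2"])[0]                        # MLP layer 2
    temperature = 2.0 * sigmoid(out[0])                          # assumed range (0, 2)
    branches = 1 + round(4 * sigmoid(out[1]))                    # assumed range 1..5
    return temperature, branches


# Three thought nodes with one-hot placeholder features; node 0 is the root.
features = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
edges = [(0, 1), (0, 2)]
temperature, branches = actor_forward(features, edges, init_params(3, 4))
```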

Thought Generator

Generate next reasoning steps based on classification and selected mode parameters

Model or implementation: GPT-4o (shared LLM)

Novel Architectural Elements

GNN-based controller ('Mode Selector') that sits outside the LLM to dynamically adjust LLM inference parameters (temperature, branches) based on the topology of the reasoning graph
Auto-generated evaluation and formatting prompts (X_eva, X_fmt) extracted from task descriptions to remove human prompt engineering

Modeling

Base Model: GPT-4o (accessed via API)

Training Method: Reinforcement Learning (PPO) on the GNN module only (LLM is frozen)

Objective Functions:

Purpose: Optimize the GNN policy to maximize reasoning success.

Formally: PPO clipped surrogate objective maximizing expected reward.

Purpose: Estimate the value of graph states.

Formally: Critic loss (MSE between predicted value and actual return).
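
For reference, the two objectives in their standard textbook form are written out below. The notation is an assumption, since the summary does not give the paper's symbols: s_t is the reasoning-graph state, a_t the action vector, \hat{A}_t an advantage estimate, R_t the observed return, and \epsilon the clip range (0.2 in the reported hyperparameters).

```latex
% Actor: PPO clipped surrogate objective (maximized)
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)},
\qquad
L^{\text{CLIP}}(\theta) = \mathbb{E}_t\!\left[
  \min\!\Big( r_t(\theta)\,\hat{A}_t,\;
  \operatorname{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t \Big)
\right]

% Critic: mean squared error between predicted value and actual return (minimized)
L^{V}(\phi) = \mathbb{E}_t\!\left[ \big( V_\phi(s_t) - R_t \big)^2 \right]
```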

Key Hyperparameters:

learning_rate: 5e-3
ppo_clip_epsilon: 0.2
gradient_norm: 0.5
path_hyperparameter_beta: 2
epochs: 20
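
The sketch below shows where two of these values would typically enter a PPO update. It assumes that `ppo_clip_epsilon` is the clip range of the surrogate objective and that `gradient_norm` is a maximum gradient norm used for clipping. The second reading is a guess from the name, and both function names are hypothetical.

```python
import math


def clipped_surrogate(ratio, advantage, epsilon=0.2):
    """Per-sample PPO objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - epsilon, min(1.0 + epsilon, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


def clip_grad_norm(grads, max_norm=0.5):
    """Rescale a flat gradient vector so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]


# A ratio above 1 + eps with a positive advantage is capped at (1 + eps) * A.
capped = clipped_surrogate(ratio=1.5, advantage=2.0)
# A gradient of norm 5 is scaled down to norm 0.5.
scaled = clip_grad_norm([3.0, 4.0])
```

Clipping the ratio limits how far one update can move the policy from the one that collected the data, and clipping the gradient norm bounds the step size when rewards are noisy.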

Compute: Not reported in the paper

Comparison to Prior Work

vs. ToT/GoT: L2T uses a learnable GNN controller to adjust search parameters (temperature, branches) dynamically, whereas ToT/GoT use fixed search strategies (BFS/DFS) and static parameters.
vs. Algorithm of Thoughts (AoT): AoT relies on in-context algorithmic examples; L2T generates its own formats and learns reasoning policies via RL.
vs. Auto-CoT: Auto-CoT generates static chains; L2T builds dynamic graphs with real-time feedback.

Limitations

Relies on the capabilities of the underlying LLM (GPT-4o) for node generation and evaluation.
GNN training adds computational overhead compared to simple prompting strategies.
Reinforcement learning component requires task-specific interactions to converge (20 epochs used in experiments).
Experiments are limited to logic puzzles and one creative writing task (Sudoku, Game of 24, TruthQuest, Creative Writing); applicability to broader domains (e.g., coding, math) is less explored.

Reproducibility

Code: https://github.com/zch65458525/L2T

Code is publicly available at https://github.com/zch65458525/L2T. Prompts are detailed in Appendix A.5 of the paper. The method relies on the GPT-4o API, which is closed-source.

📊 Experiments & Results

Evaluation Setup

Logic puzzles and creative generation tasks evaluated using GPT-4o

Benchmarks:

Sudoku: Logic Puzzle (3x3, 4x4, 5x5)
Game of 24: Arithmetic Reasoning
TruthQuest: Logical Deduction (Knights and Knaves)
Creative Writing: Constrained Text Generation [New]

Metrics:

Success Rate (Accuracy)
Token Usage
LLM Access Counts
Statistical methodology: Reported mean and standard deviation across trials

Key Results

TSP = task-specific prompts. Success rates and accuracy are percentages; Δ is This Paper minus Baseline.

Performance on Sudoku puzzles shows L2T achieving near-perfect results on smaller grids and maintaining high performance on larger grids where baselines falter.
Benchmark	Metric	Baseline	This Paper	Δ
4x4 Sudoku	Success Rate (%)	72.31	98.46	+26.15
4x4 Sudoku w/o TSP	Success Rate (%)	34.62	98.46	+63.84

Results on Game of 24 demonstrate L2T's ability to handle arithmetic search spaces better than tree or graph baselines.
Benchmark	Metric	Baseline	This Paper	Δ
Game of 24	Success Rate (%)	74.23	80.42	+6.19
Game of 24 w/o TSP	Success Rate (%)	27.54	80.42	+52.88

Benchmark	Metric	Baseline	This Paper	Δ
Average across tasks	Tokens per Case	11600	4680	-6920
Game of 24	Accuracy (%)	77.45	80.42	+2.97
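
The Δ column is the L2T value minus the baseline value. The short check below recomputes it from the reported numbers and also derives the relative token reduction, which is a derived figure rather than one reported in the table. Row labels are copied from the table; the variable names are illustrative.

```python
# (benchmark, baseline, this paper) as reported in the Key Results table.
rows = [
    ("4x4 Sudoku", 72.31, 98.46),
    ("4x4 Sudoku w/o TSP", 34.62, 98.46),
    ("Game of 24", 74.23, 80.42),
    ("Game of 24 w/o TSP", 27.54, 80.42),
    ("Game of 24 (Accuracy)", 77.45, 80.42),
]

# Absolute difference, rounded to the two decimals used in the table.
deltas = {name: round(ours - baseline, 2) for name, baseline, ours in rows}

# Token usage: absolute and relative reduction per case.
baseline_tokens, l2t_tokens = 11600, 4680
token_delta = l2t_tokens - baseline_tokens
token_reduction_pct = round(100 * (baseline_tokens - l2t_tokens) / baseline_tokens, 1)

for name, delta in deltas.items():
    print(f"{name}: {delta:+.2f} points")
print(f"Tokens per case: {token_delta} ({token_reduction_pct}% fewer)")
```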

Experiment Figures

Plots of Temperature and Top-p values chosen by the GNN across reasoning steps for two different tasks.

Main Takeaways

L2T consistently outperforms CoT, ToT, and GoT across logic and creative tasks, with the margin widening significantly when task-specific prompts are removed.
The GNN-based mode selector learns distinct strategies for different tasks (e.g., pairing high temperature with low top-p on one task, while temperature and top-p are directly correlated on another).
L2T is more token-efficient than ToT and GoT because the learned policy reduces unnecessary exploration, so fewer nodes are generated.
The framework successfully automates the prompt engineering process, generating its own formats and evaluation criteria.

📚 Prerequisite Knowledge

Prerequisites

Graph Neural Networks (GNNs)
Reinforcement Learning (Actor-Critic, PPO)
Prompt Engineering (Chain/Tree/Graph of Thoughts)
Large Language Models (LLMs)

Key Terms

Reasoning Process Graph: A graph where nodes are LLM thoughts and edges are dependencies; nodes are classified to determine if reasoning should continue, stop, or backtrack

GNN: Graph Neural Network, a deep learning model that processes graph-structured data to produce node embeddings

PPO: Proximal Policy Optimization, a reinforcement learning algorithm used here to train the GNN to select better reasoning modes

Actor-Critic: An RL architecture where the Actor decides actions (reasoning parameters) and the Critic estimates the value of the current graph state

Chain-of-Thought (CoT): A prompting technique where the model generates intermediate reasoning steps

Tree of Thoughts (ToT): A framework allowing LLMs to explore multiple reasoning paths in a tree structure

Graph of Thoughts (GoT): A framework modeling reasoning as an arbitrary graph

Temperature: An LLM hyperparameter controlling randomness in generation

Branching factor: The number of new thought nodes generated from a single parent node
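
As a small illustration of the Temperature term, the sketch below applies temperature scaling to a softmax over toy logits. This is the generic textbook formulation, not a description of how GPT-4o applies the parameter internally, and the logits are made up.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities; lower temperature sharpens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                        # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]
sharp = softmax_with_temperature(logits, 0.5)   # concentrates mass on the top logit
flat = softmax_with_temperature(logits, 2.0)    # spreads mass more evenly
```

A controller that raises the temperature after a dead end makes the next batch of thoughts more varied, while lowering it favors the most likely continuation.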