Beyond Chain-of-Thought, Effective Graph-of-Thought Reasoning in Language Models

📝 Paper Summary

Prompting Strategies, Reasoning in LLMs, Multimodal Reasoning

Graph-of-Thought models reasoning as a non-linear graph of connected ideas rather than a linear chain, fusing graph-encoded structural information with text and visual features for improved question answering.

Core Problem

Human thought is often non-linear and jumps between ideas, but current Chain-of-Thought (CoT) approaches force LLMs into strictly sequential reasoning, which loses the structural connections between ideas.

Why it matters:

Sequential chains fail to capture 'leaps of thought' where seemingly unrelated ideas connect to form solutions
Existing methods neglect the complex structural information inherent in reasoning (e.g., multiple premises leading to one conclusion)
Current multimodal approaches often treat reasoning linearly, missing the graph-like nature of human cognition

Concrete Example: In reasoning about an earthquake, a linear chain might say 'Earthquake -> shaking -> ground moves'. A graph approach captures that 'Earthquake' links to 'earth' and 'quake', which implies 'ground' and 'shake' respectively, and these converge to the final concept, modeling the deductive leap.
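
The earthquake example can be written down as a small directed graph. The sketch below is only an illustrative reading of the example: the node label "ground shakes" for the final concept and the helper names are assumptions, not taken from the paper.

```python
# Edges read off the earthquake example; "ground shakes" is an assumed
# label for the final concept that both branches converge on.
edges = [
    ("earthquake", "earth"),
    ("earthquake", "quake"),
    ("earth", "ground"),
    ("quake", "shake"),
    ("ground", "ground shakes"),
    ("shake", "ground shakes"),
]

def build_adjacency(edge_list):
    """Return a directed adjacency list {node: [successors]}."""
    adjacency = {}
    for src, dst in edge_list:
        adjacency.setdefault(src, []).append(dst)
        adjacency.setdefault(dst, [])
    return adjacency

def in_degree(adjacency):
    """Count incoming edges per node."""
    counts = {node: 0 for node in adjacency}
    for targets in adjacency.values():
        for target in targets:
            counts[target] += 1
    return counts

graph = build_adjacency(edges)
# Nodes with more than one incoming edge are where separate branches meet,
# which a single linear chain cannot represent.
converging = [node for node, degree in in_degree(graph).items() if degree > 1]
print(converging)  # ['ground shakes']
```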

Key Novelty

Graph-of-Thought (GoT) Framework

Models thoughts as nodes in a graph (extracted via OpenIE) and connections as edges, rather than a linear sequence
Uses a specialized graph attention network to encode this 'thought graph' alongside standard text and vision encoders
Fuses the graph, text, and visual representations via a gated fusion mechanism to generate rationales and answers

Architecture

Figure: Overview of the Graph-of-Thought framework, showing the two-stage process (Rationale Generation and Answer Generation) and the encoding modules.

Evaluation Highlights

+2.40 percentage points in accuracy over the strong Multimodal-CoT baseline (85.19% to 87.59%) on the ScienceQA test set using T5-base
Achieves 87.59% accuracy on ScienceQA (T5-base), surpassing the prior state-of-the-art
Outperforms ChatGPT by 9.28 percentage points (78.31% vs. 87.59%) on the ScienceQA benchmark

Breakthrough Assessment

7/10

Significant architectural innovation by explicitly encoding reasoning structure as a graph and fusing it with other modalities. Strong empirical results on ScienceQA, though primarily evaluated on T5-based models rather than the largest modern LLMs.

⚙️ Technical Details

Problem Definition

Setting: Multimodal (Text+Image) or Text-only Question Answering with rationale generation

Inputs: Context text, Question, Options, and optionally an Image

Outputs: Natural language rationale and final answer choice

Pipeline Flow

Group: GoT Construction (Text -> Thought Graph)
Group: Encoding (Text/Image/Graph -> Features)
Group: Fusion & Decoding (Features -> Rationale/Answer)

System Modules

GoT Constructor (ECC)

Converts input text into a structured thought graph

Model or implementation: OpenIE (Stanford) + Coreference Resolution (Stanford CoreNLP)
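
A minimal sketch of the graph-construction step, under stated assumptions: the triplets and coreference clusters are hand-written stand-ins for what OpenIE and CoreNLP would return (neither tool is called here), and the choice to make each triplet element a node linked subject-relation-object is an assumption for illustration. The function name `build_thought_graph` is hypothetical.

```python
def build_thought_graph(triplets, coref_clusters):
    """Merge coreferent mentions, then build an undirected adjacency matrix.

    triplets: list of (subject, relation, object) strings.
    coref_clusters: list of mention lists; the first mention is the representative.
    """
    # Map every mention to the representative of its cluster.
    canonical = {}
    for cluster in coref_clusters:
        for mention in cluster:
            canonical[mention] = cluster[0]

    nodes, index = [], {}

    def node_id(text):
        text = canonical.get(text, text)
        if text not in index:
            index[text] = len(nodes)
            nodes.append(text)
        return index[text]

    # Assumed edge scheme: subject -- relation -- object.
    edges = set()
    for subj, rel, obj in triplets:
        s, r, o = node_id(subj), node_id(rel), node_id(obj)
        edges.add((s, r))
        edges.add((r, o))

    n = len(nodes)
    adjacency = [[0] * n for _ in range(n)]
    for i, j in edges:
        adjacency[i][j] = 1
        adjacency[j][i] = 1
    for i in range(n):
        adjacency[i][i] = 1  # self-loops, so every node has a neighbour
    return nodes, adjacency

# Hand-written example input (not real tool output).
nodes, adjacency = build_thought_graph(
    [("the earthquake", "shakes", "the ground"), ("it", "damages", "buildings")],
    [["the earthquake", "it"]],
)
print(nodes)
```

Because "it" is resolved to "the earthquake", both triplets attach to the same node, which is what lets separate sentences share structure in the graph.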

Text Encoder (Encoding)

Encodes the linear input text

Model or implementation: T5 Encoder

Vision Encoder (Encoding)

Encodes input image (if present)

Model or implementation: Off-the-shelf vision feature extractor (e.g., DETR or ResNet, per the Multimodal-CoT baseline)

GoT Encoder (Encoding)

Encodes the constructed thought graph structure

Model or implementation: Graph Attention Network (GAT) with multi-head attention
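
To make the encoder concrete, here is a sketch of one attention head in the standard GAT formulation (score = LeakyReLU(a · [z_i || z_j]), softmax over neighbours, weighted sum), with multi-head outputs concatenated. It is written in pure Python for readability; the paper's exact layer configuration is not given in this summary, so dimensions, the LeakyReLU slope, and all function names are assumptions.

```python
import math

def project(h, W):
    """Multiply each row of h (N x F_in) by W (F_in x F_out)."""
    return [[sum(x * W[k][j] for k, x in enumerate(row)) for j in range(len(W[0]))]
            for row in h]

def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x

def gat_head(h, adjacency, W, a):
    """One GAT head. adjacency must include self-loops; a has length 2 * F_out."""
    z = project(h, W)
    f_out, n = len(W[0]), len(z)
    out, attn = [], []
    for i in range(n):
        neighbours = [j for j in range(n) if adjacency[i][j]]
        # Unnormalised score for each neighbour j of node i.
        scores = [
            leaky_relu(
                sum(a[k] * z[i][k] for k in range(f_out))
                + sum(a[f_out + k] * z[j][k] for k in range(f_out))
            )
            for j in neighbours
        ]
        # Numerically stable softmax restricted to the neighbourhood.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        alpha = [0.0] * n
        for j, e in zip(neighbours, exps):
            alpha[j] = e / total
        attn.append(alpha)
        out.append([sum(alpha[j] * z[j][k] for j in neighbours) for k in range(f_out)])
    return out, attn

def gat_multi_head(h, adjacency, heads):
    """Concatenate the outputs of independent heads; heads is a list of (W, a)."""
    per_head = [gat_head(h, adjacency, W, a)[0] for W, a in heads]
    return [sum((head[i] for head in per_head), []) for i in range(len(h))]

# Three nodes in a path 0 - 1 - 2, with self-loops.
h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
adjacency = [[1, 1, 0], [1, 1, 1], [0, 1, 1]]
heads = [([[1.0, 0.0], [0.0, 1.0]], [0.3, -0.1, 0.2, 0.4]),
         ([[0.5, 0.5], [-0.5, 0.5]], [0.1, 0.1, -0.2, 0.0])]
encoded = gat_multi_head(h, adjacency, heads)
print(len(encoded), len(encoded[0]))  # 3 4
```

Nodes that are not connected receive exactly zero attention, so the graph structure directly constrains which thought units can inform each other.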

Fusion Layer (Fusion & Decoding)

Integrates Text, Vision, and Graph representations

Model or implementation: Gated Fusion Mechanism + Cross-Attention
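
The gating idea can be sketched as follows. This is one plausible form, not the paper's confirmed equation: a single element-wise gate computed from all three sources decides how much of the image and graph features is mixed into the text features. The sketch assumes the image and graph features have already been aligned to the text positions (the role cross-attention plays), and it uses per-dimension scalar weights where a real model would use learned matrices.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_fusion(h_text, h_image, h_graph, w_text, w_image, w_graph):
    """Fuse three aligned feature vectors with one element-wise gate.

    Assumed form (for illustration only):
        gate  = sigmoid(w_text * h_text + w_image * h_image + w_graph * h_graph)
        fused = (1 - gate) * h_text + gate * (h_image + h_graph)
    """
    fused = []
    for t, i, g, wt, wi, wg in zip(h_text, h_image, h_graph, w_text, w_image, w_graph):
        gate = sigmoid(wt * t + wi * i + wg * g)
        fused.append((1.0 - gate) * t + gate * (i + g))
    return fused

# With all gate weights at zero the gate is exactly 0.5 in every dimension.
fused = gated_fusion([2.0, -1.0], [1.0, 0.5], [3.0, 0.0],
                     [0.0, 0.0], [0.0, 0.0], [0.0, 0.0])
print(fused)  # [3.0, -0.25]
```

When the gate saturates near zero the output falls back to the text features alone, which is the behaviour one would want for inputs where the graph or image adds little.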

Decoder (Fusion & Decoding)

Generates the output sequence

Model or implementation: T5 Decoder

Novel Architectural Elements

Integration of a dedicated GoT Encoder (GAT) into the multimodal encoder-decoder pipeline
Two-stage pipeline where the thought graph is constructed dynamically from input text (Stage 1) and input+rationale (Stage 2)
Use of Extract-Cluster-Coreference (ECC) to formalize 'thought units' into a graph structure for LLM processing
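
The two-stage flow described above can be sketched as a small driver function. Everything here is a hypothetical stand-in: `build_graph`, `rationale_model`, and `answer_model` are placeholder callables, and the prompt layout is an assumption rather than the format used in the paper.

```python
def two_stage_inference(question, context, options,
                        build_graph, rationale_model, answer_model, image=None):
    """Run rationale generation, then answer generation on input + rationale."""
    # Stage 1: the thought graph is built from the input text only.
    stage1_text = f"Question: {question}\nContext: {context}\nOptions: {options}"
    stage1_graph = build_graph(stage1_text)
    rationale = rationale_model(stage1_text, image, stage1_graph)

    # Stage 2: the graph is rebuilt from the input text plus the generated rationale.
    stage2_text = f"{stage1_text}\nRationale: {rationale}"
    stage2_graph = build_graph(stage2_text)
    answer = answer_model(stage2_text, image, stage2_graph)
    return rationale, answer

# Stub components, only to show the call pattern.
demo_rationale, demo_answer = two_stage_inference(
    "Which force pulls objects toward Earth?",
    "A ball is dropped from a table.",
    ["(A) gravity", "(B) magnetism"],
    build_graph=lambda text: {"num_chars": len(text)},
    rationale_model=lambda text, image, graph: "Dropped objects fall because of gravity.",
    answer_model=lambda text, image, graph: "(A)" if "gravity" in text else "(B)",
)
print(demo_answer)  # (A)
```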

Modeling

Base Model: FLAN-Alpaca (T5-base and T5-large variants)

Training Method: Fine-tuning on task-specific datasets

Objective Functions:

Purpose: Minimize the difference between generated tokens and ground truth (standard Seq2Seq loss).

Formally: Standard cross-entropy loss for language modeling.
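
Written out, the standard sequence-to-sequence cross-entropy takes the following form. The symbols are assumptions introduced here for illustration: X stands for the combined inputs (text, optional image, and thought graph), y_1, ..., y_T for the target tokens (the rationale in stage 1, the answer in stage 2), and θ for all trainable parameters.

```latex
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta\left(y_t \mid y_{<t},\, X\right)
```

Implementations commonly average this sum over tokens and over the batch; the summary does not state which normalization the paper uses.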

Adaptation: Fine-tuning of the full model including the new GAT and fusion parameters

Trainable Parameters: Not explicitly reported in the paper

Key Hyperparameters:

learning_rate: Not reported in the paper
batch_size: Not reported in the paper

Compute: Not reported in the paper

Comparison to Prior Work

vs. Multimodal-CoT: GoT adds a third modality (Graph) representing structural thought connections, whereas Multimodal-CoT only fuses Text and Vision.
vs. Tree of Thoughts (ToT): ToT uses a tree search for exploration; GoT models the thought process itself as a graph with arbitrary connections (cycles, converging paths) and uses a GAT encoder.
vs. CoT: GoT models non-linear connections rather than a single linear sequence.

Limitations

The ECC graph construction relies on external tools (OpenIE, CoreNLP), which may propagate errors or limit end-to-end differentiability.
Evaluation is limited to T5-based models (base and large); the method is not applied to very large models such as GPT-4 or PaLM, which appear only as baselines.
No computational overhead or latency metrics are reported for the graph construction and encoding steps.

Reproducibility

Code: https://github.com/Zoeyyao27/Graph-of-Thought

The paper reports using standard open-source tools (Stanford OpenIE, CoreNLP) for graph construction and FLAN-Alpaca (T5) as the backbone.

📊 Experiments & Results

Evaluation Setup

Two-stage reasoning (Rationale Generation -> Answer Generation) on ScienceQA and AQUA-RAT datasets.

Benchmarks:

ScienceQA (Multimodal Science Question Answering)
AQUA-RAT (Text-only Math Word Problems)

Metrics:

Accuracy (%)
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline Model	Baseline	This Paper	Δ (points)
ScienceQA	Accuracy (%)	Multimodal-CoT (T5-base)	85.19	87.59	+2.40
ScienceQA	Accuracy (%)	ChatGPT	78.31	87.59	+9.28
ScienceQA	Accuracy (%)	Unspecified	88.40	87.59	-0.81
AQUA-RAT	Accuracy (%)	Not reported in the paper	Not reported in the paper	Not reported in the paper	Not reported in the paper
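
As a quick arithmetic check, the Δ column for the three ScienceQA rows can be recomputed from the reported scores. The differences are in percentage points, not relative percentages; the variable names are illustrative.

```python
got_accuracy = 87.59
baseline_scores = [85.19, 78.31, 88.40]  # ScienceQA baselines, in table order

# Rounding to two decimals removes floating-point noise from the subtraction.
deltas = [round(got_accuracy - score, 2) for score in baseline_scores]
print(deltas)  # [2.4, 9.28, -0.81]
```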

Experiment Figures

A conceptual comparison between Chain-of-Thought (linear) and Graph-of-Thought (non-linear) reasoning.

An example of the deductive reasoning process formulated as a graph.

Main Takeaways

GoT consistently outperforms linear Chain-of-Thought (CoT) and Multimodal-CoT baselines, validating the benefit of modeling thoughts as graphs.
The two-stage framework (Rationale -> Answer) combined with graph encoding allows the model to leverage structural reasoning information effectively.
The approach works for both multimodal (ScienceQA) and text-only (AQUA-RAT) tasks, showing versatility.

📚 Prerequisite Knowledge

Prerequisites

Transformer Architecture (Encoder-Decoder)
Graph Neural Networks (specifically Graph Attention Networks)
Chain-of-Thought (CoT) Prompting
Open Information Extraction (OpenIE)

Key Terms

CoT: Chain-of-Thought—a prompting technique where models generate intermediate reasoning steps before the final answer

GoT: Graph-of-Thought—the proposed method modeling reasoning as a graph of connected thought units

ECC: Extract-Cluster-Coreference—the process used to construct the thought graph from text by extracting triplets, clustering them, and resolving coreferences

GAT: Graph Attention Network—a neural network architecture that operates on graph-structured data, using attention mechanisms to weigh the importance of neighboring nodes

OpenIE: Open Information Extraction—systems that extract structured relational tuples (usually subject-verb-object) from unstructured text

Gated Fusion: A mechanism to combine features from different modalities (text, image, graph) where a learned gate controls how much information flows from each source

T5: Text-to-Text Transfer Transformer—an encoder-decoder language model architecture

Multimodal-CoT: A baseline method that incorporates visual features into the Chain-of-Thought reasoning process