PKRD-CoT: A Unified Chain-of-thought Prompting for Multi-Modal Large Language Models in Autonomous Driving

📝 Paper Summary

Autonomous Driving · Prompt Engineering · Multi-Modal Large Language Models (MLLMs)

PKRD-CoT is a zero-shot prompt framework that guides Multi-Modal Large Language Models through perception, knowledge, reasoning, and decision-making steps to improve autonomous driving performance without training.

Core Problem

Training end-to-end autonomous driving models is costly and complex, while standard data-driven approaches suffer from poor generalization and lack of interpretability.

Why it matters:

Fine-tuning large MLLMs (Multi-Modal Large Language Models) requires substantial computational resources and data
Existing data-driven driving models often struggle with long-tail scenarios (rare events) and lack the reasoning transparency of human drivers
Generic prompting techniques fail to leverage the specific cognitive steps (perception, knowledge, reasoning) required for safe driving

Concrete Example: In a scenario where a car should maintain speed, a standard zero-shot prompt might incorrectly advise stopping due to a lack of spatial reasoning, whereas PKRD-CoT correctly identifies the safe distance and decides to maintain speed.

Key Novelty

PKRD-CoT (Perception, Knowledge, Reasoning, and Decision-making Chain-of-Thought)

Structured zero-shot prompt framework that forces the MLLM to mimic human driving cognition in four explicit steps
Integrates a 'Memory' module within the prompt to store environmental context in structured JSON format, mitigating context loss in language models
Uses a knowledge-driven approach to interpret traffic scenarios (e.g., red light -> stop) without requiring fine-tuning on driving datasets

Architecture

The PKRD-CoT framework structure mapping autonomous driving capabilities to prompt steps

Evaluation Highlights

PKRD-CoT improves decision-making accuracy by 22% compared to standard zero-shot prompts in ablation studies
GPT-4.0 achieves 100% accuracy in mathematical reasoning tasks (calculating vehicle distances) using the framework
Claude and LLaVA1.6 achieve high average perceptual accuracies of 94% and 92%, respectively, on autonomous driving tasks

Breakthrough Assessment

7/10

A strong application of Chain-of-Thought to a specific domain (autonomous driving) with clear performance gains. While not a new model architecture, it effectively bridges MLLMs and control tasks via structured prompting.

⚙️ Technical Details

Problem Definition

Setting: End-to-end autonomous driving decision-making using pre-trained MLLMs via prompt engineering

Inputs: Panoramic view of the car's environment (merged from 6 cameras) + PKRD-CoT text prompt

Outputs: Driving decision (speed up, speed down, stop, maintain speed, or change lane) + reasoning analysis
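
Because the output is free text that must end in one of five actions, a downstream consumer needs to map the reply onto a fixed action set. The sketch below shows one way to do that. The `Decision:` line format, the `Action` enum, and the exact label strings are assumptions made for illustration and are not taken from the paper.

```python
import re
from enum import Enum
from typing import Optional


class Action(Enum):
    # Label strings are illustrative stand-ins for the five driving actions.
    SPEED_UP = "speed up"
    SPEED_DOWN = "speed down"
    STOP = "stop"
    MAINTAIN = "maintain speed"
    CHANGE_LANE = "change lane"


def parse_decision(reply: str) -> Optional[Action]:
    """Return the action named on the first 'Decision:' line, or None.

    Assumes (hypothetically) that the model was asked to finish its reply
    with a line of the form 'Decision: <action>'.
    """
    match = re.search(r"^\s*decision\s*:\s*(.+)$", reply, re.IGNORECASE | re.MULTILINE)
    if match is None:
        return None
    text = match.group(1).strip().lower()
    for action in Action:
        if action.value in text:
            return action
    return None  # a Decision line exists but names no known action
```

Returning `None` rather than guessing lets the caller fall back to a safe default when the reply does not name a recognized action.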

Pipeline Flow

Image Preprocessing (Merge 6 cameras into panoramic view)
Prompt Injection (PKRD-CoT template)
Inference Step 1: Observation (Perception)
Inference Step 2: Identification (Knowledge)
Inference Step 3: Memory (Context Storage)
Inference Step 4: Decision (Action Selection)
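
The four inference steps above can be sketched as a single prompt template that lists the stages in order. The instruction wording below is invented for illustration and is not the paper's actual prompt; only the step names and their order come from the pipeline description.

```python
# Step names follow the pipeline above; the instruction text is hypothetical.
PKRD_STEPS = [
    ("Observation", "Describe the vehicles, pedestrians, traffic lights and lanes in the panoramic image."),
    ("Identification", "State which traffic rules or driving knowledge apply to what you observed."),
    ("Memory", "Record the scene as a JSON object so that later steps can refer to it."),
    ("Decision", "Choose one action (speed up, speed down, stop, maintain speed, change lane) and explain why."),
]


def build_prompt(steps=PKRD_STEPS) -> str:
    """Assemble a zero-shot prompt that walks the model through each step in order."""
    lines = ["You are assisting with a driving task. Work through the steps in order."]
    for number, (name, instruction) in enumerate(steps, start=1):
        lines.append(f"Step {number} ({name}): {instruction}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_prompt())
```

Keeping the steps in a list makes it easy to drop or reorder a stage, which is the kind of variation an ablation study would need.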

System Modules

Image Preprocessor

Merge images from six cameras (front, front-left, front-right, back, back-left, back-right) into a unified front and back panoramic view

Model or implementation: Image processing script
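
A minimal sketch of the merging step, assuming one panorama per side built by placing three equally tall camera tiles next to each other. Images are modeled as plain lists of pixel rows so the example runs without an imaging library; the camera names, the left-to-right ordering, and both function names are assumptions, and the paper's script may work differently.

```python
from typing import Dict, List

Image = List[List[int]]  # rows of pixel values; a stand-in for a real image array

FRONT_ORDER = ["front_left", "front", "front_right"]  # assumed naming and order
BACK_ORDER = ["back_left", "back", "back_right"]      # assumed naming and order


def hstack(tiles: List[Image]) -> Image:
    """Concatenate equally tall images side by side, row by row."""
    height = len(tiles[0])
    if any(len(tile) != height for tile in tiles):
        raise ValueError("all tiles must have the same height")
    return [[px for tile in tiles for px in tile[row]] for row in range(height)]


def merge_views(cameras: Dict[str, Image]) -> Dict[str, Image]:
    """Build a front and a back panorama from six named camera images."""
    return {
        "front": hstack([cameras[name] for name in FRONT_ORDER]),
        "back": hstack([cameras[name] for name in BACK_ORDER]),
    }
```

With real images the same layout logic applies; only `hstack` would be swapped for the concatenation routine of whichever imaging library is in use.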

PKRD-CoT Prompt Engine

Guide the MLLM through the 4-step reasoning process (Observation, Identification, Memory, Decision)

Model or implementation: Prompt Template

MLLM Backbone

Execute the chain-of-thought to analyze the scene and output a driving decision

Model or implementation: Various (GPT-4.0, Claude, LLaVA1.6, etc.)
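
Since the backbone is interchangeable, the rest of the pipeline only needs a function that takes an image and a prompt and returns text. The sketch below illustrates that idea with stub models; the `Backbone` type, `compare_backbones`, and the stub names are hypothetical, and no real vendor API is called.

```python
from typing import Callable, Dict

# Hypothetical interface: (image bytes, prompt text) -> reply text.
Backbone = Callable[[bytes, str], str]


def compare_backbones(backbones: Dict[str, Backbone], image: bytes, prompt: str) -> Dict[str, str]:
    """Send the same image and prompt to every backbone and collect the replies."""
    return {name: model(image, prompt) for name, model in backbones.items()}


# Stub backbones standing in for real models, for demonstration only.
def cautious_stub(image: bytes, prompt: str) -> str:
    return "Decision: stop"


def steady_stub(image: bytes, prompt: str) -> str:
    return "Decision: maintain speed"


if __name__ == "__main__":
    replies = compare_backbones({"a": cautious_stub, "b": steady_stub}, b"", "prompt text")
    print(replies)
```

In practice each stub would be replaced by a thin wrapper around the corresponding model's client, leaving the comparison code unchanged.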

Novel Architectural Elements

Explicit mapping of autonomous driving capabilities (Perception, Knowledge, Reasoning, Decision) to Chain-of-Thought prompt stages
Integration of a 'Memory' prompt step that forces the model to output environmental understanding in a structured JSON format to maintain context
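
To make the 'Memory' idea concrete, the sketch below parses a JSON scene record from a model reply, checks that it contains the expected fields, and serializes it for reuse in a later prompt. The field names (`ego_lane`, `traffic_light`, `vehicles`) are a hypothetical schema chosen for illustration; the paper's actual JSON layout is not given in this summary.

```python
import json

# Hypothetical schema: these keys are assumptions, not the paper's format.
REQUIRED_KEYS = {"ego_lane", "traffic_light", "vehicles"}


def parse_memory(reply_json: str) -> dict:
    """Parse the model's JSON scene record and check the expected keys are present."""
    memory = json.loads(reply_json)
    if not isinstance(memory, dict):
        raise ValueError("memory must be a JSON object")
    missing = REQUIRED_KEYS - memory.keys()
    if missing:
        raise ValueError(f"memory is missing keys: {sorted(missing)}")
    return memory


def memory_block(memory: dict) -> str:
    """Serialize the record so it can be appended to the next prompt."""
    return "Memory:\n" + json.dumps(memory, indent=2, sort_keys=True)
```

Validating the record before reuse catches replies in which the model drifted from the requested structure, which would otherwise silently corrupt the stored context.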

Modeling

Base Model: Six pre-trained MLLMs evaluated (GPT-4.0, Claude, LLaVA1.6, Qwen-VL-Plus, CogVLM chat, MiniGPT-4)

Training Method: Zero-shot Prompt Engineering (Inference only)

Adaptation: None (Prompting only)

Trainable Parameters: 0

Compute: Not reported in the paper (Inference-only study)

Comparison to Prior Work

vs. DriveMLM: PKRD-CoT focuses on zero-shot prompting without the heavy training/fine-tuning pipeline of DriveMLM
vs. Standard Zero-Shot: Adds structured 'Memory' and domain-specific steps (Perception/Knowledge) rather than generic 'think step by step'
vs. Role-Playing Prompts: Outperforms role-playing by forcing explicit reasoning steps rather than just adopting a persona

Limitations

Memory module is limited to the context window of the prompt; not long-term storage
MiniGPT-4 performs poorly on mathematical reasoning tasks (0% accuracy)
Traffic light recognition varies significantly across open-source models due to lighting conditions
Reliance on closed-source models (GPT-4) for best performance limits accessibility

📊 Experiments & Results

Evaluation Setup

Evaluation on a subset of the NuScenes dataset (real-world) and a highway simulation. Tasks include object detection, scene understanding, and driving decision-making.

Benchmarks:

NuScenes Dataset (Autonomous Driving Perception and Decision Making)

Metrics:

Perceptual Accuracy (correct identification of target object categories)
Decision Accuracy (correct driving action selection)
Mathematical Accuracy (correct calculation of vehicle distance)
Statistical methodology: Not explicitly reported in the paper
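
The summary does not give formulas for these metrics. A minimal sketch, assuming each accuracy is simply the percentage of samples on which the model's answer matches the reference label, is shown below; the function name and the toy labels are illustrative only and are not results from the paper.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], labels: Sequence[str]) -> float:
    """Percentage of predictions that exactly match their reference label."""
    if not labels or len(predictions) != len(labels):
        raise ValueError("predictions and labels must be non-empty and equally long")
    correct = sum(pred == ref for pred, ref in zip(predictions, labels))
    return 100.0 * correct / len(labels)


if __name__ == "__main__":
    # Toy data for demonstration; not taken from the evaluation.
    preds = ["stop", "maintain speed", "stop", "stop"]
    refs = ["stop", "maintain speed", "maintain speed", "stop"]
    print(accuracy(preds, refs))  # 3 of 4 correct
```

The same function covers perceptual, decision, and mathematical accuracy as long as answers are first normalized to comparable labels.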

Key Results

Benchmark	Metric	Baseline (%)	This Paper (%)	Δ (pp)
NuScenes	Average Perceptual Accuracy	94	92	-2
NuScenes	Car Recognition Accuracy	100	100	0
NuScenes	Mathematical Accuracy	0	100	+100
NuScenes	Mathematical Accuracy	100	100	0

Experiment Figures

Comparison of decision-making outputs between Zero-shot, Role-playing, and PKRD-CoT prompts

Main Takeaways

PKRD-CoT significantly improves decision-making accuracy (22% over zero-shot, 6% over role-playing), validating the importance of structured reasoning in autonomous driving tasks
GPT-4.0 demonstrates the most robust performance across all dimensions, particularly in mathematical reasoning where smaller models like MiniGPT-4 fail completely
Open-source models like Qwen-VL-Plus show competitive performance in perception and reasoning, though some struggle with specific targets like traffic lights
The 'Knowledge' component allows models to infer actions from traffic signals and signs (e.g., Red Light implies Stop) without explicit training, mimicking human driver logic
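
In PKRD-CoT the model supplies this knowledge itself in response to the prompt. The sketch below only makes the implied mapping explicit as a lookup table, which can be handy for checking a model's answer. Only the red-light rule comes from the summary; the other rules, the priority order, and the function name are assumptions.

```python
from typing import Iterable, Tuple

Observation = Tuple[str, str]  # (object kind, state), e.g. ("traffic_light", "red")

# Only the red-light rule is from the summary; the rest are illustrative.
RULES = {
    ("traffic_light", "red"): "stop",
    ("traffic_light", "yellow"): "speed down",
    ("sign", "stop"): "stop",
}

PRIORITY = ["stop", "speed down"]  # most restrictive action first (assumed)


def infer_action(observations: Iterable[Observation], default: str = "maintain speed") -> str:
    """Return the most restrictive action triggered by the observations."""
    triggered = {RULES[obs] for obs in observations if obs in RULES}
    for action in PRIORITY:
        if action in triggered:
            return action
    return default  # nothing in the scene calls for a change
```

When several rules fire at once, the priority list resolves the conflict in favor of the more cautious action.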

📚 Prerequisite Knowledge

Prerequisites

Understanding of Multi-Modal Large Language Models (MLLMs)
Basics of autonomous driving stacks (Perception -> Planning -> Control)
Chain-of-Thought (CoT) prompting techniques

Key Terms

PKRD-CoT: Perception, Knowledge, Reasoning, and Decision-making Chain-of-Thought—the proposed prompt framework mimicking human driving steps

Zero-shot-CoT: A prompting technique where the model is asked to 'think step by step' without being shown any worked examples in the prompt

MLLM: Multi-Modal Large Language Model; a model that takes both images and text as input and typically responds in text

NuScenes: A large-scale public dataset for autonomous driving featuring diverse urban driving scenarios

Perceptual Accuracy: A metric measuring the model's ability to correctly identify and describe target objects (cars, pedestrians, traffic lights) in the scene

Pythagorean theorem: Used here to calculate the distance between vehicles based on pixel or spatial coordinates to test mathematical reasoning
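
As a worked instance of the distance calculation, the sketch below applies the Pythagorean theorem to two vehicle positions given as planar (x, y) coordinates. The coordinate convention and the function name are assumptions; the summary does not specify how positions are encoded in the test.

```python
import math
from typing import Tuple

Point = Tuple[float, float]  # assumed planar (x, y) position, any consistent unit


def vehicle_distance(ego: Point, other: Point) -> float:
    """Straight-line distance between two vehicles: sqrt(dx^2 + dy^2)."""
    dx = other[0] - ego[0]
    dy = other[1] - ego[1]
    return math.hypot(dx, dy)


if __name__ == "__main__":
    # A 3-4-5 right triangle: 3 units sideways, 4 units ahead.
    print(vehicle_distance((0.0, 0.0), (3.0, 4.0)))  # 5.0
```

A reference implementation like this gives the ground truth against which a model's computed distance can be scored.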