
Structured Chain-of-Thought Prompting for Code Generation

Jia Li, Ge Li, Yongming Li, Zhi Jin
Peking University
ACM Transactions on Software Engineering and Methodology (2023)
Reasoning Benchmark

📝 Paper Summary

Prompt Engineering · Code Generation
SCoT prompting improves code generation by constraining LLMs to format intermediate reasoning steps using explicit program structures like loops and branches rather than unstructured natural language.
Core Problem
Standard Chain-of-Thought (CoT) prompting generates linear natural language reasoning, which fails to capture the non-linear, nested structural logic essential for valid source code.
Why it matters:
  • CoT is State-of-the-Art (SOTA) for reasoning but provides only marginal gains in code generation (e.g., +0.82% on HumanEval for ChatGPT)
  • Source code is inherently structured (syntax, nesting); reasoning about it in flat natural language creates ambiguity about variable scopes and control flow
Concrete Example: When asked to find the maximum number in a list, standard CoT might say 'Iterate through the list... Iterate through the list of lists', leaving loop scopes ambiguous. SCoT explicitly writes 'for _list in lists: for element in _list:', clarifying the nesting before coding.
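An illustrative sketch of this example (function name and skeleton wording are ours, not the paper's verbatim prompt): the SCoT skeleton states the loop nesting explicitly, and the final code follows it line by line.

```python
# Task: return the maximum element across a list of lists.
#
# Flat CoT (ambiguous): "Iterate through the list... Iterate through
# the list of lists" -- which loop sits inside which is unclear.
#
# SCoT skeleton (nesting made explicit before any code is written):
#   Input:  lists: List[List[int]]
#   Output: ans: int
#   set ans to negative infinity
#   for _list in lists:
#       for element in _list:
#           if element > ans: update ans
#
# The skeleton maps directly onto the implementation:
def max_in_lists(lists):
    ans = float("-inf")
    for _list in lists:          # outer loop over sub-lists
        for element in _list:    # inner loop over elements
            if element > ans:    # branch: keep the running maximum
                ans = element
    return ans

print(max_in_lists([[3, 1], [4, 1, 5]]))  # → 5
```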
Key Novelty
Structured Chain-of-Thought (SCoT)
  • Replaces free-form natural language reasoning with a structured chain of thought built from three basic program structures: Sequence, Branch (if/else), and Loop (for/while)
  • Requires the LLM to explicitly define an Input-Output structure (types/names) before starting the reasoning process
  • Treats the reasoning step as a structural skeleton (abstract syntax) that maps directly to the final code implementation
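A minimal sketch of how such a prompt might be assembled (the template wording and helper name are assumptions for illustration, not the paper's exact prompt): the instruction asks for an input-output structure first, then a solving process restricted to the three basic structures.

```python
# Hypothetical helper that builds an SCoT-style prompt; the exact
# wording is an assumption, not the paper's verbatim template.
def build_scot_prompt(requirement: str) -> str:
    return (
        "First, write an input-output structure (names and types).\n"
        "Then write a rough solving process using only three basic\n"
        "program structures: sequence, branch (if/else), and loop\n"
        "(for/while). Finally, implement the code from that skeleton.\n\n"
        f"Requirement: {requirement}\n"
        "Input: ...\n"
        "Output: ...\n"
        "Solving process:\n"
    )

print(build_scot_prompt("Return the maximum element across a list of lists."))
```

In a few-shot setting, examples of (requirement, SCoT, code) triples would precede this instruction so the model imitates the structured format.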
Evaluation Highlights
  • Outperforms standard CoT prompting by +7.35 absolute percentage points (+13.79% relative) on HumanEval Pass@1 using ChatGPT
  • Achieves higher code correctness than zero-shot, few-shot, and CoT baselines across three benchmarks (HumanEval, MBPP, MBCPP) and two models (ChatGPT, Codex)
  • Human evaluators rated SCoT-generated code 15.27% higher in correctness and 15.90% higher in maintainability compared to standard CoT
Breakthrough Assessment
7/10
Simple but highly effective prompting strategy that aligns LLM reasoning with the target domain (code). Significant gains without training, though heavily reliant on prompt engineering.