Chain of Code: Reasoning with a Language Model-Augmented Code Emulator

Chengshu Li, Jacky Liang, Andy Zeng, Xinyun Chen, Karol Hausman, Dorsa Sadigh, Sergey Levine, Li Fei-Fei, Fei Xia, Brian Ichter
Google DeepMind, Stanford University, University of California, Berkeley
arXiv (2023)
Reasoning · Agent · Benchmark

📝 Paper Summary

LLM Reasoning · Code Generation · Tool Use
Chain of Code improves LLM reasoning by having the model write programs in which algorithmic steps are executed by a real interpreter and semantic steps are simulated by the LLM itself.
Core Problem
Chain of Thought struggles with precise calculations, while Program of Thoughts fails on semantic tasks that cannot be easily expressed in executable code (e.g., detecting sarcasm).
Why it matters:
  • Many real-world problems require mixing algorithmic precision (arithmetic, sorting) with semantic understanding (common sense, linguistics).
  • Forcing all reasoning into code restricts an LLM to questions expressible with existing executable APIs.
  • Forcing all reasoning into natural language causes hallucinations on complex computations or logic tracking.
Concrete Example: Task: 'Count how many times the person was sarcastic.' A pure code approach fails because `is_sarcastic()` isn't a defined Python function. Chain of Code lets the LLM write the call anyway, catch the resulting execution error, and simulate the return value (e.g., `True`) itself.
Key Novelty
LMulator (Language Model-augmented Code Emulator)
  • Treats the LLM as a fallback interpreter: when Python code raises an exception (e.g., undefined function `detect_sarcasm`), the runtime hands the state to the LLM.
  • The LLM simulates the output of that specific line of code based on context, updates the program state, and hands control back to the Python interpreter.
  • Allows reasoning traces to freely interweave precise executable logic (loops, math) with 'pseudocode' semantic calls.
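The fallback mechanism described above can be sketched in plain Python. This is a toy illustration under stated assumptions, not the paper's implementation: `LMState` stands in for the shared program state, and `mock_llm` is a hard-coded, hypothetical stand-in for the LLM's semantic simulation of a call like `is_sarcastic`.

```python
def mock_llm(fn_name, args):
    """Hypothetical stand-in for the LLM simulating one semantic call."""
    if fn_name == "is_sarcastic":
        text = args[0].lower()
        return "oh great" in text or "worked so well" in text
    raise NotImplementedError(fn_name)

class LMState(dict):
    """Program state whose failed name lookups are routed to the mock LLM,
    mimicking the runtime handing control to the language model."""
    def __missing__(self, name):
        if name.startswith("__"):  # let Python internals fail normally
            raise KeyError(name)
        return lambda *args: mock_llm(name, args)

# The kind of program Chain of Code would have the LLM write: the loop and
# counter are executed exactly by Python; is_sarcastic() is simulated.
program = """
utterances = [
    "Oh great, another Monday.",
    "The meeting starts at 3pm.",
    "Sure, because that worked so well last time.",
]
count = 0
for u in utterances:
    if is_sarcastic(u):   # undefined in Python -> simulated by the "LLM"
        count += 1
"""

state = LMState()
exec(program, state)
print(state["count"])  # -> 2
```

The design choice mirrors the paper's control flow: Python stays authoritative for anything it can execute (the loop, the counter), and only the unresolvable semantic call is delegated, after which control returns to the interpreter with the simulated value in place.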
Evaluation Highlights
  • Achieves 84% on BIG-Bench Hard, outperforming Chain of Thought (72%) and setting a new state of the art.
  • Outperforms the best human raters on the algorithmic subset of BIG-Bench Hard.
  • Scales better than Chain of Thought: CoC improves performance even on smaller models (e.g., text-ada-001) where CoT often fails.
Breakthrough Assessment
9/10
Simple yet profound architectural shift. By legalizing 'broken' code via LLM simulation, it unifies the two dominant reasoning paradigms (natural language vs. code) into a single executable flow.