
Let's Do a Thought Experiment: Using Counterfactuals to Improve Moral Reasoning

Xiao Ma, Swaroop Mishra, Ahmad Beirami, Alex Beutel, Jilin Chen
Google Research
arXiv (2023)
Reasoning Benchmark

📝 Paper Summary

Tags: Moral Reasoning in LLMs · Prompt Engineering · Chain-of-Thought Reasoning
Thought Experiments prompts language models to explore diverse reasoning paths via counterfactual questions and answers, significantly improving zero-shot moral reasoning over standard Chain-of-Thought.
Core Problem
Language models struggle with moral reasoning tasks (such as MMLU Moral Scenarios) even when using standard reasoning techniques like Chain-of-Thought (CoT), which can actually degrade performance relative to direct answering.
Why it matters:
  • Aligning human values in AI is critical for responsible deployment, yet models perform poorly on socially relevant topics like morality and law
  • Standard linear reasoning paths (CoT) often fail on complex moral tasks that require exploring alternative possibilities
  • MMLU Moral Scenarios is one of the worst-performing tasks for many LLMs, leaving significant headroom for improvement
Concrete Example: In a scenario where a character cuts children's hair, standard reasoning might assume the act is morally neutral. Thought Experiments instead asks counterfactual questions such as 'Was it justified?' and 'Were the children happy?', surfacing potential moral conflicts that a single linear reasoning path misses.
Key Novelty
Thought Experiments Prompting
  • Uses a multi-step prompting framework that mimics human thought experiments by explicitly generating counterfactual questions about a scenario (e.g., 'What if X happened instead?')
  • Forces the model to answer these hypothetical questions to explore 'two sides of the coin' before converging on a final moral judgment
  • Introduces a 'Choose' step where the model selects the best explanation from multiple generated reasoning paths, recognizing that moral situations often have multiple valid interpretations
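The multi-step framework above can be sketched as a chain of prompts. This is a minimal illustrative sketch, not the paper's exact prompts: the step wording and the `call_llm` helper are assumptions standing in for a real LLM API.

```python
# Hypothetical sketch of the Thought Experiments prompting pipeline:
# pose counterfactual questions -> answer them -> summarize into candidate
# explanations -> choose the best -> give a final moral judgment.

def call_llm(prompt: str) -> str:
    """Stand-in for a real LLM call (e.g. an API client); returns a stub reply."""
    return f"[model reply to: {prompt[:40]}...]"

def thought_experiments(scenario: str, num_paths: int = 3) -> str:
    """Run the counterfactual reasoning steps on a moral scenario."""
    # Step 1: pose counterfactual questions about the scenario.
    questions = call_llm(
        f"Scenario: {scenario}\n"
        "Pose counterfactual questions about this scenario "
        "(e.g. 'What if X happened instead?')."
    )
    # Step 2: answer the questions, exploring 'two sides of the coin'.
    answers = call_llm(
        f"Scenario: {scenario}\nQuestions: {questions}\n"
        "Answer each question, considering both sides."
    )
    # Step 3: summarize the exploration into several candidate explanations
    # (in practice these would be sampled with temperature > 0 to differ).
    explanations = [
        call_llm(
            f"Scenario: {scenario}\nCounterfactual answers: {answers}\n"
            "Summarize this into one explanation of the moral judgment."
        )
        for _ in range(num_paths)
    ]
    # Step 4: the 'Choose' step -- select the best explanation.
    best = call_llm(
        "Candidate explanations:\n"
        + "\n".join(f"{i + 1}. {e}" for i, e in enumerate(explanations))
        + "\nChoose the best explanation."
    )
    # Step 5: converge on a final moral judgment from the chosen explanation.
    return call_llm(
        f"Scenario: {scenario}\nExplanation: {best}\n"
        "Final answer: is the action morally wrong or not wrong?"
    )

print(thought_experiments("I cut the children's hair without asking."))
```

Swapping the stub `call_llm` for a real model client turns this into a runnable zero-shot pipeline; the key design point is that each step's output is fed into the next prompt.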
Evaluation Highlights
  • +9.06% to +16.26% accuracy improvement on MMLU Moral Scenarios using zero-shot Thought Experiments compared to direct zero-shot and CoT baselines
  • Standard Zero-shot CoT actually hurts performance (-3.91% vs direct zero-shot), while Thought Experiments reverses this trend
  • Achieves 80.45% accuracy with 5-shot Thought Experiments + self-consistency, the highest performance reported in the paper
Breakthrough Assessment
7/10
Significant improvement on a notoriously difficult task where standard CoT fails. The method is intuitive and effective, though tested on only one model/task so far.