
Let's Do a Thought Experiment: Using Counterfactuals to Improve Moral Reasoning

Xiao Ma, Swaroop Mishra, Ahmad Beirami, Alex Beutel, Jilin Chen
Google Research
arXiv (2023)
Reasoning Benchmark

📝 Paper Summary

Tags: Moral Reasoning in LLMs · Prompt Engineering · Chain-of-Thought Reasoning
Thought Experiments prompts language models to explore diverse reasoning paths via counterfactual questions and answers, significantly improving zero-shot moral reasoning over standard Chain-of-Thought.
Core Problem
Language models struggle with moral reasoning tasks (such as MMLU Moral Scenarios) even when using standard reasoning techniques like Chain-of-Thought (CoT), which can actually degrade performance relative to direct answering.
Why it matters:
  • Aligning human values in AI is critical for responsible deployment, yet models perform poorly on socially relevant topics like morality and law
  • Standard linear reasoning paths (CoT) often fail on complex moral tasks that require exploring alternative possibilities
  • MMLU Moral Scenarios is one of the worst-performing tasks for many LLMs, leaving significant headroom for improvement
Concrete Example: In a scenario where a character cuts children's hair, standard reasoning might assume the act is morally neutral. Thought Experiments instead asks counterfactual questions such as 'Was it justified?' and 'Were the children happy?', surfacing potential moral conflicts that a single linear reasoning path misses.
Key Novelty
Thought Experiments Prompting
  • Uses a multi-step prompting framework that mimics human thought experiments by explicitly generating counterfactual questions about a scenario (e.g., 'What if X happened instead?')
  • Forces the model to answer these hypothetical questions to explore 'two sides of the coin' before converging on a final moral judgment
  • Introduces a 'Choose' step where the model selects the best explanation from multiple generated reasoning paths, recognizing that moral situations often have multiple valid interpretations
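The multi-step framework above can be sketched as a chain of prompts. This is a minimal illustrative sketch, not the paper's exact prompts: the step wording and the `call_llm` helper are assumptions standing in for a real LLM API.

```python
# Hypothetical sketch of the Thought Experiments prompting pipeline:
# pose counterfactual questions -> answer them -> summarize into candidate
# explanations -> choose the best -> give a final moral judgment.

def call_llm(prompt: str) -> str:
    """Stand-in for a real LLM call (e.g. an API client); returns a stub reply."""
    return f"[model reply to: {prompt[:40]}...]"

def thought_experiments(scenario: str, num_paths: int = 3) -> str:
    """Run the counterfactual reasoning steps on a moral scenario."""
    # Step 1: pose counterfactual questions about the scenario.
    questions = call_llm(
        f"Scenario: {scenario}\n"
        "Pose counterfactual questions about this scenario "
        "(e.g. 'What if X happened instead?')."
    )
    # Step 2: answer the questions, exploring 'two sides of the coin'.
    answers = call_llm(
        f"Scenario: {scenario}\nQuestions: {questions}\n"
        "Answer each question, considering both sides."
    )
    # Step 3: summarize the exploration into several candidate explanations
    # (in practice these would be sampled with temperature > 0 to differ).
    explanations = [
        call_llm(
            f"Scenario: {scenario}\nCounterfactual answers: {answers}\n"
            "Summarize this into one explanation of the moral judgment."
        )
        for _ in range(num_paths)
    ]
    # Step 4: the 'Choose' step -- select the best explanation.
    best = call_llm(
        "Candidate explanations:\n"
        + "\n".join(f"{i + 1}. {e}" for i, e in enumerate(explanations))
        + "\nChoose the best explanation."
    )
    # Step 5: converge on a final moral judgment from the chosen explanation.
    return call_llm(
        f"Scenario: {scenario}\nExplanation: {best}\n"
        "Final answer: is the action morally wrong or not wrong?"
    )

print(thought_experiments("I cut the children's hair without asking."))
```

Swapping the stub `call_llm` for a real model client turns this into a runnable zero-shot pipeline; the key design point is that each step's output is fed into the next prompt.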
Evaluation Highlights
  • +9.06% to +16.26% accuracy improvement on MMLU Moral Scenarios using zero-shot Thought Experiments compared to direct zero-shot and CoT baselines
  • Standard Zero-shot CoT actually hurts performance (-3.91% vs direct zero-shot), while Thought Experiments reverses this trend
  • Achieves 80.45% accuracy with 5-shot Thought Experiments + self-consistency, the highest performance reported in the paper
Breakthrough Assessment
7/10
Significant improvement on a notoriously difficult task where standard CoT fails. The method is intuitive and effective, though tested on only one model/task so far.