Evaluation Setup
Zero-shot and few-shot prompting on the Moral Scenarios subtask of MMLU
Benchmarks:
- MMLU, Moral Scenarios subtask (multiple-choice moral judgment)
Metrics:
- Accuracy
- Statistical methodology: Not explicitly reported in the paper
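For concreteness, here is a minimal sketch of what the accuracy computation in this setup might look like, assuming the standard MMLU four-option (A–D) multiple-choice format and a hypothetical `query_model` client; this is not the paper's actual evaluation harness:

```python
import re
from typing import Callable

# Hypothetical model client: takes a prompt string, returns generated text.
QueryFn = Callable[[str], str]

def format_question(question: str, choices: list[str]) -> str:
    """Render an MMLU-style multiple-choice question with A-D options."""
    letters = "ABCD"
    options = "\n".join(f"{letters[i]}. {c}" for i, c in enumerate(choices))
    return f"{question}\n{options}\nAnswer:"

def extract_choice(completion: str) -> str | None:
    """Pull the first standalone A-D letter out of the model's output."""
    match = re.search(r"\b([ABCD])\b", completion)
    return match.group(1) if match else None

def accuracy(examples: list[dict], query_model: QueryFn) -> float:
    """Fraction of questions whose extracted letter matches the gold answer."""
    correct = 0
    for ex in examples:  # each ex: {"question": str, "choices": [str], "answer": "A".."D"}
        pred = extract_choice(query_model(format_question(ex["question"], ex["choices"])))
        correct += int(pred == ex["answer"])
    return correct / len(examples)
```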
Key Results
Zero-shot experiments demonstrate that standard CoT hurts performance on this task, while Thought Experiments provides significant gains.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| MMLU Moral Scenarios | Accuracy (%) | 57.09 | 66.15 | +9.06 |
| MMLU Moral Scenarios | Accuracy (%) | 53.18 | 66.15 | +12.97 |
| MMLU Moral Scenarios | Accuracy (%) | 50.00 | 66.26 | +16.26 |
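To make the zero-shot comparison concrete, here is a sketch of how the two prompt styles might differ. The wording is an assumption rather than a quote from the paper, and it compresses the Thought Experiments procedure, which the takeaways below describe as eliciting counterfactuals, into a single-turn template:

```python
# Illustrative prompt templates only; the exact instructions used in the
# paper are assumptions here, not quoted from it.

ZERO_SHOT_COT = (
    "{question}\n{options}\n"
    "Let's think step by step."
)

THOUGHT_EXPERIMENT = (
    "{question}\n{options}\n"
    "Let's do a thought experiment: first propose counterfactual variations "
    "of each scenario, consider how the moral judgment would change under "
    "each variation, and then answer the original question."
)

def build_prompt(template: str, question: str, options: str) -> str:
    """Fill either template with a formatted question and its A-D options."""
    return template.format(question=question, options=options)
```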
Few-shot experiments show that human demonstrations further improve performance, though the gap between CoT and Thought Experiments narrows.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| MMLU Moral Scenarios | Accuracy (%) | 78.55 | 80.45 | +1.90 |
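For illustration, a sketch of how the few-shot condition could be assembled, assuming the five human-written demonstrations (with their counterfactual reasoning traces) are simply prepended to the test question; the demonstration format is an assumption, and the paper's actual exemplars are not reproduced here:

```python
def build_few_shot_prompt(demos: list[dict], question: str, options: str) -> str:
    """Prepend k human-written demonstrations to the test question.

    Each demo is assumed to look like
    {"question": str, "options": str, "reasoning": str, "answer": str},
    where "reasoning" holds the manually crafted counterfactual trace.
    """
    blocks = [
        f"{d['question']}\n{d['options']}\n{d['reasoning']}\nAnswer: {d['answer']}"
        for d in demos
    ]
    blocks.append(f"{question}\n{options}\nAnswer:")
    return "\n\n".join(blocks)
```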
Main Takeaways
- Zero-shot Chain-of-Thought fails on moral reasoning (scoring worse than direct answering), suggesting that a single linear reasoning chain is not sufficient for morally nuanced scenarios
- Thought Experiments prompting successfully elicits counterfactuals that uncover hidden moral conflicts, significantly boosting zero-shot performance
- Self-consistency generally hurts zero-shot performance for the baselines on this task, but helps slightly with Thought Experiments (see the sketch after this list)
- Few-shot demonstrations (5 examples) yield the best absolute performance (up to 80.45%), but require manual effort to craft counterfactual reasoning traces
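Because self-consistency figures in the comparison above, here is a minimal sketch of the standard sample-and-majority-vote procedure it refers to; the `query_model` client, its `temperature` parameter, and the sample count are assumptions:

```python
import re
from collections import Counter

def self_consistency_answer(prompt: str, query_model, n_samples: int = 10) -> str | None:
    """Sample n reasoning paths at nonzero temperature and majority-vote the answers."""
    votes = Counter()
    for _ in range(n_samples):
        completion = query_model(prompt, temperature=0.7)  # hypothetical stochastic client
        match = re.search(r"\b([ABCD])\b", completion)     # pull out the chosen letter
        if match:
            votes[match.group(1)] += 1
    return votes.most_common(1)[0][0] if votes else None
```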