Evaluation Setup
Code-generation benchmarks evaluated for functional correctness
Benchmarks:
- HumanEval (Python coding problems)
- MHPP (Mostly Hard Python Problems)
Metrics:
- PassRate (Pass@k implied)
- Statistical methodology: Not explicitly reported in the paper
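Since the paper reports pass rates, the standard unbiased Pass@k estimator (from the HumanEval literature) is the likely underlying metric. A minimal sketch, assuming `n` total samples per problem of which `c` pass the tests:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimator: probability that at least one of k
    samples drawn (without replacement) from n generations, c of which
    are correct, passes the unit tests."""
    if n - c < k:
        # Fewer incorrect samples than k: a correct one is guaranteed.
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# Example: 10 samples, 3 correct, evaluate Pass@1.
print(pass_at_k(10, 3, 1))  # → 0.3
```

With `k = 1` this reduces to the empirical fraction of correct samples, `c / n`.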
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
|-----------|--------|----------|------------|---|
| MHPP | PassRate | Not reported in the paper | Not reported in the paper | +6.1% |
| HumanEval | PassRate | Not reported in the paper | Not reported in the paper | +3.5% |
Main Takeaways
- UnCert-CoT achieves up to a 6.1% improvement on MHPP, indicating it is particularly effective on harder problems where baselines struggle
- The method is robust across different model families (DeepSeek, CodeLlama, Qwen), suggesting the 'overthinking' problem and the uncertainty solution are model-agnostic
- By selectively applying CoT, the method aims to preserve efficiency for simple code lines while allocating compute to complex logic
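The selective-CoT idea above can be sketched as an uncertainty gate: measure the model's confidence on the next code line, and only pay the cost of chain-of-thought reasoning when confidence is low. This is an illustrative sketch, not the paper's implementation; the model interface (`token_probs`, `direct`, `cot`), the `MockModel`, and the threshold value are all assumptions.

```python
import math

THRESHOLD = 0.5  # illustrative uncertainty cutoff (not from the paper)

def line_uncertainty(token_probs):
    """Mean negative log-probability (-log p) over the tokens of a
    greedily decoded candidate line; higher means less confident."""
    return sum(-math.log(p) for p in token_probs) / len(token_probs)

def generate_line(model, prompt):
    """Gate chain-of-thought on model uncertainty (hypothetical API)."""
    if line_uncertainty(model.token_probs(prompt)) > THRESHOLD:
        return model.cot(prompt)      # reason step-by-step on hard lines
    return model.direct(prompt)       # cheap direct decoding on easy lines

class MockModel:
    """Toy stand-in: confident on 'easy' prompts, uncertain otherwise."""
    def token_probs(self, prompt):
        return [0.99] * 5 if "easy" in prompt else [0.4] * 5
    def direct(self, prompt):
        return "x = 1"
    def cot(self, prompt):
        return "# step-by-step reasoning...\nx = compute()"

print(generate_line(MockModel(), "easy task"))  # direct path
print(generate_line(MockModel(), "hard task"))  # CoT path
```

The gate preserves the efficiency claim: CoT tokens are spent only where the entropy signal suggests the model would otherwise guess.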