| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| P-CoT consistently improves performance on syllable counting, rhyme generation, and g2p tasks relative to the baselines. | | | | |
| PhonologyBench (Syllable Counting) | Exact Match Accuracy | 16.0 | 48.8 | +32.8 |
| PhonologyBench (Syllable Counting) | Exact Match Accuracy | 21.1 | 57.4 | +36.3 |
| PhonologyBench (Rhyme Generation - Common) | Success Rate | Not explicitly reported in the paper | Not explicitly reported in the paper | +52.0 |
| PhonologyBench (g2p - Low Frequency) | Accuracy | 35.5 | 65.5 | +30.0 |
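
The Δ column appears to be the difference between the "This Paper" and "Baseline" scores, in percentage points. The sketch below checks that reading against the rows that report both values. The row labels and the `delta` helper are illustrative names, not taken from the paper. The two syllable-counting rows are not distinguished in the table, so they are labeled by position only. The rhyme-generation row is left out because its absolute scores are not reported.

```python
# Rows copied from the table: (label, baseline, this_paper, reported_delta).
# Labels are illustrative; the table does not say how the two
# syllable-counting rows differ.
rows = [
    ("Syllable Counting (first row)", 16.0, 48.8, 32.8),
    ("Syllable Counting (second row)", 21.1, 57.4, 36.3),
    ("g2p - Low Frequency", 35.5, 65.5, 30.0),
]


def delta(baseline: float, this_paper: float) -> float:
    """Absolute improvement in percentage points, rounded to one decimal."""
    return round(this_paper - baseline, 1)


# Collect any row whose reported delta disagrees with the recomputed one.
mismatches = [
    label
    for label, baseline, this_paper, reported in rows
    if delta(baseline, this_paper) != reported
]

for label, baseline, this_paper, reported in rows:
    print(f"{label}: {baseline} -> {this_paper} (delta {delta(baseline, this_paper):+.1f})")
```

Under this reading, `mismatches` stays empty for the three rows above, so the reported deltas are consistent with simple subtraction.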