| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| *Ablation study showing the impact of SFT and GRPO stages on multiple-choice accuracy.* | | | | |
| Temperament-Sensitive MCQ Benchmark | Accuracy (%) | 42.0 | 78.5 | +36.5 |
| Temperament-Sensitive MCQ Benchmark | Accuracy (%) | 71.5 | 78.5 | +7.0 |
| *Human expert evaluation of generated advice quality.* | | | | |
| Expert Human Evaluation | Psychological Appropriateness (0-1) | 0.76 | 0.88 | +0.12 |
| Expert Human Evaluation | Caregiving Suitability (0-1) | 0.74 | 0.85 | +0.11 |