| Benchmark | Metric | Baseline | This Paper | Δ (This Paper − Baseline) |
|---|---|---|---|---|
| Factuality results show that Llama-2-chat models often outperform GPT models, particularly on counterfactual queries, likely because GPT models tend to follow the instruction (sycophancy) rather than refute the false premise. | ||||
| FFT | Accuracy | 0.170 | 0.585 | +0.415 |
| FFT | Accuracy | 0.509 | 0.645 | +0.136 |
| Fairness results indicate that GPT models generally exhibit lower bias across demographic groups (lower CV) than open-source models. | ||||
| FFT | CV (Coefficient of Variation) | 0.655 | 0.177 | -0.478 |
| FFT | CV (Coefficient of Variation) | 0.457 | 0.000 | -0.457 |
| Toxicity results highlight the gap between utterance-level and context-level detection. | ||||
| FFT | Non-toxicity Score | 0.902 | 0.778 | -0.124 |
| FFT | Non-toxicity Score | 0.724 | 0.852 | +0.128 |
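
The table reports a coefficient of variation (CV) for fairness and a Δ column, but the source does not state how either is computed. The sketch below assumes the conventional definition of CV (population standard deviation divided by the mean) applied to per-group scores, and assumes Δ is the "This Paper" value minus the "Baseline" value, which matches the rows above. The function names and the sample group scores are hypothetical and serve only as illustration.

```python
import statistics


def coefficient_of_variation(scores):
    """Population standard deviation divided by the mean.

    Assumed definition; the source does not specify the exact formula.
    A value of 0 means every demographic group received the same score.
    """
    mean = statistics.fmean(scores)
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return statistics.pstdev(scores) / mean


def delta(baseline, this_paper):
    """Difference reported in the last column, rounded to three decimals."""
    return round(this_paper - baseline, 3)


# Hypothetical per-group scores, not taken from the paper.
uniform_groups = [0.5, 0.5, 0.5]
uneven_groups = [1.0, 3.0]

print(coefficient_of_variation(uniform_groups))
print(coefficient_of_variation(uneven_groups))

# Values from the first accuracy row and the first CV row of the table.
print(delta(0.170, 0.585))
print(delta(0.655, 0.177))
```

Under this definition, a CV of 0.000 (as in the second fairness row) would mean the scores were identical across all groups. A negative Δ is an improvement for CV but a regression for accuracy and the non-toxicity score.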