| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Human evaluation on open-domain dialogue shows SeeKeR outperforming baselines in consistency and knowledge while reducing factual errors. | ||||
| Wizard of Internet (Human Eval) | Consistent | 65.06 | 78.47 | +13.41 |
| Wizard of Internet (Human Eval) | Factually Incorrect | 4.21 | 3.94 | -0.27 |
| Wizard of Internet (Human Eval) | Knowledgeable | 27.88 | 46.49 | +18.61 |
| Wizard of Internet (Human Eval) | Per-Turn Engaging | 83.52 | 90.41 | +6.89 |
| Comparison on Topical Prompts (Jan 2022) showing ability to handle new information compared to frozen LMs. | ||||
| Topical Prompts | True | 14 | 43 | +29 |
| Topical Prompts | Hallucination | 73 | 58 | -15 |
| Topical Prompts | Hallucination | 62 | 58 | -4 |
| Topical Prompts | Topical | 4 | 15 | +11 |
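
The Δ column appears to be the "This Paper" value minus the "Baseline" value. The following is a minimal sketch that recomputes it from the rows above, under that assumption. The names `ROWS`, `delta`, and `check_rows` are illustrative and do not come from the paper.

```python
from decimal import Decimal

# Rows copied from the table above:
# (benchmark, metric, baseline, this_paper, reported_delta)
ROWS = [
    ("Wizard of Internet (Human Eval)", "Consistent", "65.06", "78.47", "+13.41"),
    ("Wizard of Internet (Human Eval)", "Factually Incorrect", "4.21", "3.94", "-0.27"),
    ("Wizard of Internet (Human Eval)", "Knowledgeable", "27.88", "46.49", "+18.61"),
    ("Wizard of Internet (Human Eval)", "Per-Turn Engaging", "83.52", "90.41", "+6.89"),
    ("Topical Prompts", "True", "14", "43", "+29"),
    ("Topical Prompts", "Hallucination", "73", "58", "-15"),
    ("Topical Prompts", "Hallucination", "62", "58", "-4"),
    ("Topical Prompts", "Topical", "4", "15", "+11"),
]


def delta(baseline: str, this_paper: str) -> Decimal:
    """Assumed definition: this-paper score minus baseline score."""
    # Decimal avoids binary floating-point noise on two-decimal values.
    return Decimal(this_paper) - Decimal(baseline)


def check_rows(rows):
    """Return the rows whose reported delta differs from the recomputed one."""
    mismatches = []
    for benchmark, metric, baseline, this_paper, reported in rows:
        if delta(baseline, this_paper) != Decimal(reported):
            mismatches.append((benchmark, metric, reported))
    return mismatches


if __name__ == "__main__":
    for benchmark, metric, baseline, this_paper, _ in ROWS:
        print(f"{benchmark} / {metric}: {delta(baseline, this_paper):+}")
```

Note that the sign of Δ does not by itself indicate improvement: for "Factually Incorrect" and "Hallucination", lower is better, so a negative Δ is the favorable direction.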