| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Automated evaluation across 12,432 suggestions shows consistent performance across model configurations: Overall Usefulness and Novelty each differ by less than 0.03. | | | | |
| Security Copilot Sessions (Automated) | Overall Usefulness | 0.870 | 0.884 | +0.014 |
| Security Copilot Sessions (Automated) | Novelty | 0.905 | 0.933 | +0.028 |
| Manual expert evaluation reveals a large quality gap (21.9 percentage points) in 'Extremely Useful' suggestions between full and hybrid models. | | | | |
| Security Copilot Sessions (Manual) | Extremely Useful % | 53.1 | 75.0 | +21.9 |
| Security Copilot Sessions (Manual) | Not Useful % | 3.5 | 2.0 | -1.5 |