**Scaling experiments (GPT-2 architecture).** ConceptLM outperforms parameter-matched baselines across various scales.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Average (9 tasks) | Accuracy | 47.7 | 48.2 | +0.5 |
| Lambada (OpenAI) | Perplexity | 12.87 | 11.13 | -1.74 |

**Continual pre-training (Llama-3.1-8B).** NCP remains effective on large, pre-trained models.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| MMLU | Accuracy | 66.3 | 66.5 | +0.2 |
| ARC-Challenge | Accuracy | 57.7 | 58.2 | +0.5 |
| Average (Downstream) | Accuracy | 34.0 | 35.3 | +1.3 |