Evaluation Setup
Standard NLP benchmarks covering commonsense reasoning, math, coding, and recall
Benchmarks:
- MMLU (general knowledge, 5-shot)
- HellaSwag (commonsense reasoning, 0-shot)
- GSM8K (math word problems)
- SQuAD-C (context-based recall)
Metrics:
- Accuracy
- Throughput (token/sec)
- Cache Size (MB)
- Statistical methodology: Not explicitly reported in the paper
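The cache-size metric can be made concrete with a back-of-the-envelope estimate. The sketch below is a hypothetical illustration of how replacing most global-attention layers with sliding-window attention (plus KV sharing across layer pairs) shrinks the KV cache; every layer count, head count, and sequence length here is an assumption for illustration, not the paper's configuration.

```python
# Hypothetical KV-cache size estimate; all parameter values are
# assumptions for a ~1.5B-class decoder, not Hymba's exact config.

def kv_cache_mb(num_layers, num_kv_heads, head_dim, seq_len,
                bytes_per_elem=2):
    """KV cache size in MB: keys + values, fp16 by default."""
    elems = 2 * num_layers * num_kv_heads * head_dim * seq_len
    return elems * bytes_per_elem / (1024 ** 2)

# Global attention in every layer:
full = kv_cache_mb(num_layers=32, num_kv_heads=8, head_dim=64,
                   seq_len=8192)

# 3 global-attention layers + sliding-window layers (fixed 1024-token
# window) with KV shared across pairs of adjacent layers:
windowed = (kv_cache_mb(num_layers=3, num_kv_heads=8, head_dim=64,
                        seq_len=8192)
            + kv_cache_mb(num_layers=29 // 2, num_kv_heads=8,
                          head_dim=64, seq_len=1024))

print(round(full), round(windowed))
```

With these assumed numbers the cache drops by roughly an order of magnitude, the same qualitative effect as the 918 MB vs. 79 MB result reported above.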
Key Results
Comparisons against SOTA small language models show Hymba-1.5B's superior performance and efficiency.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Average (MMLU, ARC, PIQA, Wino, Hella, SQuAD) | Average accuracy (%) | 59.74 | 61.06 | +1.32 |
| Inference Efficiency | Cache size (MB) | 918 | 79 | -839 |
| Inference Efficiency | Throughput (token/sec) | 191 | 664 | +473 |
| Recall Task | Recall (%) | 19.23 | 49.90 | +30.67 |
| GSM8K | Accuracy (%) | 44.4 | 56.4 | +12.0 |
Main Takeaways
- Parallel fusion of Attention and SSM outperforms sequential stacking by allowing complementary processing of the same input
- Meta tokens effectively function as learned cache initialization, recovering performance lost by sliding window attention
- Global attention is only needed in a few layers (first, middle, last) to maintain high recall, allowing aggressive use of sliding window attention elsewhere
- Cross-layer KV sharing further reduces memory footprint without degrading performance
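The parallel-fusion takeaway can be sketched in a few lines: both branches see the same input, and their normalized outputs are combined, rather than one branch feeding the other as in sequential stacking. The toy block below uses causal softmax attention and a minimal diagonal linear state-space scan; the shapes, RMS normalization, and simple averaging rule are simplifying assumptions, not Hymba's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D = 6, 8  # toy sequence length and model dimension

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(x, Wq, Wk, Wv):
    """Single-head causal softmax attention."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(x.shape[-1])
    mask = np.tril(np.ones((T, T), dtype=bool))  # token t sees positions <= t
    scores = np.where(mask, scores, -np.inf)
    return softmax(scores) @ v

def ssm(x, a, B, C):
    """Minimal diagonal linear SSM scan: h_t = a * h_{t-1} + B x_t."""
    h = np.zeros(D)
    out = np.empty_like(x)
    for t in range(x.shape[0]):
        h = a * h + B @ x[t]
        out[t] = C @ h
    return out

def rmsnorm(x):
    return x / np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + 1e-6)

x = rng.standard_normal((T, D))
Wq, Wk, Wv = (rng.standard_normal((D, D)) * 0.1 for _ in range(3))
a = 0.9 * np.ones(D)                       # assumed per-channel decay
B = rng.standard_normal((D, D)) * 0.1
C = rng.standard_normal((D, D)) * 0.1

# Parallel fusion: normalize each branch's output, then average.
# Sequential stacking would instead compute ssm(attention(x), ...).
y = 0.5 * (rmsnorm(attention(x, Wq, Wk, Wv)) + rmsnorm(ssm(x, a, B, C)))
print(y.shape)
```

The normalization step matters in practice: it keeps the two branches on a comparable scale so neither dominates the fused output.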