Evaluation Setup
Paired evaluation: model weights and decoding settings are held fixed, and only ARACH is toggled on or off
Benchmarks:
- Language modeling tasks (next-token prediction)
- Cloze-style benchmarks (fill-in-the-blank)
Metrics:
- Evaluation metrics: Not explicitly reported in the paper snippet
- Statistical methodology: Not explicitly reported in the paper
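The paired setup above can be sketched as follows. This is a minimal illustration, not the paper's evaluation code: the names `paired_eval`, `score_off`, and `score_on` are hypothetical, and perplexity is used only as a stand-in metric, since the source does not state which metrics were reported.

```python
import math
from typing import Callable, List, Sequence, Tuple

# Hypothetical scorer type: maps a token sequence to per-token log-probabilities.
# In a real run, both scorers would wrap the same frozen model and decoding
# settings, differing only in whether the plugin is enabled.
Scorer = Callable[[Sequence[int]], List[float]]


def perplexity(log_probs: Sequence[float]) -> float:
    """Perplexity from natural-log token probabilities."""
    if not log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(log_probs) / len(log_probs))


def paired_eval(
    examples: Sequence[Sequence[int]],
    score_off: Scorer,
    score_on: Scorer,
) -> List[Tuple[float, float, float]]:
    """Score every example under both conditions.

    Returns one (ppl_off, ppl_on, ppl_off - ppl_on) tuple per example;
    a positive difference means the "on" condition had lower perplexity.
    """
    rows = []
    for tokens in examples:
        ppl_off = perplexity(score_off(tokens))
        ppl_on = perplexity(score_on(tokens))
        rows.append((ppl_off, ppl_on, ppl_off - ppl_on))
    return rows
```

Because each example is scored under both conditions, any difference in the third column comes from the toggle rather than from a change in data, weights, or decoding.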
Main Takeaways
- ARACH yields consistent gains over the base model without the plugin on both language modeling and cloze-style benchmarks (a qualitative result stated in the text)
- Attention analysis suggests ARACH mitigates the 'attention sink' phenomenon, in which models attend disproportionately to early tokens
- The method demonstrates that internal computation can be effectively engineered at inference time without parameter updates, offering a third path distinct from prompting and fine-tuning
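One simple way to quantify an attention sink is the average attention mass that later query positions place on the first few key positions. The sketch below assumes a single head's row-normalized attention matrix is available as nested lists. The function name `sink_mass` and the choice of statistic are illustrative assumptions, not the analysis used in the paper.

```python
from typing import Sequence


def sink_mass(attn: Sequence[Sequence[float]], num_sink_tokens: int = 1) -> float:
    """Mean attention weight placed on the first `num_sink_tokens` keys.

    attn[q][k] is the weight from query position q to key position k,
    with each row summing to 1. Queries at positions below
    `num_sink_tokens` are skipped, since under causal masking they can
    only attend to the sink positions themselves.
    """
    rows = attn[num_sink_tokens:]
    if not rows:
        raise ValueError("sequence too short to measure sink mass")
    return sum(sum(row[:num_sink_tokens]) for row in rows) / len(rows)


# Toy causal attention matrix for a 3-token sequence (illustrative values).
toy_attn = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.8, 0.1, 0.1],
]
toy_sink = sink_mass(toy_attn)
```

Comparing this statistic with the plugin off and on, on the same inputs, gives a rough indication of whether mass on early tokens has dropped.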