
Attention Overflow: Language Model Input Blur during Long-Context Missing Items Recommendation

Damien Sileo
Centre de Recherche en Informatique, Signal et Automatique de Lille
arXiv (2024)
Recommendation Memory Reasoning Benchmark

📝 Paper Summary

Long-context LLM evaluation, Conversational recommendation
Current flagship LLMs fail to identify missing items in lists longer than 100 elements due to 'attention overflow,' causing them to repeat existing items despite correctly recognizing their presence in isolation.
Core Problem
LLMs degrade significantly when asked to generate items *absent* from a long input list (inductive reasoning), often hallucinating that existing items are missing (repetition).
Why it matters:
  • Critical for conversational recommender systems where users list watch history and expect *novel* suggestions, not repetitions
  • Reveals a specific failure mode in long-context reasoning: models can retrieve specific items ('needle in a haystack') but struggle to attend to the *entire* set to determine what is missing
  • Existing repetition penalties operate at the token level and cannot prevent semantic repetitions of whole items in long contexts
Concrete Example: When a user lists 200 watched movies released in 2022 and asks for a recommendation (a missing movie), Claude 3.5 Sonnet suggests movies already present in the user's list, failing to filter them out.
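The item-level check that a token-level repetition penalty cannot perform can be sketched as follows. This is a minimal illustration, not the paper's evaluation code: the helper names `normalize_title` and `is_repetition`, the normalization rules, and the sample titles are all assumptions made for this example.

```python
import re


def normalize_title(title: str) -> str:
    """Casefold, drop a trailing '(year)', and strip punctuation (assumed rules)."""
    title = re.sub(r"\(\d{4}\)\s*$", "", title)
    title = re.sub(r"[^\w\s]", "", title.casefold())
    return " ".join(title.split())


def is_repetition(suggestion: str, watched: list[str]) -> bool:
    """Return True if the suggested item already appears in the user's list."""
    seen = {normalize_title(t) for t in watched}
    return normalize_title(suggestion) in seen


# Illustrative watch history, not data from the paper.
watched = ["The Batman (2022)", "Nope", "Top Gun: Maverick"]
print(is_repetition("top gun maverick", watched))                   # True
print(is_repetition("Everything Everywhere All at Once", watched))  # False
```

Matching on normalized whole titles rather than tokens is what makes this a semantic repetition check: two strings that differ in casing or punctuation still count as the same item.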
Key Novelty
Missing Item Prediction Task
  • Inverts standard 'needle-in-a-haystack' evaluation: instead of finding a specific item *in* the context, the model must find the only relevant item *not* in the context
  • Demonstrates that while models can recognize whether an item is present (contrastive task), they fail to use this global awareness during generation, a failure the paper terms 'attention overflow'
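A missing-number instance of this task could be built and scored roughly as below. This is a sketch under stated assumptions: the prompt wording, the choice of a contiguous range with one number removed, and the names `make_missing_number_task` and `score_response` are hypothetical and may differ from the paper's actual setup.

```python
import random
import re


def make_missing_number_task(n_items: int, seed: int = 0):
    """Build a shuffled list of n_items numbers with exactly one number withheld."""
    rng = random.Random(seed)
    numbers = list(range(1, n_items + 2))  # n_items + 1 candidates in total
    missing = rng.choice(numbers)
    shown = [n for n in numbers if n != missing]
    rng.shuffle(shown)
    prompt = (
        f"Here is a list of numbers between 1 and {n_items + 1}:\n"
        + ", ".join(map(str, shown))
        + "\nWhich number in that range is missing from the list? "
        "Answer with the number only."
    )
    return prompt, shown, missing


def score_response(response: str, shown: list[int], missing: int) -> str:
    """Classify a model answer as 'correct', 'repetition', or 'other'."""
    match = re.search(r"\d+", response)
    if match is None:
        return "other"
    answer = int(match.group())
    if answer == missing:
        return "correct"
    if answer in set(shown):
        return "repetition"  # the model named an item that was in the context
    return "other"


prompt, shown, missing = make_missing_number_task(128)
print(score_response(str(missing), shown, missing))   # correct
print(score_response(str(shown[0]), shown, missing))  # repetition
```

Separating 'repetition' from 'other' matters here because the failure of interest is specifically the model returning an item that was already in its context, not an arbitrary wrong answer.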
Evaluation Highlights
  • Llama-3-8B-Instruct repetition rate on missing number prediction spikes from ~0% (short context) to >80% at 1024 items
  • Contrastive accuracy ('Is X in the list?') remains comparatively high (~75%) at 1024 items, suggesting the model can still *see* the items but fails to use that information during generation
  • Fine-tuning Llama-3-8B improves in-domain accuracy but fails to generalize to larger item sets or different domains (e.g., movies)
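The two metrics contrasted above can be computed as simple proportions. The sketch below assumes each generation has already been labeled 'correct', 'repetition', or 'other', and that contrastive probes are yes/no questions with boolean gold answers. The function names and the toy inputs are illustrative only and are not results from the paper.

```python
def repetition_rate(outcomes: list[str]) -> float:
    """Fraction of generations that named an item already present in the list."""
    if not outcomes:
        raise ValueError("outcomes must not be empty")
    return sum(o == "repetition" for o in outcomes) / len(outcomes)


def contrastive_accuracy(predictions: list[bool], gold: list[bool]) -> float:
    """Fraction of 'Is X in the list?' probes answered correctly."""
    if not gold or len(predictions) != len(gold):
        raise ValueError("predictions and gold must be non-empty and equal length")
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


# Toy inputs only, chosen to exercise the functions.
print(repetition_rate(["repetition", "correct", "repetition", "other"]))            # 0.5
print(contrastive_accuracy([True, False, True, True], [True, False, False, True]))  # 0.75
```

Reporting both numbers for the same list size is what exposes the gap: a high repetition rate alongside high contrastive accuracy points to a generation-time failure rather than an inability to detect membership.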
Breakthrough Assessment
7/10
Identifies a distinct, cognitively interesting failure mode (inductive vs deductive) in long-context LLMs that standard retrieval benchmarks miss. The 'attention overflow' concept is a valuable diagnostic tool.