Context Channel Capacity: An Information-Theoretic Framework for Understanding Catastrophic Forgetting

Ran Cheng
arXiv (2026)
Memory Reasoning

📝 Paper Summary

Continual Learning (CL) Catastrophic Forgetting
Catastrophic forgetting is inevitable for sequential state-based learners due to information bottlenecks, but can be eliminated by architectures with a high-capacity context channel (HyperNetworks) that regenerate parameters per task.
Core Problem
Sequential learning methods such as Elastic Weight Consolidation (EWC) and Synaptic Intelligence (SI) inevitably overwrite past knowledge because they treat the parameters as a single finite state that must satisfy all tasks simultaneously, which creates an information bottleneck.
Why it matters:
  • Standard regularization methods (EWC, SI) fail catastrophically on simple benchmarks (Split-MNIST accuracy ~19%) despite theoretical claims that they protect earlier tasks
  • There is an 80-percentage-point performance gap between regularization methods and generation methods (HyperNetworks) that lacks a unified theoretical explanation
  • Current architectures often include context mechanisms that are structurally bypassed (ignored) by the optimizer, leading to 'silent' forgetting
Concrete Example: On Split-MNIST, EWC (Elastic Weight Consolidation) tries to protect important weights but achieves only 18.9% accuracy (barely above chance), effectively forgetting previous digits. In contrast, a HyperNetwork with fewer parameters achieves 98.8% accuracy by generating a fresh set of weights for each digit task based on a context signal.
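To make "generating a fresh set of weights ... based on a context signal" concrete, here is a minimal sketch of the idea. It is a toy illustration, not the paper's architecture: the generator is an untrained linear map, and the names make_hypernetwork, generate_weights, and target_forward are hypothetical. The point is only that the shared generator is the stored state, while the task weights are recomputed from the context each time they are needed.

```python
import random

def make_hypernetwork(context_dim, target_size, seed=0):
    """Hypothetical linear generator: one row per target-network weight."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(context_dim)]
            for _ in range(target_size)]

def generate_weights(hyper, context):
    """Map a context vector to a full set of target-network weights."""
    return [sum(h * c for h, c in zip(row, context)) for row in hyper]

def target_forward(weights, x):
    """Toy target network: a single linear unit using the generated weights."""
    return sum(w * xi for w, xi in zip(weights, x))

hyper = make_hypernetwork(context_dim=2, target_size=3)

ctx_a = [1.0, 0.0]  # assumed one-hot context for task A
ctx_b = [0.0, 1.0]  # assumed one-hot context for task B

w_a = generate_weights(hyper, ctx_a)
w_b = generate_weights(hyper, ctx_b)

# Asking for task A again reproduces exactly the same weights,
# because nothing task-specific is stored and then overwritten.
w_a_again = generate_weights(hyper, ctx_a)

print(target_forward(w_a, [1.0, 2.0, 3.0]))
print(target_forward(w_b, [1.0, 2.0, 3.0]))
```

In a trained system the generator itself would still need protection against drift across tasks; this sketch does not model training at all.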
Key Novelty
Context Channel Capacity ($C_{ctx}$) as the predictor of forgetting
  • Introduces $C_{ctx}$, the mutual information between an architecture's context signal and its generated parameters, proving that zero forgetting requires $C_{ctx} \geq$ Task Entropy
  • Establishes an 'Impossibility Triangle': Zero forgetting, online learning, and finite parameters cannot coexist for sequential state-based learners (Paradigm A)
  • Demonstrates that HyperNetworks (Paradigm C) bypass this triangle by treating parameters as function values generated from context rather than maintained state
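The capacity condition above can be illustrated with a toy discrete calculation. This is a sketch under strong assumptions, not the paper's estimator: tasks are uniform over five labels, "parameters" are reduced to string identifiers, and the helper names entropy and mutual_information are hypothetical. It shows that a one-to-one context-to-parameter mapping carries mutual information equal to the task entropy, while a single shared parameter state carries none.

```python
import math
from collections import Counter

def entropy(symbols):
    """Empirical Shannon entropy in bits of a sequence of hashable symbols."""
    n = len(symbols)
    counts = Counter(symbols)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def mutual_information(xs, ys):
    """Empirical I(X; Y) = H(X) + H(Y) - H(X, Y), in bits."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

# Assumed setup: five equally likely tasks, 20 samples each.
tasks = [0, 1, 2, 3, 4] * 20

# Case 1: the context selects a distinct parameter set per task.
per_task_params = [f"theta_{t}" for t in tasks]

# Case 2: one shared parameter state, independent of the context.
shared_params = ["theta_shared" for _ in tasks]

task_entropy = entropy(tasks)
mi_generated = mutual_information(tasks, per_task_params)
mi_shared = mutual_information(tasks, shared_params)

print(f"task entropy      = {task_entropy:.4f} bits")
print(f"I(context; theta) = {mi_generated:.4f} bits (per-task generation)")
print(f"I(context; theta) = {mi_shared:.4f} bits (single shared state)")
```

Only the first case meets the requirement that the context channel carry at least as much information as the task identity.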
Evaluation Highlights
  • HyperNetworks ($C_{ctx} \approx 1$) achieve 98.8% accuracy on Split-MNIST with ~0% forgetting, while EWC ($C_{ctx} = 0$) stays at 18.9% accuracy with ~97% forgetting
  • The proposed 'Wrong-Context Probing' (P5) diagnostic reveals that CFlow, a neural ODE method, has a P5 delta of 0.0, showing that it ignores its context input despite high theoretical capacity
  • A Gradient Context Encoder for CIFAR-10 narrows the gap between task-incremental and class-incremental learning from 23.3 to 0.7 percentage points
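A wrong-context probe of the kind described above can be sketched as follows. The summary does not give the exact definition of the P5 delta, so this assumes it is the mean drop in accuracy when each task is evaluated with another task's context; the function names and the two toy predictors are hypothetical.

```python
def accuracy(predict, examples, context):
    """Fraction of (x, y) pairs classified correctly under the given context."""
    correct = sum(1 for x, y in examples if predict(x, context) == y)
    return correct / len(examples)

def wrong_context_delta(predict, examples_by_task):
    """Mean of (accuracy with right context - accuracy with wrong context)."""
    tasks = list(examples_by_task)
    deltas = []
    for i, task in enumerate(tasks):
        wrong = tasks[(i + 1) % len(tasks)]  # deliberately mismatched context
        right_acc = accuracy(predict, examples_by_task[task], task)
        wrong_acc = accuracy(predict, examples_by_task[task], wrong)
        deltas.append(right_acc - wrong_acc)
    return sum(deltas) / len(deltas)

# Toy data: task 0 labels parity, task 1 labels flipped parity.
examples_by_task = {
    0: [(x, x % 2) for x in range(10)],
    1: [(x, (x + 1) % 2) for x in range(10)],
}

def context_aware(x, context):
    """Toy predictor that actually uses its context input."""
    return (x + context) % 2

def context_blind(x, context):
    """Toy predictor whose output never depends on the context."""
    return x % 2

delta_aware = wrong_context_delta(context_aware, examples_by_task)
delta_blind = wrong_context_delta(context_blind, examples_by_task)

print(delta_aware)  # 1.0
print(delta_blind)  # 0.0
```

A delta of 0.0 means swapping the context changes nothing, which is the signature of a bypassed context pathway; note that it says nothing about absolute accuracy, since the context-blind predictor here is right on one task and wrong on the other.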
Breakthrough Assessment
9/10
Provides a unifying information-theoretic proof that explains why widely used methods (EWC, SI) fail while others succeed. The 'Impossibility Triangle' and the $C_{ctx}$ metric offer a rigorous theoretical foundation that has been missing in Continual Learning.