Routing without Forgetting

📝 Paper Summary

Online Continual Learning (OCL), Vision Transformers (ViT)

RwF replaces static task-specific prompts with an energy-based associative memory that dynamically routes representations within the transformer backbone based on input features in a single forward pass.

Core Problem

In Online Continual Learning, data arrives as a stream and each sample is seen only once. Standard parameter-efficient methods rely on iterative gradient updates to specialize prompts, so they adapt only after several batches of a new distribution have passed, which is too slow for non-stationary streams.

Why it matters:

Real-world systems (e.g., robotics, embedded vision) must adapt to changing environments immediately without offline retraining cycles.
Current prompt-based methods suffer from a lag in adaptation: they must 'learn' to route via optimization, leading to errors during the transition period between tasks.

Concrete Example: When a data stream shifts from 'birds' to 'cars', a standard prompt-tuning model must update its selection weights via gradient descent over multiple batches to switch contexts. During this lag, it misclassifies cars using bird-specific features. RwF calculates the 'car' routing compatibility immediately in the first forward pass.

Key Novelty

Energy-Based Associative Routing

Reinterprets transformer adaptation as a routing problem where the model selects representational subspaces dynamically.
Uses Hopfield Pooling layers to generate input-conditioned prompts via closed-form energy minimization, rather than retrieving static learned prompts from a pool.
Decouples routing decisions (instant, per-sample) from parameter updates (slow, gradient-based), allowing immediate adaptation to distribution shifts.
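The retrieval step described above can be sketched in a few lines. This is a minimal illustration, assuming the prompts are a softmax-weighted sum of value vectors scored against a small set of query vectors. The function names, the toy vectors, and the value of beta are hypothetical and not taken from the paper.

```python
import math

def softmax(scores, beta=1.0):
    """Numerically stable softmax of beta * scores."""
    scaled = [beta * s for s in scores]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def hopfield_pooling(queries, keys, values, beta=1.0):
    """Return one prompt per query as a softmax-weighted sum of values.

    queries: query vectors (the learnable part, per the summary)
    keys, values: per-token vectors computed from the current input
    """
    dim = len(values[0])
    prompts = []
    for q in queries:
        weights = softmax([dot(q, k) for k in keys], beta)
        prompts.append([sum(w * v[d] for w, v in zip(weights, values))
                        for d in range(dim)])
    return prompts

# Hypothetical toy input: 3 tokens pooled into 2 prompts (many-to-few).
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
queries = [[1.0, -1.0], [-1.0, 1.0]]
prompts = hopfield_pooling(queries, keys, values, beta=20.0)
```

No gradient step is involved in producing `prompts`: they are recomputed from whatever input arrives, which is the sense in which routing is decoupled from parameter updates.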

Architecture

[Figure: the RwF transformer layer architecture.]

Evaluation Highlights

Achieves 74.09% final average accuracy on Split-ImageNet-R under strict single-pass online settings.
Achieves 61.37% final average accuracy on the larger Split-ImageNet-S benchmark, where the authors report that it outperforms prompt-based baselines.
Introduces only 2.1% additional trainable parameters while maintaining robust performance in few-shot regimes.

Breakthrough Assessment

8/10

Offers a structurally novel solution to the plasticity-stability dilemma by using associative memory for routing. Strong empirical results on difficult ImageNet benchmarks justify the high score, though code availability is unclear.

⚙️ Technical Details

Problem Definition

Setting: Online Class-Incremental Continual Learning (OCL) where a model trains on a sequence of tasks with disjoint classes, observing each mini-batch only once.

Inputs: A stream of mini-batches B_t from task distribution D_t

Outputs: Class predictions over the union of all classes observed so far
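As a rough illustration of this setting, the sketch below streams each sample exactly once and predicts only among the classes observed so far. The helper names and the toy data are hypothetical, and the training step itself is left as a comment.

```python
def online_stream(tasks, batch_size):
    """Yield (task_id, mini_batch) pairs; every sample is emitted exactly once."""
    for task_id, samples in enumerate(tasks):
        for start in range(0, len(samples), batch_size):
            yield task_id, samples[start:start + batch_size]

def predict_over_seen(logits, seen_classes):
    """Return the highest-scoring class among those observed so far."""
    return max(seen_classes, key=lambda c: logits[c])

# Hypothetical toy stream: (sample_id, label) pairs, two tasks with disjoint classes.
tasks = [
    [("a0", 0), ("a1", 1), ("a2", 0)],
    [("b0", 2), ("b1", 3)],
]
seen_classes = set()
visited = []
for task_id, batch in online_stream(tasks, batch_size=2):
    seen_classes.update(label for _, label in batch)
    visited.extend(sample_id for sample_id, _ in batch)
    # A real learner would take one update on `batch` here and never see it again.
```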

Pipeline Flow

Input Embedding (Patches)
RwF Transformer Block (repeated L times)
Classification Head

System Modules

Projection Layers (Associative Routing)

Project input embeddings into keys (K) and values (V) for retrieval; the queries come from a separate learnable projection (W_Q).

Model or implementation: Linear Layers (W_K, W_V frozen; W_Q learnable)

Hopfield Pooling (Associative Routing)

Generate dynamic routing prompts via energy minimization (softmax attention).

Model or implementation: Associative Operator H

Self-Attention Block

Update token representations using both original tokens and retrieved routing prompts.

Model or implementation: Standard ViT Self-Attention

Novel Architectural Elements

Embedding of HopfieldPooling layers directly within the transformer backbone to perform many-to-few associative retrieval.
Input-conditioned prompt generation mechanism that recomputes prompts per forward pass rather than retrieving stored static parameters.
Discarding processed prompts after each layer to ensure routing remains strictly input-driven and stateless across tasks.
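The data flow through one block can be sketched as follows. This is a simplified illustration rather than the authors' implementation: it assumes the learnable part is a small set of query vectors, uses a single-head attention without learned projections as a stand-in for the ViT self-attention, and omits residual connections, normalization, and the MLP. All names are hypothetical.

```python
import math

def linear(x, w):
    """Apply weight matrix w (one row per output dimension) to vector x."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def attend(queries, keys, values, beta=1.0):
    """Softmax attention: each query returns a weighted sum of values."""
    out = []
    for q in queries:
        scores = [beta * sum(a * b for a, b in zip(q, k)) for k in keys]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        out.append([sum(e / z * v[d] for e, v in zip(exps, values))
                    for d in range(len(values[0]))])
    return out

def rwf_block(tokens, w_k, w_v, prompt_queries, beta=1.0):
    """Route, attend over prompts and tokens jointly, then drop the prompts."""
    keys = [linear(t, w_k) for t in tokens]       # frozen projection
    values = [linear(t, w_v) for t in tokens]     # frozen projection
    prompts = attend(prompt_queries, keys, values, beta)  # many-to-few retrieval
    extended = prompts + tokens                   # prompts join the sequence
    mixed = attend(extended, extended, extended)  # stand-in for self-attention
    return mixed[len(prompts):]                   # prompts are discarded

# Hypothetical toy input: 4 tokens of dimension 2, identity projections.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
identity = [[1.0, 0.0], [0.0, 1.0]]
prompt_queries = [[1.0, -1.0], [-1.0, 1.0]]
out = rwf_block(tokens, identity, identity, prompt_queries, beta=4.0)
```

Because the block returns only the token positions, the next block receives a sequence of the original length and must compute its own prompts from scratch.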

Modeling

Base Model: Vision Transformer (ViT)

Training Method: Online Continual Learning (gradient descent on stream)

Objective Functions:

Purpose: Minimize classification error on the current task stream.

Formally: Standard Cross-Entropy Loss.

Adaptation: Updates only the specific RwF routing parameters (queries) and classifier head; backbone remains largely frozen.

Trainable Parameters: 2.1% of total model parameters
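The trainable fraction is simply the share of parameters that are not frozen. The sketch below shows the bookkeeping with made-up group sizes; the numbers are hypothetical and are not meant to reproduce the 2.1% figure.

```python
def trainable_fraction(param_groups):
    """param_groups maps name -> (num_parameters, is_trainable)."""
    total = sum(n for n, _ in param_groups.values())
    trainable = sum(n for n, flag in param_groups.values() if flag)
    return trainable / total

# Hypothetical sizes chosen for illustration only; not taken from the paper.
param_groups = {
    "backbone": (1000, False),
    "routing_queries": (15, True),
    "classifier_head": (10, True),
}
fraction = trainable_fraction(param_groups)
```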

Training Data:

Split-CIFAR-100: 10 tasks, 10 classes each
Split-ImageNet-R: 10 tasks, 20 classes each
Split-ImageNet-S: 10 tasks, 100 classes each
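A class-incremental split of this kind can be sketched as a partition of class ids into equal, disjoint groups. The total class counts below are the products implied by the list (tasks times classes per task). The contiguous class order is an assumption made for simplicity, and the function name is hypothetical.

```python
def split_classes(num_classes, num_tasks):
    """Partition class ids 0..num_classes-1 into equal, disjoint tasks."""
    if num_classes % num_tasks != 0:
        raise ValueError("num_classes must be divisible by num_tasks")
    per_task = num_classes // num_tasks
    return [list(range(i * per_task, (i + 1) * per_task))
            for i in range(num_tasks)]

# Total classes = tasks x classes per task, as listed above.
benchmarks = {
    "Split-CIFAR-100": 100,
    "Split-ImageNet-R": 200,
    "Split-ImageNet-S": 1000,
}
splits = {name: split_classes(n, 10) for name, n in benchmarks.items()}
```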

Key Hyperparameters:

beta: Inverse temperature that scales the similarity scores before the softmax; larger values give sharper retrieval weights.
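The effect of beta can be seen on a small example. The similarity scores and the two beta values below are hypothetical and chosen only to contrast a nearly uniform weighting with a nearly one-hot one.

```python
import math

def softmax(scores, beta):
    """Numerically stable softmax of beta * scores."""
    scaled = [beta * s for s in scores]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def entropy(p):
    """Shannon entropy in nats; lower means a more peaked distribution."""
    return -sum(x * math.log(x) for x in p if x > 0)

scores = [2.0, 1.0, 0.5]             # hypothetical query-key similarities
soft = softmax(scores, beta=0.1)     # near uniform: retrieval blends many tokens
sharp = softmax(scores, beta=10.0)   # near one-hot: retrieval follows the best match
```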

Compute: Not reported in the paper

Comparison to Prior Work

vs. L2P/DualPrompt: RwF generates prompts dynamically via associative retrieval over the *current input sequence* rather than selecting fixed prompts from a learned pool.
vs. LoRA: RwF focuses on routing representations via energy minimization rather than modifying weight matrices directly.
vs. Mixture-of-Experts (MoE) [not cited in paper]: RwF achieves input-dependent routing via continuous associative memory without discrete expert partitioning or explicit gating networks.

Limitations

Relies on a frozen projection basis (W_K, W_V) which might limit adaptation capacity if the pre-trained features are insufficient.
Strict online constraint means no revisiting data, which fundamentally limits upper-bound performance compared to replay-based methods.
Performance depends on the depth at which RwF layers are inserted (ablation study finding).

Reproducibility

The paper text does not state whether code is available. The method relies on standard ViT backbones and Hopfield layer definitions.

📊 Experiments & Results

Evaluation Setup

Online Class-Incremental Learning (single pass through data stream)

Benchmarks:

Split-ImageNet-R: class-incremental image classification (object recognition)
Split-ImageNet-S: class-incremental image classification (large scale)
Split-CIFAR-100: class-incremental image classification

Metrics:

Final Average Accuracy (A_Final)
Forgetting (measure of performance drop on past tasks)
Statistical methodology: Not explicitly reported in the paper
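Both metrics can be computed from a matrix of per-task accuracies. The sketch below uses the definitions common in the continual learning literature, which is an assumption about the paper's exact formulas; the accuracy values are made up for illustration.

```python
def final_average_accuracy(acc):
    """acc[t][j] = accuracy on task j after training on task t (j <= t)."""
    last = acc[-1]
    return sum(last) / len(last)

def average_forgetting(acc):
    """Mean drop from the best earlier accuracy to the final accuracy."""
    num_tasks = len(acc)
    if num_tasks < 2:
        return 0.0
    drops = []
    for j in range(num_tasks - 1):
        best_earlier = max(acc[t][j] for t in range(j, num_tasks - 1))
        drops.append(best_earlier - acc[-1][j])
    return sum(drops) / len(drops)

# Hypothetical accuracies for three tasks; row t lists tasks 0..t only,
# because a task is evaluated only once it has been seen.
acc = [
    [0.90],
    [0.80, 0.85],
    [0.75, 0.80, 0.70],
]
a_final = final_average_accuracy(acc)
forgetting = average_forgetting(acc)
```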

Key Results

Benchmark	Metric	Baseline	This Paper	Δ (percentage points)
All Benchmarks	Trainable Parameters (%)	100.0	2.1	-97.9

Main Takeaways

RwF achieves 74.09% final average accuracy on Split-ImageNet-R and 61.37% on Split-ImageNet-S. The authors claim this outperforms prior prompt-based and LoRA-based approaches by a large margin (the baseline numbers are not included in this summary).
The method is robust in few-shot regimes, maintaining performance as training samples decrease, unlike gradient-dependent methods that degrade faster.
Scalability is demonstrated on long task sequences (up to 40 tasks), where RwF preserves its advantage over baselines.
Ablation studies show that inserting RwF layers in early blocks is effective, which the authors take as evidence that early-stage routing modulation matters.

📚 Prerequisite Knowledge

Prerequisites

Vision Transformers (ViT) architecture
Online Continual Learning constraints (single pass)
Modern Hopfield Networks (associative memory)
Parameter-Efficient Fine-Tuning (PEFT)

Key Terms

OCL: Online Continual Learning—a training regime where data arrives in a stream and each sample is processed only once (no multiple epochs).

ViT: Vision Transformer—a deep learning model for image processing based on the attention mechanism.

RwF: Routing without Forgetting—the proposed architecture using energy-based routing.

HopfieldPooling: A layer type based on Modern Hopfield Networks that compresses a sequence of inputs into a summary (or prompts) via associative retrieval.

Energy-Based Model: A framework where inference is viewed as minimizing a scalar energy function; here, the attention mechanism is the energy minimization step.

Gibbs distribution: The probability distribution that minimizes free energy; in this paper, the softmax attention weights represent this distribution.

Catastrophic Forgetting: The tendency of neural networks to drastically lose performance on previously learned tasks when trained on new ones.

LoRA: Low-Rank Adaptation—a technique to fine-tune models by updating only low-rank decomposition matrices.

Class-IL: Class-Incremental Learning—a setting where the model must distinguish between all classes seen so far without knowing the task ID.
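For readers who want the Energy-Based Model and Gibbs distribution entries in symbols, the block below writes out the standard modern Hopfield formulation. It is offered as background under the assumption that the paper's associative operator follows this form; the symbols (stored patterns x_i, state ξ, weights p) are introduced here for illustration.

```latex
% Stored patterns X = (x_1, \dots, x_N), state (query) \xi, inverse temperature \beta.
\begin{align}
E(\xi) &= -\beta^{-1} \log \sum_{i=1}^{N} \exp\!\left(\beta\, x_i^{\top} \xi\right)
          + \tfrac{1}{2}\, \xi^{\top} \xi + \text{const}, \\
\xi^{\text{new}} &= X \,\operatorname{softmax}\!\left(\beta\, X^{\top} \xi\right), \\
p_i &= \frac{\exp\!\left(\beta\, x_i^{\top} \xi\right)}
            {\sum_{j=1}^{N} \exp\!\left(\beta\, x_j^{\top} \xi\right)}
     = \arg\min_{p \in \Delta^{N-1}}
       \left[ -\sum_{i} p_i\, x_i^{\top} \xi
              + \beta^{-1} \sum_{i} p_i \log p_i \right]_i .
\end{align}
```

The second line is the closed-form update that lowers the energy in the first line, and the third line states that the softmax weights are the Gibbs distribution minimizing the free energy (expected energy minus entropy scaled by 1/β). With separate key and value projections, the same update reads as softmax attention over the input tokens.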