Representation Finetuning for Continual Learning

📝 Paper Summary

Continual Learning (CL), Parameter-Efficient Fine-Tuning (PEFT)

CoRe adapts pre-trained Vision Transformers to continuous data streams by intervening on hidden representations within a low-rank linear subspace rather than modifying model weights, explicitly controlling drift to prevent forgetting.

Core Problem

Traditional parameter-efficient fine-tuning (PEFT) methods update weights via black-box optimization, lacking explicit control over how representations change, which leads to catastrophic forgetting and sensitivity to domain shifts.

Why it matters:

Pre-trained models (like ViT) require adaptation for downstream tasks, but full finetuning is parameter-inefficient and prone to forgetting previously learned information
Existing PEFT methods (Adapters, Prompts) operate in weight space without interpretability, making it difficult to prevent interference between sequential tasks
Real-world applications like autonomous systems require learning from non-stationary data streams without losing past knowledge, which current weight-based tuning struggles to balance

Concrete Example: If a model classifies a 'Samoyed' (new task) based on visual features similar to a 'spotted dog' (old task), weight-based tuning might overwrite the weights that 'spotted dog' predictions depend on in order to accommodate the 'Samoyed'. CoRe instead leaves the weights frozen and applies a linear correction to the 'Samoyed' representation, aligning it with its target feature within a constrained subspace and preserving the original structure.

Key Novelty

Continual Representation Learning (CoRe)

Shifts the finetuning paradigm from weight space (updating parameters) to representation space (intervening on hidden activations)
Defines task-specific interventions within a low-rank linear subspace, governed by an orthogonality constraint to bound the magnitude of representation updates
Uses an explicit optimization objective to align the transformed representations with target features, rather than relying on implicit black-box weight optimization

Evaluation Highlights

Consistently outperforms representative PEFT methods (Adapter, Prompt, SSF) across Task-Incremental Learning benchmarks including fine-grained (Aircraft) and large-scale (SUN397) datasets
Demonstrates superior performance in Domain-Incremental Learning settings (CDDB, DomainNet), effectively handling domain shifts while maintaining class discriminability
Achieves state-of-the-art results in Class-Incremental Learning (CIFAR100, ImageNet-R), the most challenging setting where task IDs are unavailable

Breakthrough Assessment

7/10

First application of Representation Finetuning (ReFT) to Continual Learning. Theoretically grounded with bounds on representation drift, though the paper relies on established backbones.

⚙️ Technical Details

Problem Definition

Setting: Continual Learning (Task-Incremental, Domain-Incremental, and Class-Incremental)

Inputs: Sequence of tasks D_t = {(x, y)}, where input distributions and label spaces may shift over time

Outputs: Class predictions for samples from current and previously seen tasks

Pipeline Flow

Input Image -> Pre-trained ViT Backbone (Frozen)
Hidden Representations -> CoRe Intervention (Low-rank Linear Transformation)
Calibrated Representations -> Task-Specific Classifiers (Prototypes)

System Modules

ViT Backbone

Extracts initial semantic representations from input images

Model or implementation: ViT-B/16-IN21K or ViT-B/16-IN1K

CoRe Intervention

Modifies hidden representations to align with task targets within a constrained subspace

Model or implementation: Learnable projection matrices R, W, and bias b
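
To make the roles of R, W, and b concrete, here is a minimal NumPy sketch of a low-rank intervention. It assumes the LoReFT-style update from the original ReFT work, h + R (W h + b - R^T h), with R stored as a (d, r) matrix so that it matches the orthogonality term written in this summary. The function name, shapes, and random values are illustrative assumptions; the exact form used by CoRe may differ.

```python
import numpy as np

def lowrank_intervention(h, R, W, b):
    """LoReFT-style edit (assumed form): h + R (W h + b - R^T h).

    h: (d,) hidden representation from the frozen backbone
    R: (d, r) projection with (approximately) orthonormal columns
    W: (r, d) learned linear map, b: (r,) learned bias
    """
    target_coords = W @ h + b    # coordinates the edit wants inside the subspace
    current_coords = R.T @ h     # coordinates h currently has inside the subspace
    return h + R @ (target_coords - current_coords)

rng = np.random.default_rng(0)
d, r = 8, 2
R, _ = np.linalg.qr(rng.standard_normal((d, r)))  # reduced QR gives orthonormal columns
W = rng.standard_normal((r, d))
b = rng.standard_normal(r)
h = rng.standard_normal(d)

h_new = lowrank_intervention(h, R, W, b)
delta = h_new - h
# The change lies in span(R): removing its projection onto span(R) leaves nothing.
residual = delta - R @ (R.T @ delta)
```

With orthonormal columns in R, the edit only moves h inside the r-dimensional subspace spanned by R, which is what makes the size of the change easy to reason about.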

Classifier

Maps calibrated representations to class probabilities

Model or implementation: Class-mean based prototypes (NCM)
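
A nearest-class-mean (NCM) classifier can be sketched in a few lines. The version below assumes Euclidean distance to the prototypes; cosine similarity is also common, and the summary does not say which one the paper uses. Function names and the toy data are hypothetical.

```python
import numpy as np

def class_prototypes(features, labels):
    """Return (sorted class ids, matrix of per-class mean features)."""
    classes = np.unique(labels)
    protos = np.stack([features[labels == c].mean(axis=0) for c in classes])
    return classes, protos

def ncm_predict(x, classes, protos):
    """Assign each row of x to the class with the nearest prototype (Euclidean)."""
    dists = np.linalg.norm(x[:, None, :] - protos[None, :, :], axis=2)
    return classes[np.argmin(dists, axis=1)]

# Toy calibrated representations for two classes.
feats = np.array([[0.0, 0.0], [0.2, 0.0], [1.0, 1.0], [1.2, 1.0]])
labels = np.array([0, 0, 1, 1])
classes, protos = class_prototypes(feats, labels)
preds = ncm_predict(np.array([[0.1, 0.1], [0.9, 1.1]]), classes, protos)
```

Because prototypes are just class means, adding a new class only appends a row to `protos` and leaves the prototypes of earlier classes untouched.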

Novel Architectural Elements

Replacement of weight-based adapter modules with representation-level intervention layers (ReFT) in a Vision Transformer for Continual Learning

Modeling

Base Model: ViT-B/16 (ImageNet-21K and ImageNet-1K pretrained)

Training Method: Stochastic Gradient Descent (SGD) on ReFT parameters

Objective Functions:

Purpose: Align the transformed representation with the target task representation.

Formally: Minimize L_align = || g_theta(e_b) - e_s ||^2

Purpose: Maintain the geometry of the intervention subspace to bound representation drift.

Formally: Minimize L_orth = || R_t^T R_t - I ||_F^2
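
The two objectives above can be evaluated directly. This is a small NumPy sketch, assuming R_t has shape (d, r) so that R_t^T R_t is an r x r matrix; how the two terms are weighted against each other is not stated in this summary, so they are kept separate here. Function names are illustrative.

```python
import numpy as np

def align_loss(g_out, e_s):
    """Squared Euclidean distance || g_theta(e_b) - e_s ||^2."""
    diff = g_out - e_s
    return float(diff @ diff)

def orth_loss(R):
    """Squared Frobenius norm || R^T R - I ||_F^2 for R of shape (d, r)."""
    r = R.shape[1]
    gram = R.T @ R
    return float(np.sum((gram - np.eye(r)) ** 2))

R_orth = np.eye(4)[:, :2]   # two orthonormal columns, so the penalty vanishes
R_scaled = 2.0 * R_orth     # columns of length 2, so R^T R = 4 I and the penalty is 18
```

The orthogonality penalty is zero exactly when the columns of R are orthonormal and grows as they are stretched or become correlated.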

Adaptation: Representation Finetuning (ReFT)

Trainable Parameters: Matrices R (projection), W (linear transform), vector b (bias) per layer/task

Training Data:

TIL: 11 datasets (e.g., CIFAR100 split into 10 tasks)
DIL: 4 datasets (e.g., DomainNet split by domain)
CIL: 7 datasets (e.g., ImageNet-R split by class groups)

Key Hyperparameters:

learning_rate: 0.05
scheduler: Cosine decay
weight_decay: 0.0005
batch_size: 48
epochs: 20
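
For reference, a cosine decay schedule with the values listed above can be sketched as follows. The per-epoch update granularity and the final learning rate of 0 are assumptions; the summary only gives the initial rate, the scheduler type, and the epoch count.

```python
import math

def cosine_lr(epoch, total_epochs=20, base_lr=0.05, min_lr=0.0):
    """Learning rate at the start of `epoch` (0-indexed) under cosine decay."""
    progress = epoch / total_epochs
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# One value per training epoch: starts at 0.05 and decreases monotonically.
schedule = [cosine_lr(e) for e in range(20)]
```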

Compute: Not reported in the paper

Comparison to Prior Work

vs. Adapter/Prompt/SSF: CoRe intervenes on representations with explicit geometric constraints (orthogonality) rather than optimizing weights via black-box backpropagation
vs. Original ReFT (LLM): CoRe adapts ReFT for Vision Transformers and introduces specific constraints for the Continual Learning setting (preventing forgetting)
vs. L2P/DualPrompt [not cited in paper]: CoRe uses subspace intervention rather than a pool of learnable prompts to handle task variety

Limitations

Dependency on the availability of a strong pre-trained backbone (ViT)
The paper text cuts off before displaying quantitative tables, preventing verification of specific numerical margins
Evaluation is limited to Vision Transformers; applicability to CNNs or other architectures is not explored

Reproducibility

The paper does not provide a link to code or released artifacts. It specifies hyperparameters (LR, batch size) and backbone models (ViT-B/16).

📊 Experiments & Results

Evaluation Setup

Three Continual Learning scenarios: Task-Incremental (TIL), Domain-Incremental (DIL), and Class-Incremental (CIL)

Benchmarks:

Task-Incremental Learning (TIL): Aircraft, Caltech101, CIFAR100, DTD, EuroSAT, Flowers102, Food101, MNIST, OxfordPet, StanfordCars, SUN397
Domain-Incremental Learning (DIL): CDDB, CORe50, DomainNet, OfficeHome
Class-Incremental Learning (CIL): CIFAR100, CUB200, ImageNet-A, ImageNet-R, ObjectNet, OmniBenchmark, VTAB

Metrics:

Average Accuracy (Avg)
Last Accuracy (Last)
Statistical methodology: Not explicitly reported in the paper
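
The two metrics are commonly computed from the accuracy measured after each incremental stage. The sketch below assumes that convention (Avg is the mean over stages, Last is the value after the final stage); the summary does not spell out the paper's exact definition, and the numbers are made up for illustration.

```python
def average_and_last_accuracy(stage_accuracies):
    """stage_accuracies[b] = accuracy over everything seen so far, after stage b."""
    if not stage_accuracies:
        raise ValueError("need at least one stage")
    avg = sum(stage_accuracies) / len(stage_accuracies)
    last = stage_accuracies[-1]
    return avg, last

# Hypothetical accuracies (in percent) after three incremental stages.
avg, last = average_and_last_accuracy([90.0, 80.0, 70.0])
```

A large gap between Avg and Last usually indicates that accuracy degrades as more tasks arrive, which is the signature of forgetting.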

Main Takeaways

The provided paper text describes tables (Table 1, 2, 3) but cuts off before listing the actual numeric values; therefore, specific quantitative results cannot be extracted.
Qualitatively, CoRe is reported to outperform Adapter, Prompt, and SSF baselines across TIL, DIL, and CIL settings.
In Task-Incremental Learning, CoRe effectively handles both fine-grained datasets (Aircraft) and large-scale scene datasets (SUN397).
In Domain-Incremental Learning, CoRe is claimed to better learn domain-invariant representations compared to baselines.
In Class-Incremental Learning, the method reportedly achieves state-of-the-art performance, suggesting robust mitigation of catastrophic forgetting.

📚 Prerequisite Knowledge

Prerequisites

Continual Learning (CL) scenarios (TIL, DIL, CIL)
Vision Transformers (ViT) architecture
Parameter-Efficient Fine-Tuning (PEFT)
Linear Algebra (Rank, Orthogonality, Singular Values)

Key Terms

CoRe: Continual Representation Learning—the proposed framework that finetunes models by intervening on hidden representations in a low-rank subspace

ReFT: Representation Finetuning—a method originally for LLMs that modifies hidden states via learned interventions rather than changing model weights

PEFT: Parameter-Efficient Fine-Tuning, a family of methods such as Adapters or LoRA that train only a small number of parameters (often newly added ones) while the pre-trained weights stay frozen

Catastrophic Forgetting: The tendency of neural networks to lose previously learned knowledge when trained on new data

Subspace Intervention: Modifying a vector (representation) only within a specific lower-dimensional subspace defined by a projection matrix, leaving its components outside that subspace unchanged

Orthogonality Constraint: A mathematical restriction ensuring the projection matrix columns are orthonormal (mutually perpendicular and of unit length), preserving the geometry of the representation space

ViT: Vision Transformer—a neural network architecture for image processing based on the Transformer mechanism

TIL: Task-Incremental Learning—CL scenario where the task ID is provided during inference

DIL: Domain-Incremental Learning—CL scenario where the domain changes but classes remain the same; task ID is not provided

CIL: Class-Incremental Learning—CL scenario where new classes are added over time; task ID is not provided

Representation Drift: Unintended changes in the internal features of a model as it learns new tasks, leading to forgetting of old tasks
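
The practical difference between the TIL and CIL definitions above shows up at inference time: with a task ID, only that task's classes compete, while without one all seen classes compete. A minimal sketch, with hypothetical scores and class groupings:

```python
import numpy as np

def predict(logits, task_classes=None):
    """Pick a class; if task_classes is given (TIL), only those classes compete."""
    if task_classes is None:   # CIL: no task ID, every seen class competes
        return int(np.argmax(logits))
    task_classes = np.asarray(task_classes)
    return int(task_classes[np.argmax(logits[task_classes])])

logits = np.array([0.1, 0.3, 2.0, 0.5])   # scores over 4 classes seen so far
task_0 = [0, 1]                           # classes belonging to the first task

cil_pred = predict(logits)                # all classes compete
til_pred = predict(logits, task_0)        # restricted to the first task's classes
```

The same scores yield different predictions in the two settings, which is why CIL is generally the harder scenario.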