Generalized Out-of-Distribution Detection and Beyond in Vision Language Model Era: A Survey

📝 Paper Summary

Open-World Visual Recognition Safety AI

The emergence of Vision Language Models has merged five previously distinct open-world recognition tasks into two primary active fields—Sensory Anomaly Detection and OOD Detection—rendering Open Set Recognition and Novelty Detection largely redundant.

Core Problem

Prior to VLMs, tasks like Anomaly Detection, Novelty Detection, and Open Set Recognition already had overlapping, easily confused definitions; VLMs like CLIP have blurred these boundaries further, leaving researchers unsure which problem settings remain non-trivial.

Why it matters:

Researchers are wasting effort on fields like Open Set Recognition (OSR), which have become conceptually redundant in the VLM era
Subtle definitional differences between the five sub-fields (AD, ND, OSR, OD, OOD) cause confusion and prevent unified progress
The capabilities of CLIP (e.g., zero-shot classification) trivially solve some previously 'hard' problems (like semantic novelty detection), requiring a shift in research focus

Concrete Example: Previously, Open Set Recognition (OSR) was distinguished from OOD detection by specific benchmarks (e.g., splitting CIFAR-10). However, VLM-based OOD detection now uses identical setups (e.g., ImageNet-10 as ID), effectively merging OSR into OOD detection, yet some researchers still treat them as separate.

Key Novelty

Generalized OOD Detection v2 Framework

Proposes a new taxonomy that re-evaluates five fields (AD, ND, OSR, OOD, OD) specifically through the lens of VLM capabilities
Identifies that the field has consolidated: Semantic AD/ND and OSR are becoming inactive or integrated, while Sensory AD and OOD Detection remain the distinct, demanding challenges
Categorizes tasks based on distribution shift type (covariate vs. semantic) and the necessity of ID classification

Architecture

The evolution of the five open-world problems (Sensory AD, Semantic AD/ND, OSR, OOD Detection, OD) into the proposed 'Generalized OOD Detection v2' framework.

Evaluation Highlights

OOD Detection remains highly active with 26 top-venue papers (e.g., NeurIPS, CVPR) identified between 2021 and 2025
Sensory AD has grown significantly with 22 top-venue papers, distinguishing it as a key surviving challenge in the VLM era
Open Set Recognition (OSR) has declined to near-obsolescence with only 1 top-venue paper in the VLM era, indicating its integration into OOD detection

Breakthrough Assessment

9/10

This survey provides a much-needed structural reset for a confused field. By declaring specific sub-fields (like OSR) effectively 'dead' or integrated, it helps direct future research effort toward the problems that remain open.

⚙️ Technical Details

Problem Definition

Setting: Taxonomical categorization of open-world recognition tasks under the paradigm of Vision Language Models (VLMs)

Inputs: Five historical problem settings: Anomaly Detection (AD), Novelty Detection (ND), Open Set Recognition (OSR), OOD Detection, Outlier Detection (OD)

Outputs: A unified 'Generalized OOD Detection v2' framework with two primary active streams: VLM-based AD and VLM-based OOD Detection

Pipeline Flow

Sensory AD (Covariate Shift) -> Remains Active
Semantic AD/ND (Semantic Shift) -> Becomes Inactive
OSR (Multi-class ID + Detection) -> Integrated into OOD Detection
OOD Detection (Multi-class ID + Detection) -> Remains Active
OD (Transductive) -> Becomes Inactive
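
The flow above can be read as a small lookup table. The following is a minimal sketch of that table in Python; the names `Status`, `TAXONOMY`, `fields_with`, and `effective_field` are hypothetical and chosen only for illustration, and the entries simply restate the five lines of the flow.

```python
from enum import Enum


class Status(Enum):
    ACTIVE = "remains active"
    INACTIVE = "becomes inactive"
    MERGED = "integrated into OOD Detection"


# (field, setting as given in the flow above, status in the VLM era)
TAXONOMY = [
    ("Sensory AD", "covariate shift", Status.ACTIVE),
    ("Semantic AD/ND", "semantic shift", Status.INACTIVE),
    ("OSR", "multi-class ID + detection", Status.MERGED),
    ("OOD Detection", "multi-class ID + detection", Status.ACTIVE),
    ("OD", "transductive", Status.INACTIVE),
]


def fields_with(status):
    """Return the names of all fields that have the given status."""
    return [name for name, _, s in TAXONOMY if s is status]


def effective_field(name):
    """Map a historical field to the active field that now covers it.

    Returns the field itself if it is still active, "OOD Detection" if it
    was merged, and None if it became inactive.
    """
    for field, _, status in TAXONOMY:
        if field == name:
            if status is Status.ACTIVE:
                return field
            if status is Status.MERGED:
                return "OOD Detection"
            return None
    raise KeyError(f"unknown field: {name}")


active = fields_with(Status.ACTIVE)
print(active)
print(effective_field("OSR"))
```

Listing the active fields recovers the two streams of the v2 framework, and resolving OSR returns OOD Detection, matching the merge described in the flow.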

System Modules

Sensory AD (Active Fields)

Detects covariate shift (defects/anomalies) within a single class

Model or implementation: VLMs (e.g., CLIP)
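
One common way a CLIP-style model can be applied to this setting is to compare an image embedding against two text embeddings, one describing the normal state and one describing the anomalous state. The sketch below illustrates that idea only; it is not taken from the survey. The embeddings are synthetic 2-D placeholders rather than real CLIP outputs, and `anomaly_score` and the `temperature` value are illustrative assumptions.

```python
import numpy as np


def anomaly_score(image_emb, normal_text_emb, anomalous_text_emb, temperature=0.07):
    """Softmax probability that the image matches the 'anomalous' prompt."""
    def unit(v):
        v = np.asarray(v, dtype=float)
        return v / np.linalg.norm(v)

    img = unit(image_emb)
    # Cosine similarity of the image to each of the two state prompts.
    sims = np.array([unit(normal_text_emb) @ img, unit(anomalous_text_emb) @ img])
    logits = sims / temperature
    logits -= logits.max()  # subtract the max for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return float(probs[1])


# Synthetic placeholders standing in for encoder outputs (assumption).
normal_txt = [1.0, 0.0]      # e.g. a prompt describing a flawless object
anomalous_txt = [0.0, 1.0]   # e.g. a prompt describing a damaged object

good_score = anomaly_score([0.9, 0.1], normal_txt, anomalous_txt)
defect_score = anomaly_score([0.2, 0.8], normal_txt, anomalous_txt)
print(good_score, defect_score)
```

Note that no ID classification is involved: all normal data belongs to one semantic class, so the score only ranks samples by how anomalous they look.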

OOD Detection (Active Fields)

Detects semantic shift (new classes) while classifying ID samples

Model or implementation: VLMs (e.g., CLIP)
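
To make "detects semantic shift while classifying ID samples" concrete, here is a minimal sketch of a zero-shot scoring rule: take the softmax over cosine similarities between an image embedding and the text embeddings of the ID class names, predict the arg-max class, and use one minus the maximum probability as the OOD score. This is one simple scoring rule chosen for illustration, not the survey's prescribed method. The embeddings are synthetic placeholders, and `classify_and_score` and the `temperature` value are assumptions.

```python
import numpy as np


def l2_normalize(x):
    """Scale vectors to unit length along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def classify_and_score(image_emb, text_embs, temperature=0.01):
    """Return (predicted ID class index, OOD score in [0, 1))."""
    img = l2_normalize(np.asarray(image_emb, dtype=float))
    txt = l2_normalize(np.asarray(text_embs, dtype=float))
    sims = txt @ img                 # cosine similarities, shape (num_classes,)
    logits = sims / temperature
    logits -= logits.max()           # subtract the max for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    pred = int(np.argmax(probs))
    ood_score = 1.0 - float(probs[pred])  # low max probability -> likely OOD
    return pred, ood_score


# Three synthetic, mutually orthogonal "class name" embeddings (assumption).
text_embs = np.eye(3)

id_pred, id_score = classify_and_score([0.9, 0.1, 0.0], text_embs)
ood_pred, ood_score = classify_and_score([1.0, 1.0, 1.0], text_embs)
print(id_pred, id_score, ood_score)
```

The first image lies close to class 0 and receives a score near zero; the second is equally similar to every ID class, so its maximum probability is low and its OOD score is high. In practice a threshold on the score, tuned on held-out data, would separate the two cases.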

Semantic AD/ND

Detects new classes (historically separate from OOD)

Model or implementation: VLMs (e.g., CLIP)

Novel Architectural Elements

Generalized OOD Detection v2: A conceptual architecture that merges OSR into OOD Detection and deprecates Semantic AD/ND based on VLM capabilities

Modeling

Base Model: CLIP (Contrastive Language-Image Pre-training)

Training Method: Survey methodology (no model training performed in this paper)

Adaptation: None (Survey paper)

Compute: Not reported in the paper

Comparison to Prior Work

vs. Generalized OOD Detection v1: Merges OSR into OOD Detection because VLM benchmarks are identical [cited in paper]
vs. Generalized OOD Detection v1: Declares Semantic AD/ND inactive due to VLM performance saturation (99%+) [cited in paper]

Limitations

The survey focuses primarily on the image domain, excluding video domain tasks.
The analysis relies heavily on CLIP as the representative VLM; other architectures like SigLIP are less covered.
The declaration of fields as 'inactive' assumes current benchmark saturation implies the problem is solved, which may change with harder datasets.

Reproducibility

Code: https://github.com/AtsuMiyai/Awesome-OOD-VLM

The paper is a survey; the 'Code' link points to a publicly available GitHub repository containing a curated list of the surveyed papers.

📊 Experiments & Results

Evaluation Setup

Systematic review of papers published in top venues (CVPR, NeurIPS, ICCV, ECCV, ICML, ICLR, AAAI, IJCAI, ACMMM, TPAMI, IJCV, TMLR) from 2021 to April 2025.

Benchmarks:

Literature Count (Bibliometric Analysis) [New]

Metrics:

Number of VLM-based papers in top venues
Statistical methodology: Not explicitly reported in the paper

Key Results

The authors counted VLM-based papers in top venues (2021-2025) to determine which research fields remain active.
Benchmark	Metric	Baseline	This Paper	Δ
Top Venue Publications (OOD Detection)	Paper Count	1	26	+25
Top Venue Publications (Sensory AD)	Paper Count	3	22	+19

Main Takeaways

Sensory AD and OOD Detection are the two dominant, active research fields in the VLM era, with over 20 top-venue papers each.
Semantic AD and Novelty Detection have become inactive due to performance saturation (e.g., 99% on standard benchmarks) enabled by CLIP's pre-training.
Open Set Recognition (OSR) has effectively merged with OOD Detection, as their problem definitions and benchmarks in the VLM era are now identical.

📚 Prerequisite Knowledge

Prerequisites

Understanding of distribution shifts (covariate vs. semantic)
Familiarity with CLIP and zero-shot learning
Basic definitions of OOD, AD, and OSR

Key Terms

OOD Detection: Out-of-Distribution Detection—identifying test samples from a different distribution (usually semantic/label shift) than the training data

VLM: Vision Language Model—models like CLIP trained on image-text pairs that can perform zero-shot classification

Sensory AD: Sensory Anomaly Detection—detecting anomalies caused by covariate shift (e.g., defects, noise) where all normal data comes from the same semantic class

Semantic AD: Semantic Anomaly Detection—detecting samples belonging to new, unseen classes (label shift)

OSR: Open Set Recognition—a task requiring a model to classify known classes correctly while rejecting unknown classes; now largely integrated into OOD detection

Covariate Shift: A change in the distribution of input data features (e.g., lighting, texture, domain) while the relationship to the label remains consistent

Semantic Shift: A change where the input belongs to an entirely new object class or category not seen during training

Inductive Learning: Learning where the model generalizes from a training set to unseen test data (standard train-test split)

Transductive Learning: Learning where the model has access to both labeled training data and unlabeled test data (all observations) during the learning process