MM-WLAuslan: Multi-View Multi-Modal Word-Level Australian Sign Language Recognition Dataset

📝 Paper Summary

Sign Language Recognition, Video Action Recognition, Multi-modal Learning

MM-WLAuslan is the first large-scale, multi-view, multi-modal dataset for word-level Australian Sign Language recognition, containing over 282,000 videos of 3,215 glosses to benchmark ISLR systems.

Core Problem

Australian Sign Language (Auslan) lacks a dedicated large-scale, word-level dataset necessary for developing robust Isolated Sign Language Recognition (ISLR) systems.

Why it matters:

Existing datasets are either limited in vocabulary size (e.g., Purdue RVL-SLLL) or lack critical depth/multi-view information (e.g., WLASL, MS-ASL), hindering models from learning 3D spatial dynamics.
Regional sign languages like Auslan are distinct from ASL/BSL; without specific datasets, assistive technologies cannot support the 3.6 million Australians with hearing loss.
Current ISLR methods often fail in real-world scenarios due to occlusion and viewpoint variations, which single-view RGB datasets cannot address.

Concrete Example: A sign recognition model trained on single-view RGB data might fail to distinguish between signs that look similar from the front but differ in depth or side profile. Current datasets like WLASL provide no depth or side views to correct this ambiguity.

Key Novelty

MM-WLAuslan Dataset

Curates the largest Auslan dataset to date with 3,215 glosses and 282,000+ videos, significantly expanding beyond previous small-scale Auslan attempts.
Captures every sign simultaneously with four cameras placed around the front of the signer (covering left-front, front, and right-front positions) using two camera types (Kinect-V2, RealSense) to enable robust multi-view and cross-camera research.
Includes diverse testing subsets (Studio, In-the-Wild, Synthetic Background, Temporal Disturbance) to rigorously evaluate model robustness against real-world variations.

Architecture

The multi-view, multi-modal recording setup showing the positioning of cameras relative to the signer.

Evaluation Highlights

Benchmarked state-of-the-art methods (e.g., MST-Net) are markedly less accurate in cross-camera settings, where training and test videos come from different camera types, than when the camera type is kept the same, highlighting the domain gap between sensors.
Multi-view fusion (using all 4 views) consistently outperforms single-view baselines, demonstrating the value of the dataset's multi-perspective recordings.
The dataset establishes a challenging benchmark where current SOTA methods struggle on 'In-the-Wild' and 'Temporal Disturbance' test sets compared to studio conditions.

Breakthrough Assessment

9/10

This is a foundational contribution for Auslan research. It fills a massive gap by providing a dataset comparable in scale to major ASL/CSL datasets but with superior multi-modal/multi-view richness.

⚙️ Technical Details

Problem Definition

Setting: Isolated Sign Language Recognition (ISLR), which classifies a video sequence containing one sign into a discrete gloss label.

Inputs: Video sequence V = {frame_1, ..., frame_T} containing a single sign, potentially with RGB, Depth, and Pose modalities across multiple views.

Outputs: Predicted gloss label y from a vocabulary of 3,215 classes.
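
The input/output definition above can be made concrete with a small container type. This is only an illustrative sketch: the class names (`ViewClip`, `SignSample`), the view key `"front"`, and the array shapes are assumptions made here, not the dataset's actual file format or loader API.

```python
from dataclasses import dataclass
from typing import Dict, Optional

import numpy as np

NUM_GLOSSES = 3215  # vocabulary size stated in the summary


@dataclass
class ViewClip:
    """One camera view of a single sign (hypothetical container)."""
    rgb: np.ndarray                     # assumed shape (T, H, W, 3), uint8 frames
    depth: Optional[np.ndarray] = None  # assumed shape (T, H, W), if available
    pose: Optional[np.ndarray] = None   # assumed shape (T, K, 2) keypoints, if available

    @property
    def num_frames(self) -> int:
        return self.rgb.shape[0]


@dataclass
class SignSample:
    """All views of one sign plus its gloss label."""
    views: Dict[str, ViewClip]  # keyed by a view name chosen by the user
    gloss_id: int               # integer class index in [0, NUM_GLOSSES)

    def __post_init__(self) -> None:
        if not 0 <= self.gloss_id < NUM_GLOSSES:
            raise ValueError(f"gloss_id must be in [0, {NUM_GLOSSES})")


# Usage: a dummy 8-frame RGB-only clip from a single view.
clip = ViewClip(rgb=np.zeros((8, 4, 4, 3), dtype=np.uint8))
sample = SignSample(views={"front": clip}, gloss_id=42)
print(sample.views["front"].num_frames, sample.gloss_id)
```

Frame counts are deliberately not forced to match across views, since the two camera types may record at different frame rates.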

Pipeline Flow

Data Recording (Multi-view/Multi-modal)
Data Processing (Background Removal/Cropping)
Benchmark Evaluation (Training & Testing)

System Modules

Recording Setup

Capture synchronized video from 4 cameras

Model or implementation: Hardware: 3x Kinect-V2, 1x RealSense

Processing Pipeline

Clean and standardize data

Model or implementation: AlphaPose (for keypoints), BackgroundRemover

Novel Architectural Elements

Simultaneous 4-camera recording setup combining Time-of-Flight (Kinect) and Stereo Vision (RealSense) sensors for cross-camera analysis

Modeling

Base Model: Benchmark models include I3D, C3D, ResNet (2D-CNN), ST-GCN, and MST-Net

Training Method: Standard supervised training on the new dataset splits

Objective Functions:

Purpose: Minimize classification error.

Formally: Cross-Entropy Loss.
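
For reference, the standard cross-entropy loss for a single labeled video can be written out as follows. The notation is an assumption introduced here: z_c denotes the model's logit for class c, y is the ground-truth gloss, and C is the vocabulary size.

```latex
\mathcal{L}(\theta) = -\log p_\theta(y \mid V)
  = -\log \frac{\exp(z_y)}{\sum_{c=1}^{C} \exp(z_c)},
  \qquad C = 3215
```

As a sanity check, a classifier that assigns uniform probability to all 3,215 glosses has a loss of log 3215 ≈ 8.08 nats, which is a useful reference point for an untrained model.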

Training Data:

282,900 total videos
3,215 glosses performed by 73 signers
Split ratio: 6:1:4 (Train:Val:Test)
Test set split further into Studio (STU), In-the-Wild (ITW), Synthetic (SYN), Temporal Disturbance (TED)

Key Hyperparameters:

learning_rate: Not explicitly reported in the paper
batch_size: Not explicitly reported in the paper

Compute: Not reported in the paper

Comparison to Prior Work

vs. WLASL: MM-WLAuslan includes depth and 4 views per sample vs. single RGB view
vs. DGS Kinect 40: MM-WLAuslan has 3,215 glosses vs. 40 glosses
vs. MS-ASL: MM-WLAuslan is captured in studio with multi-modal sensors vs. scraped YouTube videos (high noise, no depth)

Limitations

The dataset is recorded in a controlled studio environment (green screen), which may not fully reflect natural lighting/backgrounds despite synthetic augmentation.
Test set 'In-the-Wild' is simulated via background replacement rather than being truly recorded in diverse real-world locations.
The paper focuses on dataset construction and benchmarking; it does not propose a novel model architecture.

Reproducibility

Code: https://github.com/MM-WLAuslan/MM-WLAuslan

Dataset and benchmarks are publicly available at https://github.com/MM-WLAuslan/MM-WLAuslan. The paper details the split ratios and camera setups precisely.

📊 Experiments & Results

Evaluation Setup

Isolated Sign Language Recognition on MM-WLAuslan splits

Benchmarks:

MM-WLAuslan (Isolated Sign Language Recognition) [New]

Metrics:

Top-1 Accuracy
Top-5 Accuracy
Statistical methodology: Not explicitly reported in the paper
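
The two accuracy metrics can be computed from a matrix of class scores as sketched below. This is a generic NumPy implementation written for illustration, not the paper's evaluation script; the function name `top_k_accuracy` and the toy scores are assumptions.

```python
import numpy as np


def top_k_accuracy(scores: np.ndarray, labels: np.ndarray, k: int = 1) -> float:
    """Fraction of samples whose true label is among the k highest-scoring classes.

    scores: (N, C) array of per-class scores; labels: (N,) integer class indices.
    """
    # argpartition places the k largest scores in the last k columns (unordered).
    top_k = np.argpartition(scores, -k, axis=1)[:, -k:]
    hits = (top_k == labels[:, None]).any(axis=1)
    return float(hits.mean())


# Toy example with 3 samples and 3 classes.
scores = np.array([
    [0.1, 0.7, 0.2],
    [0.5, 0.3, 0.2],
    [0.2, 0.3, 0.5],
])
labels = np.array([1, 1, 0])

print(top_k_accuracy(scores, labels, k=1))  # 1 of 3 correct
print(top_k_accuracy(scores, labels, k=2))  # 2 of 3 correct
```

With 3,215 classes, Top-5 is a useful complement to Top-1 because visually similar glosses often end up close together in the ranking.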

Key Results

Benchmark results demonstrate the dataset's difficulty and the impact of different modalities/views.

Benchmark	Metric	Baseline	This Paper	Δ
MM-WLAuslan (Test Set)	Top-1 Accuracy	Not reported in the paper	Not reported in the paper	Not reported in the paper

Experiment Figures

Examples of the four test subsets: Studio (STU), In-the-Wild (ITW), Synthetic (SYN), and Temporal Disturbance (TED).

Main Takeaways

MM-WLAuslan is the largest and most diverse Auslan dataset, enabling robust ISLR research.
Multi-view fusion generally improves recognition performance over single-view approaches.
Cross-camera and In-the-Wild settings present significant performance drops for current SOTA models, highlighting areas for future research.
The dataset supports investigation into both pixel-based and pose-based methods due to high-quality RGB-D and keypoint availability.
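
One simple way to combine views is late fusion: run a classifier on each view and average the resulting class probabilities. The summary does not say which fusion scheme the benchmarks use, so the sketch below is an assumption for illustration only, with hypothetical view names and toy logits.

```python
from typing import Dict

import numpy as np


def softmax(logits: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def fuse_views(view_logits: Dict[str, np.ndarray]) -> np.ndarray:
    """Average per-view class probabilities (late fusion).

    view_logits maps a view name to an (N, C) logit array; returns (N, C) probabilities.
    """
    probs = np.stack([softmax(l) for l in view_logits.values()], axis=0)  # (V, N, C)
    return probs.mean(axis=0)


# Toy example: the front view favors class 0, the left-front view strongly favors class 1.
view_logits = {
    "front": np.array([[2.0, 1.0, 0.0]]),
    "left_front": np.array([[0.0, 3.0, 0.0]]),
}
fused = fuse_views(view_logits)
print(fused.argmax(axis=1))  # fused prediction per sample
```

Averaging probabilities rather than raw logits keeps a single overconfident view from dominating purely through logit scale.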

📚 Prerequisite Knowledge

Prerequisites

Understanding of Isolated Sign Language Recognition (ISLR)
Familiarity with RGB-D sensors (Kinect, RealSense)
Basic knowledge of video action recognition models (3D CNNs, GCNs)

Key Terms

ISLR: Isolated Sign Language Recognition—classifying a single sign video into a word/gloss, distinct from continuous sign language translation.

Gloss: A written label (usually an English word) that represents a specific sign.

RGB-D: Red, Green, Blue, and Depth—video data that includes color and distance information for each pixel.

ST-GCN: Spatio-Temporal Graph Convolutional Network—a deep learning architecture that models the human skeleton as a graph moving over time.

MST-Net: Multi-scale Spatial Temporal Network—a state-of-the-art method for sign language recognition.

Kinect-V2: A depth camera using Time-of-Flight technology, offering high resolution.

RealSense: A depth camera using stereo vision technology, offering higher frame rates.

In-the-Wild (ITW): Test data where green screen backgrounds are replaced with real-world dynamic/static scenes to simulate natural environments.
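
The background replacement described for the ITW subset can be illustrated with a basic chroma-key composite. This is a deliberately simple stand-in, not the paper's pipeline (the summary names BackgroundRemover as the processing tool); the function name, the `dominance` threshold, and the toy pixel values are assumptions.

```python
import numpy as np


def replace_green_background(frame: np.ndarray, background: np.ndarray,
                             dominance: int = 40) -> np.ndarray:
    """Swap green-screen pixels in `frame` for the matching pixels in `background`.

    Both inputs are (H, W, 3) uint8 RGB arrays of the same shape.
    """
    rgb = frame.astype(np.int16)  # widen to avoid uint8 wrap-around on subtraction
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    # Treat a pixel as green screen when green exceeds both other channels by a margin.
    mask = (g - np.maximum(r, b)) > dominance
    out = frame.copy()
    out[mask] = background[mask]
    return out


# Toy 2x2 frame: one pure-green pixel, three skin-toned pixels.
frame = np.full((2, 2, 3), (200, 150, 120), dtype=np.uint8)
frame[0, 0] = (0, 255, 0)
background = np.full((2, 2, 3), (10, 20, 30), dtype=np.uint8)

composited = replace_green_background(frame, background)
print(composited[0, 0], composited[1, 1])
```

A hard threshold like this leaves visible fringes around hair and fingers in real footage, which is one reason dedicated matting tools are normally preferred.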