← Back to Paper List

MM-WLAuslan: Multi-View Multi-Modal Word-Level Australian Sign Language Recognition Dataset

Xin Shen, Heming Du, Hongwei Sheng, Shuyun Wang, Huiqiang Chen, Huiqiang Chen, Zhuojie Wu, Xiaobiao Du, Jiaying Ying, Rui Lu, Qingzheng Xu, Xin Yu
The University of Queensland
Neural Information Processing Systems (2024)
MM Benchmark

📝 Paper Summary

Sign Language Recognition Video Action Recognition Multi-modal learning
MM-WLAuslan is the first large-scale, multi-view, multi-modal dataset for word-level Australian Sign Language recognition, containing over 282,000 videos of 3,215 glosses to benchmark ISLR systems.
Core Problem
Australian Sign Language (Auslan) lacks a dedicated large-scale, word-level dataset necessary for developing robust Isolated Sign Language Recognition (ISLR) systems.
Why it matters:
  • Existing datasets are either limited in vocabulary size (e.g., Purdue RVL-SLLL) or lack critical depth/multi-view information (e.g., WLASL, MS-ASL), hindering models from learning 3D spatial dynamics.
  • Regional sign languages like Auslan are distinct from ASL/BSL; without specific datasets, assistive technologies cannot support the 3.6 million Australians with hearing loss.
  • Current ISLR methods often fail in real-world scenarios due to occlusion and viewpoint variations, which single-view RGB datasets cannot address.
Concrete Example: A sign recognition model trained on single-view RGB data might fail to distinguish between signs that look similar from the front but differ in depth or side profile. Current datasets like WLASL provide no depth or side views to correct this ambiguity.
Key Novelty
MM-WLAuslan Dataset
  • Curates the largest Auslan dataset to date with 3,215 glosses and 282,000+ videos, significantly expanding beyond previous small-scale Auslan attempts.
  • Captures every sign simultaneously from four distinct angles (left-front, front, right-front) using two camera types (Kinect-V2, RealSense) to enable robust multi-view and cross-camera research.
  • Includes diverse testing subsets (Studio, In-the-Wild, Synthetic Background, Temporal Disturbance) to rigorously evaluate model robustness against real-world variations.
Architecture
Architecture Figure Figure 1
The multi-view, multi-modal recording setup showing the positioning of cameras relative to the signer.
Evaluation Highlights
  • Benchmarked state-of-the-art methods (e.g., MST-Net) achieve significantly lower accuracy on Cross-Camera settings compared to consistent settings, highlighting the domain gap challenge.
  • Multi-view fusion (using all 4 views) consistently outperforms single-view baselines, demonstrating the value of the dataset's multi-perspective recordings.
  • The dataset establishes a challenging benchmark where current SOTA methods struggle on 'In-the-Wild' and 'Temporal Disturbance' test sets compared to studio conditions.
Breakthrough Assessment
9/10
This is a foundational contribution for Auslan research. It fills a massive gap by providing a dataset comparable in scale to major ASL/CSL datasets but with superior multi-modal/multi-view richness.
×