MM-Fi: Multi-Modal Non-Intrusive 4D Human Dataset for Versatile Wireless Sensing

📝 Paper Summary

Human Pose Estimation (HPE), Wireless Sensing, Multi-modal Learning

MM-Fi is a large-scale, synchronized multi-modal dataset comprising RGB, depth, LiDAR, mmWave radar, and WiFi CSI data for non-intrusive 4D human sensing and action recognition.

Core Problem

Existing human sensing solutions rely on intrusive cameras (privacy concerns, lighting sensitivity) or wearable sensors (inconvenient), while current wireless datasets lack multi-modal diversity and scale.

Why it matters:

Camera-based sensing compromises privacy in homes/hospitals and fails in poor lighting.
Wearable sensors require strict user compliance, which is impractical for long-term monitoring.
Existing wireless datasets typically support fewer than three modalities, hindering the development of robust multi-modal fusion algorithms for healthcare and metaverse applications.

Concrete Example: In a dark bedroom or bathroom, a camera-based system fails to detect a fall due to poor lighting and privacy restrictions, while a WiFi-only system might lack the spatial resolution for fine-grained pose estimation. Current datasets do not provide synchronized data to train a system that fuses both signals effectively.

Key Novelty

Five-Modality Non-Intrusive Sensing

Integrates five distinct synchronized modalities (RGB, Depth, LiDAR, mmWave, WiFi CSI) into a single dataset, bridging vision and wireless sensing.
Provides 4D (spatial-temporal) labels including 3D keypoints and action categories for 27 actions across 40 subjects.
Uses a custom mobile robot platform (ROS-based) to capture aligned data in diverse environments, overcoming the challenge of synchronizing sensors that run at very different sampling rates.

Architecture

The mobile sensor platform and the five sensing modalities (RGB, Depth, LiDAR, mmWave, WiFi) capturing a human subject.

Evaluation Highlights

Dataset contains over 320k synchronized frames across 5 modalities from 40 human subjects.
Achieves high-quality ground truth annotations with a re-projection PCKh@0.5 of 95.66%.
Benchmarks demonstrate that fusing modalities (e.g., LiDAR + mmWave) significantly improves pose estimation accuracy compared to single wireless modalities.

Breakthrough Assessment

9/10

MM-Fi is the first dataset to synchronize five non-intrusive modalities (especially LiDAR, mmWave, and WiFi together), enabling new research in cross-modal supervision and robust wireless sensing.

⚙️ Technical Details

Problem Definition

Setting: Multi-modal 3D Human Pose Estimation and Action Recognition

Inputs: Synchronized frames from five modalities: RGB images, Depth images, LiDAR point clouds, mmWave radar point clouds, WiFi CSI data

Outputs: 3D Human Pose (17 keypoints), Action Category (27 classes)

Pipeline Flow

Sensor Data Acquisition (RGB, Depth, LiDAR, mmWave, WiFi)
Synchronization (ROS-based time alignment)
Ground Truth Generation (Multi-view triangulation + Optimization)
Inference / Benchmarking Models

System Modules

Data Acquisition Platform (Input Processing)

Capture raw data from 5 sensors simultaneously

Model or implementation: Custom ROS-based Mobile Platform

Synchronization Module (Input Processing)

Align data streams temporally

Model or implementation: ROS Bag & Timestamp Matching
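
The timestamp-matching idea above can be illustrated with a minimal sketch. It assumes each sensor stream is a sorted list of timestamps in seconds and pairs every reference tick with the nearest sensor sample. The function name `match_nearest` and the 25 ms default tolerance are illustrative assumptions, not taken from the MM-Fi code.

```python
from bisect import bisect_left

def match_nearest(ref_stamps, sensor_stamps, max_gap=0.025):
    """For each reference timestamp, return the index of the nearest
    sensor timestamp, or None if the gap exceeds max_gap (seconds).

    Both lists must be sorted in ascending order.
    """
    matches = []
    for t in ref_stamps:
        i = bisect_left(sensor_stamps, t)
        # Candidates: the neighbour just before and just after t.
        candidates = [j for j in (i - 1, i) if 0 <= j < len(sensor_stamps)]
        if not candidates:
            matches.append(None)
            continue
        best = min(candidates, key=lambda j: abs(sensor_stamps[j] - t))
        ok = abs(sensor_stamps[best] - t) <= max_gap
        matches.append(best if ok else None)
    return matches

# Hypothetical timestamps: a 10 Hz reference clock and a 30 Hz sensor.
ref = [0.0, 0.1, 0.2]
sensor = [k / 30 for k in range(10)]
print(match_nearest(ref, sensor))  # [0, 3, 6]
```

Reference ticks without a close enough sample come back as `None`, so a caller can drop incomplete frames instead of pairing them with stale data.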

Ground Truth Generator

Compute accurate 3D pose labels

Model or implementation: HRNet-w48 + Optimization

Novel Architectural Elements

Integration of WiFi CSI, mmWave, and LiDAR on a single synchronized mobile platform
Optimization-based auto-labeling pipeline that enforces bone-length consistency and temporal smoothness to refine vision-based triangulation

Modeling

Base Model: Various baselines tested (e.g., HRNet for RGB, PointNet++ for LiDAR/mmWave, customized CNNs for WiFi)

Training Method: Optimization-based refinement for Ground Truth

Objective Functions:

Purpose: Minimize the difference between projected 3D joints and detected 2D keypoints.

Formally: Reprojection Error in L_G

Purpose: Ensure temporal smoothness of joints.

Formally: Smoothness Loss in L_G (minimizing velocity changes)

Purpose: Enforce biological plausibility.

Formally: Bone Length Constraint (variance from average bone length)
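
As a rough illustration of the three terms above, the sketch below computes each one in plain Python. It assumes a single pinhole camera with hypothetical intrinsics (`focal`, `center`) and joints given as `(x, y, z)` tuples. The exact formulation and weighting in the paper may differ, so all function names here are illustrative.

```python
import math

def project(point, focal, center):
    """Pinhole projection of a 3D point (camera frame) to pixels."""
    x, y, z = point
    return (focal * x / z + center[0], focal * y / z + center[1])

def reprojection_loss(joints3d, keypoints2d, focal, center):
    """Mean squared pixel distance between projected joints and 2D detections."""
    total = 0.0
    for p3, p2 in zip(joints3d, keypoints2d):
        u, v = project(p3, focal, center)
        total += (u - p2[0]) ** 2 + (v - p2[1]) ** 2
    return total / len(joints3d)

def smoothness_loss(sequence):
    """Mean squared change in joint velocity between consecutive frames.

    sequence[t][j] is the 3D position of joint j at frame t.
    """
    total, count = 0.0, 0
    for t in range(1, len(sequence) - 1):
        frames = zip(sequence[t - 1], sequence[t], sequence[t + 1])
        for prev, cur, nxt in frames:
            # Second difference = (velocity after) - (velocity before).
            total += sum((n - 2 * c + p) ** 2
                         for p, c, n in zip(prev, cur, nxt))
            count += 1
    return total / count if count else 0.0

def bone_length_loss(sequence, bones):
    """Variance of each bone's length over time, averaged over bones."""
    total = 0.0
    for a, b in bones:
        lengths = [math.dist(frame[a], frame[b]) for frame in sequence]
        mean = sum(lengths) / len(lengths)
        total += sum((l - mean) ** 2 for l in lengths) / len(lengths)
    return total / len(bones)

# A rigid two-joint "bone" translating at constant velocity:
# both the smoothness and bone-length terms should be (near) zero.
seq = [[(0.0, 0.0, 2.0), (0.0, 1.0, 2.0)],
       [(0.1, 0.0, 2.0), (0.1, 1.0, 2.0)],
       [(0.2, 0.0, 2.0), (0.2, 1.0, 2.0)]]
print(smoothness_loss(seq), bone_length_loss(seq, [(0, 1)]))
```

A combined objective would typically be a weighted sum of the three terms. The weights are tuning parameters that this summary does not specify.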

Training Data:

40 subjects
27 actions (14 daily, 13 rehabilitation)
320k synchronized frames
Train/Val/Test splits provided in benchmarks (e.g., Cross-subject, Cross-environment)
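
A cross-subject split holds out entire subjects so that no person appears in both training and test data. The sketch below shows the idea under an assumed record schema (dicts with a `subject` key). The held-out subject IDs are arbitrary placeholders, not the official MM-Fi split.

```python
def cross_subject_split(samples, test_subjects):
    """Split samples so that no subject appears on both sides.

    samples: iterable of dicts with a 'subject' key (hypothetical schema).
    test_subjects: set of subject IDs held out for testing.
    """
    train, test = [], []
    for s in samples:
        (test if s["subject"] in test_subjects else train).append(s)
    return train, test

# Hypothetical index: one entry per (subject, action) sequence.
samples = [{"subject": f"S{i:02d}", "action": f"A{a:02d}"}
           for i in range(1, 41) for a in range(1, 28)]
held_out = {f"S{i:02d}" for i in range(33, 41)}  # arbitrary choice
train, test = cross_subject_split(samples, held_out)
print(len(train), len(test))  # 864 216
```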

Key Hyperparameters:

sampling_rate: 10Hz (global dataset frame rate)
wifi_sampling_rate: 100Hz (interpolated)
mmwave_aggregation_window: 0.5s
lidar_resolution: 32-channel
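
To illustrate what the `mmwave_aggregation_window` setting above implies, here is a minimal sketch that merges sparse radar frames over a trailing 0.5 s window. Whether the dataset uses a trailing or a centered window is not stated here, so the trailing choice and the class name `PointCloudAggregator` are assumptions.

```python
from collections import deque

class PointCloudAggregator:
    """Merge sparse radar frames that fall inside a trailing time window."""

    def __init__(self, window=0.5):
        self.window = window
        self.frames = deque()  # entries are (timestamp, list of points)

    def add(self, stamp, points):
        """Add a frame; return all points from the last `window` seconds."""
        self.frames.append((stamp, points))
        # Drop frames older than the window relative to the newest frame.
        while stamp - self.frames[0][0] > self.window:
            self.frames.popleft()
        return [p for _, pts in self.frames for p in pts]

# Hypothetical stream: one point per frame, one frame every 0.2 s.
agg = PointCloudAggregator(window=0.5)
for k in range(4):
    merged = agg.add(0.2 * k, [(float(k), 0.0, 1.0)])
    print(len(merged))  # prints 1, 2, 3, 3
```

Aggregation trades temporal precision for density: a longer window yields more points per frame but blurs fast motion.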

Compute: Not reported in the paper

Comparison to Prior Work

vs. NTU RGB+D: MM-Fi adds LiDAR, mmWave, and WiFi modalities, enabling non-intrusive wireless sensing research
vs. HuMMan: MM-Fi includes WiFi and mmWave radar, focusing on device-free wireless sensing rather than just vision/LiDAR
vs. Waymo: MM-Fi focuses on indoor human activities with fine-grained pose labels, not outdoor driving scenes
vs. WiPose: MM-Fi includes LiDAR and mmWave in addition to WiFi, and offers 3D annotations validated by multi-view vision [not cited in paper]

Limitations

mmWave point clouds are sparse (frames must be aggregated over a 0.5 s window)
WiFi CSI data has lower spatial resolution compared to LiDAR/Vision
Collected in controlled lab environments; generalization to highly cluttered unconstrained homes requires testing
Ground truth relies on vision-based triangulation, which may still suffer from heavy occlusion (though optimization mitigates this)

Reproducibility

Code: https://github.com/MM-Fi/MM-Fi

The dataset is publicly available. Code for data loading and benchmarks is provided, along with raw data (bags) and processed numpy files.

📊 Experiments & Results

Evaluation Setup

3D Human Pose Estimation (MPJPE) and Action Recognition (Accuracy) across different modalities and splits

Benchmarks:

MM-Fi Benchmark (3D Human Pose Estimation) [New]
MM-Fi Benchmark (Action Recognition) [New]

Metrics:

MPJPE (Mean Per Joint Position Error) in mm
PCKh@0.5 (Percentage of Correct Keypoints)
Accuracy (for action recognition)
Statistical methodology: Not explicitly reported in the paper
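
The two pose metrics listed above can be sketched in a few lines. This is a minimal illustration using the standard definitions, with joints as coordinate tuples and a caller-supplied `head_length`. It is not the benchmark's own evaluation code.

```python
import math

def mpjpe(pred, gt):
    """Mean Euclidean distance between predicted and ground-truth joints."""
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(gt)

def pckh(pred, gt, head_length, threshold=0.5):
    """Fraction of joints within threshold * head_length of the ground truth."""
    limit = threshold * head_length
    hits = sum(math.dist(p, g) <= limit for p, g in zip(pred, gt))
    return hits / len(gt)

# Hypothetical two-joint example, coordinates in mm.
gt = [(0.0, 0.0, 0.0), (0.0, 100.0, 0.0)]
pred = [(30.0, 40.0, 0.0), (0.0, 100.0, 120.0)]
print(mpjpe(pred, gt))                    # 85.0
print(pckh(pred, gt, head_length=200.0))  # 0.5
```

In the example the joint errors are 50 mm and 120 mm, so MPJPE is 85 mm, and only the first joint falls within the 100 mm PCKh limit.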

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
MM-Fi Annotation Quality	PCKh@0.5	Not reported in the paper	95.66	Not reported in the paper
3D Position Annotation	Average Error (mm)	0	50	50

Pose estimation and action recognition benchmarks are provided to establish baselines. Comparison against external SOTA methods is not the primary focus, since the paper centers on the dataset release, but it does report annotation validation metrics.

Experiment Figures

A sample synchronized frame from MM-Fi showing all 5 modalities aligned.

Main Takeaways

Vision modalities (RGB, Depth) provide the most accurate pose estimation but are intrusive.
LiDAR provides high-fidelity 3D geometry robust to lighting, superior to mmWave/WiFi but expensive.
Wireless modalities (mmWave, WiFi) are privacy-preserving but suffer from sparsity (mmWave) and low resolution (WiFi); fusion with other modalities shows promise.
The dataset successfully synchronizes disparate sensors (20Hz LiDAR to 1000Hz WiFi) to a common 10Hz frame rate with <25ms error.

📚 Prerequisite Knowledge

Prerequisites

Principles of wireless signal propagation (WiFi CSI, mmWave)
Point cloud processing (LiDAR, Radar)
3D Geometry and Camera Calibration
Human Pose Estimation (Skeleton-based)

Key Terms

CSI: Channel State Information—fine-grained WiFi signal properties describing how signals propagate from transmitter to receiver, used here to detect human motion

mmWave: Millimeter Wave radar—uses short-wavelength electromagnetic waves to detect object distance, velocity, and angle, offering medium-resolution point clouds

LiDAR: Light Detection and Ranging—uses laser pulses to create high-resolution 3D point clouds of the environment

FMCW: Frequency Modulated Continuous Wave—a radar technique that varies transmission frequency to measure distance and velocity

ROS: Robot Operating System—middleware used here to synchronize data streams from multiple sensors on the mobile platform

PCKh@0.5: Percentage of Correct Keypoints (head-normalized)—a metric counting a predicted joint as correct if it falls within 50% of the head segment length from the ground truth

MPJPE: Mean Per Joint Position Error—the average Euclidean distance between predicted and ground truth joint positions

HRNet: High-Resolution Network—a deep learning architecture used here to extract initial 2D keypoints from images for ground truth generation