Egofalls: a visual-audio dataset and benchmark for fall detection using egocentric cameras

Falls are a significant and often fatal hazard for vulnerable populations such as the elderly. Previous work has addressed fall detection using data captured by a single sensor type, such as images or accelerometers. First, we collected and published a new dataset on which we assess our proposed approach; to the best of our knowledge, it is the first public dataset of its kind, comprising 10,948 video samples from 14 subjects. Second, we extracted multimodal descriptors from videos captured by egocentric cameras, and our proposed method adds a late decision fusion layer on top of these descriptors. We conducted ablation experiments to assess the performance of individual feature extractors, the fusion of visual information alone, and the fusion of both visual and audio information, and we experimented with both internal and external cross-validation. Our results demonstrate that fusing audio and visual information through late decision fusion improves detection performance, making it a promising tool for fall prevention and mitigation.
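
To illustrate the idea of late decision fusion described above, the minimal Python sketch below combines per-clip fall probabilities produced by independently trained visual and audio classifiers through a weighted average and a decision threshold. The fusion weights, threshold, and function names are illustrative assumptions and are not taken from the paper.

import numpy as np

def late_decision_fusion(p_visual, p_audio, w_visual=0.5, w_audio=0.5, threshold=0.5):
    """Fuse per-modality fall probabilities with a weighted average (illustrative sketch)."""
    # p_visual, p_audio: per-sample fall probabilities from independently
    # trained visual and audio classifiers (values in [0, 1]).
    p_fused = w_visual * np.asarray(p_visual) + w_audio * np.asarray(p_audio)
    return (p_fused >= threshold).astype(int)  # 1 = fall, 0 = no fall

# Hypothetical probabilities for three video clips
p_visual = [0.82, 0.35, 0.60]
p_audio  = [0.74, 0.20, 0.55]
print(late_decision_fusion(p_visual, p_audio))  # -> [1 0 1]

Because each modality is scored by its own classifier before fusion, a sketch like this can fall back to a single modality when the other is unavailable, which is one common motivation for late rather than early fusion.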