TY - JOUR
T1 - DeepActsNet: a deep ensemble framework combining features from face, hands, and body for action recognition
AU - Asif, Umar
AU - Mehta, Deval
AU - Von Cavallar, Stefan
AU - Tang, Jianbin
AU - Harrer, Stefan
N1 - Publisher Copyright:
© 2023
PY - 2023/7
Y1 - 2023/7
N2 - Human action recognition from videos has gained substantial attention due to its wide applications in video understanding. Most existing approaches extract human skeleton data from videos to encode actions because of the invariance of skeleton information to lighting conditions and background changes. Despite their success in achieving high recognition accuracy, methods based on a limited set of body joints fail to capture the nuances of subtle body parts that are highly relevant for discriminating similar actions. In this paper, we overcome this limitation by presenting a holistic framework that combines spatial and motion features from the body, face, and hands to develop a novel data representation, termed “Deep Action Stamps (DeepActs)”, for video-based action recognition. Compared to skeleton sequences based on limited body joints, DeepActs encode more effective spatio-temporal features that provide robustness against pose estimation noise and improve action recognition accuracy. We also present “DeepActsNet”, a deep learning-based ensemble model which learns convolutional and structural features from Deep Action Stamps for highly accurate action recognition. Experiments on three challenging action recognition datasets (NTU60, NTU120, and SYSU) show that the proposed model achieves significant improvements in action recognition accuracy at a lower computational cost than state-of-the-art methods.
AB - Human action recognition from videos has gained substantial attention due to its wide applications in video understanding. Most existing approaches extract human skeleton data from videos to encode actions because of the invariance of skeleton information to lighting conditions and background changes. Despite their success in achieving high recognition accuracy, methods based on a limited set of body joints fail to capture the nuances of subtle body parts that are highly relevant for discriminating similar actions. In this paper, we overcome this limitation by presenting a holistic framework that combines spatial and motion features from the body, face, and hands to develop a novel data representation, termed “Deep Action Stamps (DeepActs)”, for video-based action recognition. Compared to skeleton sequences based on limited body joints, DeepActs encode more effective spatio-temporal features that provide robustness against pose estimation noise and improve action recognition accuracy. We also present “DeepActsNet”, a deep learning-based ensemble model which learns convolutional and structural features from Deep Action Stamps for highly accurate action recognition. Experiments on three challenging action recognition datasets (NTU60, NTU120, and SYSU) show that the proposed model achieves significant improvements in action recognition accuracy at a lower computational cost than state-of-the-art methods.
KW - Activity recognition
KW - Convolutional neural networks
KW - Deep learning
UR - http://www.scopus.com/inward/record.url?scp=85150253792&partnerID=8YFLogxK
U2 - 10.1016/j.patcog.2023.109484
DO - 10.1016/j.patcog.2023.109484
M3 - Article
AN - SCOPUS:85150253792
SN - 0031-3203
VL - 139
JO - Pattern Recognition
JF - Pattern Recognition
M1 - 109484
ER -