TY - JOUR
T1 - A pairwise attentive adversarial spatiotemporal network for cross-domain few-shot action recognition-r2
AU - Gao, Zan
AU - Guo, Leming
AU - Guan, Weili
AU - Liu, An An
AU - Ren, Tongwei
AU - Chen, Shengyong
N1 - Funding Information:
Manuscript received January 5, 2020; revised August 28, 2020 and October 19, 2020; accepted November 3, 2020. Date of publication November 24, 2020; date of current version December 4, 2020. This work was supported in part by the National Natural Science Foundation of China under Grant 61872270 and Grant 62020106004, in part by the Young Creative Team in universities of Shandong Province under Grant 2020KJN012, in part by the Jinan 20 projects in universities under Grant 2018GXRC014, and in part by the Tianjin New Generation Artificial Intelligence Major Program under Grant 18ZXZNGX00150 and Grant 19ZXZNGX00110. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Lucio Marcenaro. (Corresponding author: Leming Guo.) Zan Gao is with the Shandong Artificial Intelligence Institute, Qilu University of Technology (Shandong Academy of Sciences), Jinan 250014, China, and also with the Shandong Computer Science Center (National Supercomputer Center in Jinan), Jinan 250014, China.
Publisher Copyright:
© 1992-2012 IEEE.
PY - 2021/11/24
Y1 - 2021/11/24
N2 - Action recognition is a popular research topic in the computer vision and machine learning domains. Although many action recognition methods have been proposed, only a few researchers have focused on cross-domain few-shot action recognition, which must often be performed in real security surveillance. Since the problems of action recognition, domain adaptation, and few-shot learning need to be simultaneously solved, the cross-domain few-shot action recognition task is a challenging problem. To solve these issues, in this work, we develop a novel end-to-end pairwise attentive adversarial spatiotemporal network (PASTN) to perform the cross-domain few-shot action recognition task, in which spatiotemporal information acquisition, few-shot learning, and video domain adaptation are realised in a unified framework. Specifically, the ResNet-50 network is selected as the backbone of the PASTN, and a 3D convolution block is embedded in the top layer of the 2D CNN (ResNet-50) to capture the spatiotemporal representations. Moreover, a novel attentive adversarial network architecture is designed to align the spatiotemporal dynamics of actions with higher domain discrepancies. In addition, the pairwise margin discrimination loss is designed for the pairwise network architecture to improve the discrimination of the learned domain-invariant spatiotemporal feature. The results of extensive experiments performed on three public benchmarks of the cross-domain action recognition datasets, including SDAI Action I, SDAI Action II and UCF50-OlympicSport, demonstrate that the proposed PASTN can significantly outperform the state-of-the-art cross-domain action recognition methods in terms of both the accuracy and computational time. Even when only two labelled training samples per category are considered in the office1 scenario of the SDAI Action I dataset, the accuracy of the PASTN is improved by 6.1%, 10.9%, 16.8%, and 14% compared to that of the $TA^{3}N$, TemporalPooling, I3D, and P3D methods, respectively.
AB - Action recognition is a popular research topic in the computer vision and machine learning domains. Although many action recognition methods have been proposed, only a few researchers have focused on cross-domain few-shot action recognition, which must often be performed in real security surveillance. Since the problems of action recognition, domain adaptation, and few-shot learning need to be simultaneously solved, the cross-domain few-shot action recognition task is a challenging problem. To solve these issues, in this work, we develop a novel end-to-end pairwise attentive adversarial spatiotemporal network (PASTN) to perform the cross-domain few-shot action recognition task, in which spatiotemporal information acquisition, few-shot learning, and video domain adaptation are realised in a unified framework. Specifically, the ResNet-50 network is selected as the backbone of the PASTN, and a 3D convolution block is embedded in the top layer of the 2D CNN (ResNet-50) to capture the spatiotemporal representations. Moreover, a novel attentive adversarial network architecture is designed to align the spatiotemporal dynamics of actions with higher domain discrepancies. In addition, the pairwise margin discrimination loss is designed for the pairwise network architecture to improve the discrimination of the learned domain-invariant spatiotemporal feature. The results of extensive experiments performed on three public benchmarks of the cross-domain action recognition datasets, including SDAI Action I, SDAI Action II and UCF50-OlympicSport, demonstrate that the proposed PASTN can significantly outperform the state-of-the-art cross-domain action recognition methods in terms of both the accuracy and computational time. Even when only two labelled training samples per category are considered in the office1 scenario of the SDAI Action I dataset, the accuracy of the PASTN is improved by 6.1%, 10.9%, 16.8%, and 14% compared to that of the $TA^{3}N$, TemporalPooling, I3D, and P3D methods, respectively.
KW - action recognition
KW - attentive adversarial network
KW - cross-domain learning
KW - few-shot
KW - pairwise margin discrimination loss
KW - TR3D
UR - http://www.scopus.com/inward/record.url?scp=85097183116&partnerID=8YFLogxK
U2 - 10.1109/TIP.2020.3038372
DO - 10.1109/TIP.2020.3038372
M3 - Article
C2 - 33232234
AN - SCOPUS:85097183116
VL - 30
SP - 767
EP - 782
JO - IEEE Transactions on Image Processing
JF - IEEE Transactions on Image Processing
SN - 1057-7149
ER -