An efficient Spatio-Temporal Pyramid Transformer for action detection

Yuetian Weng, Felix Pan, Mingfei Han, Xiaojun Chang, Bohan Zhuang

Research output: Chapter in Book/Report/Conference proceeding › Conference Paper › Research › peer-review

Abstract

Action detection aims to determine both the action category and the start and end moments of each action instance in a long, untrimmed video. While vision Transformers have driven recent advances in video understanding, designing an efficient architecture for action detection is non-trivial because self-attention over a long sequence of video clips is prohibitively expensive. To this end, we present an efficient hierarchical Spatio-Temporal Pyramid Transformer (STPT) for action detection, building upon the fact that the early self-attention layers in Transformers still focus on local patterns. Specifically, we propose to use local window attention to encode rich local spatio-temporal representations in the early stages while applying global attention modules to capture long-term space-time dependencies in the later stages. In this way, our STPT can encode both locality and dependency with largely reduced redundancy, delivering a promising trade-off between accuracy and efficiency. For example, with only RGB input, the proposed STPT achieves 53.6% mAP on THUMOS14, surpassing the I3D+AFSD RGB model by over 10% and performing favorably against the state-of-the-art AFSD, which uses additional flow features, while requiring 31% fewer GFLOPs. STPT thus serves as an effective and efficient end-to-end Transformer-based framework for action detection. Code is available at https://github.com/ziplab/STPT.
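
The efficiency argument in the abstract can be illustrated with a rough count of query-key pairs scored by each attention type. The sketch below is not taken from the STPT code base: the function names, the per-stage token counts, the window size, and the number of local stages are all hypothetical values chosen for illustration. It only shows why restricting early, high-resolution stages to local windows reduces cost compared with global attention everywhere.

```python
def global_attention_pairs(num_tokens: int) -> int:
    """Query-key pairs scored when every token attends to every token."""
    return num_tokens * num_tokens


def window_attention_pairs(num_tokens: int, window_size: int) -> int:
    """Query-key pairs scored when tokens attend only inside
    non-overlapping windows of `window_size` tokens."""
    if num_tokens % window_size != 0:
        raise ValueError("num_tokens must be divisible by window_size")
    num_windows = num_tokens // window_size
    return num_windows * window_size * window_size


def pyramid_attention_pairs(stage_tokens, window_size, num_local_stages):
    """Total pairs for a pyramid that uses local window attention in the
    first `num_local_stages` stages and global attention afterwards."""
    total = 0
    for stage_index, num_tokens in enumerate(stage_tokens):
        if stage_index < num_local_stages:
            total += window_attention_pairs(num_tokens, window_size)
        else:
            total += global_attention_pairs(num_tokens)
    return total


# Hypothetical token counts per stage (assumption, not from the paper):
# the pyramid downsamples, so later stages hold fewer tokens.
stage_tokens = [4096, 1024, 256, 64]

all_global = sum(global_attention_pairs(n) for n in stage_tokens)
mixed = pyramid_attention_pairs(stage_tokens, window_size=64, num_local_stages=2)

print("global attention in every stage:", all_global)
print("local early, global late:", mixed)
```

Under these assumed numbers, almost all of the cost of the all-global variant comes from the first stage, which is where the local windows apply. The count ignores channel width, multiple heads, and feed-forward layers, so it should not be read as a GFLOPs estimate.
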
Original language: English
Title of host publication: 17th European Conference Tel Aviv, Israel, October 23–27, 2022 Proceedings, Part XXXIV
Editors: Shai Avidan, Gabriel Brostow, Moustapha Cissé, Giovanni Maria Farinella, Tal Hassner
Place of Publication: Cham Switzerland
Publisher: Springer
Pages: 358-375
Number of pages: 18
ISBN (Electronic): 9783031198304
ISBN (Print): 9783031198298
DOIs
Publication status: Published - 2022
Event: European Conference on Computer Vision 2022 - Tel Aviv, Israel
Duration: 23 Oct 2022 → 27 Oct 2022
Conference number: 17th
https://link.springer.com/book/10.1007/978-3-031-19830-4 (Proceedings)
https://eccv2022.ecva.net (Website)

Conference

Conference: European Conference on Computer Vision 2022
Abbreviated title: ECCV 2022
Country/Territory: Israel
City: Tel Aviv
Period: 23/10/22 → 27/10/22
Internet address
