Abstract
Action detection aims to determine both the category and the start and end times of each action instance in a long, untrimmed video. While vision Transformers have driven recent advances in video understanding, designing an efficient architecture for action detection is non-trivial because self-attention over a long sequence of video clips is prohibitively expensive. To this end, we present an efficient hierarchical Spatio-Temporal Pyramid Transformer (STPT) for action detection, building on the observation that the early self-attention layers in Transformers still focus on local patterns. Specifically, we propose to use local window attention to encode rich local spatio-temporal representations in the early stages, while applying global attention modules to capture long-term space-time dependencies in the later stages. In this way, STPT encodes both locality and long-term dependency with greatly reduced redundancy, delivering a promising trade-off between accuracy and efficiency. For example, with only RGB input, the proposed STPT achieves 53.6% mAP on THUMOS14, surpassing the I3D+AFSD RGB model by over 10% and performing favorably against the state-of-the-art AFSD, which uses additional flow features, with 31% fewer GFLOPs. STPT thus serves as an effective and efficient end-to-end Transformer-based framework for action detection. Code is available at https://github.com/ziplab/STPT.
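To make the efficiency argument concrete, the following minimal sketch counts query-key pairs for global attention versus attention restricted to local windows over a spatio-temporal token grid. It is an illustration only, not the STPT implementation: the function names, the grid size, and the window size are hypothetical, and it assumes non-overlapping windows that divide the grid evenly, which may differ from the windowing used in the paper.

```python
def global_attention_pairs(t, h, w):
    """Query-key pairs when every token attends to every other token."""
    n = t * h * w  # total tokens in the T x H x W grid
    return n * n


def window_attention_pairs(t, h, w, wt, wh, ww):
    """Query-key pairs when attention stays inside non-overlapping windows.

    Assumption (not from the paper): windows of size (wt, wh, ww) tile the
    grid exactly, so each token attends only to tokens in its own window.
    """
    if t % wt or h % wh or w % ww:
        raise ValueError("window size must divide the grid evenly")
    num_windows = (t // wt) * (h // wh) * (w // ww)
    tokens_per_window = wt * wh * ww
    return num_windows * tokens_per_window ** 2


if __name__ == "__main__":
    # Hypothetical grid: 8 frames of 14 x 14 tokens, with 2 x 7 x 7 windows.
    full = global_attention_pairs(8, 14, 14)
    local = window_attention_pairs(8, 14, 14, 2, 7, 7)
    print(full, local, full // local)
```

Under these assumed sizes, the pair count falls by a factor equal to the number of windows (total tokens divided by tokens per window), which is why window attention is attractive in early stages where the token grid is largest.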
Original language | English |
---|---|
Title of host publication | 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXIV |
Editors | Shai Avidan, Gabriel Brostow, Moustapha Cissé, Giovanni Maria Farinella, Tal Hassner |
Place of Publication | Cham, Switzerland |
Publisher | Springer |
Pages | 358-375 |
Number of pages | 18 |
ISBN (Electronic) | 9783031198304 |
ISBN (Print) | 9783031198298 |
Publication status | Published - 2022 |
Event | European Conference on Computer Vision 2022, Tel Aviv, Israel; Duration: 23 Oct 2022 → 27 Oct 2022; Conference number: 17th; Proceedings: https://link.springer.com/book/10.1007/978-3-031-19830-4; Website: https://eccv2022.ecva.net |
Conference
Conference | European Conference on Computer Vision 2022 |
---|---|
Abbreviated title | ECCV 2022 |
Country/Territory | Israel |
City | Tel Aviv |
Period | 23/10/22 → 27/10/22 |