Abstract
The task of action detection aims to deduce both the action category and the start and end moments of each action instance in a long, untrimmed video. While vision Transformers have driven recent advances in video understanding, it is non-trivial to design an efficient architecture for action detection because self-attention over a long sequence of video clips is prohibitively expensive. To this end, we present an efficient hierarchical Spatio-Temporal Pyramid Transformer (STPT) for action detection, building upon the fact that the early self-attention layers in Transformers still focus on local patterns. Specifically, we propose to use local window attention to encode rich local spatio-temporal representations in the early stages while applying global attention modules to capture long-term space-time dependencies in the later stages. In this way, our STPT can encode both locality and dependency with largely reduced redundancy, delivering a promising trade-off between accuracy and efficiency. For example, with only RGB input, the proposed STPT achieves 53.6% mAP on THUMOS14, surpassing the I3D+AFSD RGB model by over 10% and performing favorably against the state-of-the-art AFSD that uses additional flow features, with 31% fewer GFLOPs. STPT thus serves as an effective and efficient end-to-end Transformer-based framework for action detection. Code is available at https://github.com/ziplab/STPT.
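
To make the efficiency argument concrete, the following is a minimal sketch of why restricting early-stage attention to local windows reduces cost. It only counts query-key pairs; it is not the STPT implementation. The token grid, the window size, the use of non-overlapping windows, and both function names are assumptions chosen for illustration and are not taken from the paper.

```python
def global_attention_pairs(num_tokens: int) -> int:
    """Query-key pairs scored when every token attends to every token."""
    return num_tokens * num_tokens


def window_attention_pairs(num_tokens: int, window_size: int) -> int:
    """Query-key pairs scored when tokens attend only inside
    non-overlapping windows of `window_size` tokens (an assumption
    made here for simplicity)."""
    if num_tokens % window_size != 0:
        raise ValueError("num_tokens must be divisible by window_size")
    num_windows = num_tokens // window_size
    return num_windows * window_size * window_size


# Hypothetical early-stage token grid: 16 frames x 28 x 28 patches.
early_tokens = 16 * 28 * 28
# Hypothetical local window: 4 frames x 7 x 7 patches.
window = 4 * 7 * 7

full = global_attention_pairs(early_tokens)
local = window_attention_pairs(early_tokens, window)
print(full // local)  # 64: global attention scores 64x more pairs here
```

Under these assumed sizes the saving equals the number of windows, since global cost grows with the square of the token count while windowed cost grows linearly. In later stages, where a hierarchical model has fewer tokens, global attention becomes correspondingly cheaper.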
| Original language | English |
|---|---|
| Title of host publication | 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXIV |
| Editors | Shai Avidan, Gabriel Brostow, Moustapha Cissé, Giovanni Maria Farinella, Tal Hassner |
| Place of Publication | Cham, Switzerland |
| Publisher | Springer |
| Pages | 358-375 |
| Number of pages | 18 |
| ISBN (Electronic) | 9783031198304 |
| ISBN (Print) | 9783031198298 |
| DOIs | |
| Publication status | Published - 2022 |
| Event | European Conference on Computer Vision 2022, Tel Aviv, Israel; Duration: 23 Oct 2022 → 27 Oct 2022; Conference number: 17th; Proceedings: https://link.springer.com/book/10.1007/978-3-031-19830-4; Website: https://eccv2022.ecva.net |
Conference
| Conference | European Conference on Computer Vision 2022 |
|---|---|
| Abbreviated title | ECCV 2022 |
| Country/Territory | Israel |
| City | Tel Aviv |
| Period | 23/10/22 → 27/10/22 |