Abstract
While remarkable progress has been made on supervised skeleton-based action recognition, the challenge of zero-shot recognition remains relatively unexplored. In this paper, we argue that relying solely on aligning label-level semantics and global skeleton features is insufficient to effectively transfer locally consistent visual knowledge from seen to unseen classes. To address this limitation, we introduce Part-aware Unified Representation between Language and Skeleton (PURLS) to explore visual-semantic alignment at both local and global scales. PURLS introduces a new prompting module and a novel partitioning module to generate aligned textual and visual representations across different levels. The former leverages a pre-trained GPT-3 to infer refined descriptions of the global and local (body-part-based and temporal-interval-based) movements from the original action labels. The latter employs an adaptive sampling strategy to group visual features from all body joint movements that are semantically relevant to a given description. Our approach is evaluated on various skeleton/language backbones and three large-scale datasets, i.e., NTU-RGB+D 60, NTU-RGB+D 120, and a newly curated dataset Kinetics-skeleton 200. The results showcase the universality and superior performance of PURLS, surpassing prior skeleton-based solutions and standard baselines from other domains. The source codes can be accessed at https://github.com/azzhl/PURLS.
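The adaptive sampling the abstract describes, grouping joint features by their semantic relevance to a textual description, can be illustrated in simplified form as attention pooling: each part description acts as a query that weights the joint features before they are compared with the text embedding. This is only an illustrative sketch in plain Python (all function names and dimensions are invented here, not taken from the PURLS implementation):

```python
import math
import random

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention_pool(query, joint_feats):
    """Weight each joint feature by its relevance to a text query,
    then average -- a stand-in for the adaptive grouping step."""
    weights = softmax([dot(query, f) for f in joint_feats])
    dim = len(joint_feats[0])
    return [sum(w * f[d] for w, f in zip(weights, joint_feats))
            for d in range(dim)]

def cosine(a, b):
    """Cosine similarity used for the visual-semantic alignment score."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Toy setup: 3 joint features and 2 part-description embeddings, 4-dim each.
random.seed(0)
joints = [[random.gauss(0, 1) for _ in range(4)] for _ in range(3)]
part_queries = [[random.gauss(0, 1) for _ in range(4)] for _ in range(2)]

# One pooled visual representation per description, then an alignment score.
part_reprs = [attention_pool(q, joints) for q in part_queries]
scores = [cosine(r, q) for r, q in zip(part_reprs, part_queries)]
```

In the actual method, the queries would come from GPT-3-generated part and interval descriptions encoded by a language backbone, and the joint features from a skeleton backbone; the sketch only shows the pooling-and-scoring pattern.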
| Original language | English |
|---|---|
| Title of host publication | Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition |
| Editors | Eric Mortensen |
| Place of Publication | Piscataway NJ USA |
| Publisher | IEEE, Institute of Electrical and Electronics Engineers |
| Pages | 18761-18770 |
| Number of pages | 10 |
| ISBN (Electronic) | 9798350353006 |
| ISBN (Print) | 9798350353013 |
| DOIs | |
| Publication status | Published - 2024 |
| Event | IEEE Conference on Computer Vision and Pattern Recognition 2024 - Seattle, United States of America. Duration: 17 Jun 2024 → 21 Jun 2024. https://openaccess.thecvf.com/CVPR2024 (Proceedings); https://cvpr.thecvf.com/Conferences/2024 (Website); https://ieeexplore.ieee.org/xpl/conhome/10654794/proceeding (Proceedings) |
Conference
| Conference | IEEE Conference on Computer Vision and Pattern Recognition 2024 |
|---|---|
| Abbreviated title | CVPR 2024 |
| Country/Territory | United States of America |
| City | Seattle |
| Period | 17/06/24 → 21/06/24 |
| Internet address | |
Keywords
- action recognition
- computer vision
- contrastive learning
- large language model
- representation learning
- skeleton action recognition
- zero-shot learning