Part-Aware Unified Representation of Language and Skeleton for Zero-Shot Action Recognition

Anqi Zhu, Qiuhong Ke, Mingming Gong, James Bailey

Research output: Chapter in Book/Report/Conference proceeding › Conference paper › Research › peer-review

18 Citations (Scopus)

Abstract

While remarkable progress has been made on supervised skeleton-based action recognition, the challenge of zero-shot recognition remains relatively unexplored. In this paper, we argue that relying solely on aligning label-level semantics and global skeleton features is insufficient to effectively transfer locally consistent visual knowledge from seen to unseen classes. To address this limitation, we introduce Part-aware Unified Representation between Language and Skeleton (PURLS) to explore visual-semantic alignment at both local and global scales. PURLS introduces a new prompting module and a novel partitioning module to generate aligned textual and visual representations across different levels. The former leverages a pre-trained GPT-3 to infer refined descriptions of the global and local (body-part-based and temporal-interval-based) movements from the original action labels. The latter employs an adaptive sampling strategy to group visual features from all body joint movements that are semantically relevant to a given description. Our approach is evaluated on various skeleton/language backbones and three large-scale datasets, i.e., NTU-RGB+D 60, NTU-RGB+D 120, and a newly curated dataset Kinetics-skeleton 200. The results showcase the universality and superior performance of PURLS, surpassing prior skeleton-based solutions and standard baselines from other domains. The source codes can be accessed at https://github.com/azzhl/PURLS.
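The zero-shot mechanism the abstract relies on, classifying a skeleton sequence by matching its visual embedding against text embeddings of class descriptions, can be illustrated with a generic cosine-similarity sketch. The vectors, dimensions, and names below are toy assumptions for illustration only, not the paper's PURLS implementation (which aligns at both local and global scales with learned backbones):

```python
import numpy as np

def cosine_sim(vec, mat):
    # Cosine similarity between a query vector and each row of a matrix.
    vec = vec / np.linalg.norm(vec)
    mat = mat / np.linalg.norm(mat, axis=1, keepdims=True)
    return mat @ vec

def zero_shot_classify(skeleton_emb, label_embs, label_names):
    # Predict the (possibly unseen) class whose text embedding lies
    # closest to the skeleton sequence's visual embedding.
    sims = cosine_sim(skeleton_emb, label_embs)
    return label_names[int(np.argmax(sims))], sims

# Toy embeddings (hypothetical; a real system would obtain the text
# vectors from a language model and the visual vector from a skeleton
# encoder trained on seen classes).
labels = ["jumping", "waving"]
label_embs = np.array([[1.0, 0.0, 0.2],
                       [0.1, 1.0, 0.0]])
skeleton_emb = np.array([0.9, 0.1, 0.3])  # closer to the "jumping" text vector

pred, sims = zero_shot_classify(skeleton_emb, label_embs, labels)
print(pred)  # -> jumping
```

The part-aware idea in the abstract extends this by computing such alignments per body part and temporal interval as well as globally, rather than with a single global pair of embeddings.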

Original language: English
Title of host publication: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
Editors: Eric Mortensen
Place of publication: Piscataway NJ USA
Publisher: IEEE, Institute of Electrical and Electronics Engineers
Pages: 18761-18770
Number of pages: 10
ISBN (Electronic): 9798350353006
ISBN (Print): 9798350353013
DOIs
Publication status: Published - 2024
Event: IEEE Conference on Computer Vision and Pattern Recognition 2024 - Seattle, United States of America
Duration: 17 Jun 2024 - 21 Jun 2024
https://openaccess.thecvf.com/CVPR2024 (Proceedings)
https://cvpr.thecvf.com/Conferences/2024 (Website)
https://ieeexplore.ieee.org/xpl/conhome/10654794/proceeding (Proceedings)

Conference

Conference: IEEE Conference on Computer Vision and Pattern Recognition 2024
Abbreviated title: CVPR 2024
Country/Territory: United States of America
City: Seattle
Period: 17/06/24 - 21/06/24

Keywords

  • action recognition
  • computer vision
  • contrastive learning
  • large language model
  • representation learning
  • skeleton action recognition
  • zero-shot learning
