ELEGANT: End-to-end language grounded speech denoiser for efficient generation of talking face

Research output: Chapter in Book/Report/Conference proceeding › Conference Paper › Research › peer-review


Existing speech-driven talking face generation methods (a.k.a. Speech2Face models) produce realistic-looking talking avatars. Yet, they are not appropriate if (1) the input speech signals come from the wild and may contain background noise; or (2) the input signal contains hate speech. Given in-the-wild audio, Speech2Face models produce poor lip-sync, unwanted facial movements, and sudden jitters in head movement. Moreover, Speech2Face models perform no language-level reasoning on the input speech, which could enable malicious users to translate hateful speech into a synthetic talking face, posing internet, social, and political threats. In this paper, we address both objectives at once. To the best of our knowledge, our method ELEGANT is the first Speech2Face generative model that performs language grounding on the input speech, eliminating the transfer of spurious features originating from audio noise. Subsequently, the text embedding is associated with the speech style and passed to a generative model to learn the phoneme-viseme correspondence. In this way, our proposed ELEGANT model uses the text embedding to suppress both negative and hateful words and audio-specific noise, since the noise-to-phoneme mapping is random. Our experiments show that adopting speech denoising through text grounding eliminates the transfer of spurious features originating from audio noise to the vision domain. Consequently, a good phoneme-viseme correspondence leads to SSIM and PSNR scores comparable to state-of-the-art methods.
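The abstract evaluates generated frames with SSIM and PSNR. As background for readers unfamiliar with the latter metric, here is a minimal pure-Python sketch of the standard PSNR definition (10 · log10(MAX² / MSE)); this illustrates the metric itself, not the authors' evaluation code, and the toy pixel values are hypothetical:

```python
import math

def psnr(reference, generated, max_val=255.0):
    """Peak signal-to-noise ratio between two equal-length flat
    pixel sequences: 10 * log10(MAX^2 / MSE). Higher is better."""
    mse = sum((r - g) ** 2 for r, g in zip(reference, generated)) / len(reference)
    if mse == 0:
        return float("inf")  # identical frames
    return 10 * math.log10(max_val ** 2 / mse)

# Toy example: two tiny 2x2 "frames" flattened to pixel lists
ref = [52, 55, 61, 59]
gen = [50, 55, 60, 58]
print(round(psnr(ref, gen), 2))  # → 46.37
```

SSIM additionally compares local luminance, contrast, and structure between windows of the two images, which is why the paper reports both scores: PSNR captures raw pixel error, while SSIM correlates better with perceived visual quality.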

Original language: English
Title of host publication: 2023 Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)
Editors: Ting-Lan Lin, Yoshinobu Kajikawa, Zhaoxia Yin
Place of Publication: Piscataway, NJ, USA
Publisher: IEEE, Institute of Electrical and Electronics Engineers
Number of pages: 6
ISBN (Electronic): 9798350300673
ISBN (Print): 9798350300680
Publication status: Published - 2023
Event: Annual Summit and Conference of the Asia-Pacific Signal and Information Processing Association (APSIPA) 2023 - Taipei, Taiwan
Duration: 31 Oct 2023 - 3 Nov 2023
Conference number: 15th
https://www.apsipa2023.org/ (Website)
https://ieeexplore.ieee.org/xpl/conhome/10317071/proceeding (Proceedings)


Conference: Annual Summit and Conference of the Asia-Pacific Signal and Information Processing Association (APSIPA) 2023
Abbreviated title: APSIPA 2023
