ELEGANT: End-to-end language grounded speech denoiser for efficient generation of talking face

Research output: Chapter in Book/Report/Conference proceeding › Conference Paper › Research › peer-review


Existing speech-driven talking face generation methods (a.k.a. Speech2Face models) produce realistic-looking talking avatars. Yet, they are not appropriate if (1) the input speech signals come from the wild and may contain background noise; or (2) the input signal contains hate speech. Given in-the-wild audio, Speech2Face models produce poor lip-sync, unwanted facial movements, and sudden jitters in head movement. Moreover, Speech2Face models perform no language-level reasoning on the input speech, which could enable malicious users to translate hateful speech into a synthetic talking face, posing internet, social, and political threats. In this paper, we address both objectives at once. To the best of our knowledge, our method ELEGANT is the first Speech2Face generative model that performs language grounding on the input speech, eliminating the transfer of spurious features originating from audio noise. Subsequently, the text embedding is associated with the speech style and passed to a generative model to learn the phoneme-viseme correspondence. In this way, our proposed ELEGANT model uses the text embedding to suppress both negative and hateful words and audio-specific noise, since the noise-to-phoneme mapping is random. Our experiments show that adopting speech denoising through text grounding eliminates the transfer of spurious features originating from audio noise to the vision domain. Consequently, a good phoneme-viseme correspondence leads to SSIM and PSNR scores comparable to state-of-the-art methods.
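The abstract evaluates generated frames with SSIM and PSNR. As background for readers unfamiliar with the latter metric, here is a minimal pure-Python sketch of the standard PSNR definition (10 · log10(MAX² / MSE)); this illustrates the metric itself, not the authors' evaluation code, and the toy pixel values are hypothetical:

```python
import math

def psnr(reference, generated, max_val=255.0):
    """Peak signal-to-noise ratio between two equal-length flat
    pixel sequences: 10 * log10(MAX^2 / MSE). Higher is better."""
    mse = sum((r - g) ** 2 for r, g in zip(reference, generated)) / len(reference)
    if mse == 0:
        return float("inf")  # identical frames
    return 10 * math.log10(max_val ** 2 / mse)

# Toy example: two tiny 2x2 "frames" flattened to pixel lists
ref = [52, 55, 61, 59]
gen = [50, 55, 60, 58]
print(round(psnr(ref, gen), 2))  # → 46.37
```

SSIM additionally compares local luminance, contrast, and structure between windows of the two images, which is why the paper reports both scores: PSNR captures raw pixel error, while SSIM correlates better with perceived visual quality.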

Original language: English
Title of host publication: 2023 Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)
Editors: Ting-Lan Lin, Yoshinobu Kajikawa, Zhaoxia Yin
Place of Publication: Piscataway, NJ, USA
Publisher: IEEE, Institute of Electrical and Electronics Engineers
Number of pages: 6
ISBN (Electronic): 9798350300673
ISBN (Print): 9798350300680
Publication status: Published - 2023
Event: Annual Summit and Conference of the Asia-Pacific Signal and Information Processing Association (APSIPA) 2023 - Taipei, Taiwan
Duration: 31 Oct 2023 - 3 Nov 2023
Conference number: 15th
https://www.apsipa2023.org/ (Website)
https://ieeexplore.ieee.org/xpl/conhome/10317071/proceeding (Proceedings)


Conference: Annual Summit and Conference of the Asia-Pacific Signal and Information Processing Association (APSIPA) 2023
Abbreviated title: APSIPA 2023
