Caption alignment for low resource audio-visual data

Vighnesh Reddy Konda, Mayur Warialani, Rakesh Prasanth Achari, Varad Bhatnagar, Jayaprakash Akula, Preethi Jyothi, Ganesh Ramakrishnan, Gholamreza Haffari, Pankaj Singh

Research output: Chapter in Book/Report/Conference proceeding › Conference Paper › Research › peer-review


Understanding videos via captioning has gained a lot of traction recently. While captions are provided alongside videos, information about where each caption aligns within the video is missing, even though it could be particularly useful for indexing and retrieval. Existing work on learning to infer alignments has mostly exploited visual features and ignored the audio signal. Video understanding applications often underestimate the importance of the audio modality. We focus on how to make effective use of the audio modality for temporal localization of captions within videos. We release a new audio-visual dataset that has captions time-aligned by (i) carefully listening to the audio and watching the video, and (ii) watching only the video. Our dataset is audio-rich and contains captions in two languages, English and Marathi (a low-resource language). We further propose an attention-driven multimodal model for effective utilization of both audio and video in temporal localization. We then investigate (i) the effects of audio in both data preparation and model design, and (ii) effective pretraining strategies (Audioset, ASR-bottleneck features, PASE, etc.) that help extract rich audio representations in the low-resource setting.

Original language: English
Title of host publication: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Editors: Helen Meng
Place of Publication: Baixas, France
Publisher: International Speech Communication Association (ISCA)
Number of pages: 5
Publication status: Published - 2020
Event: Annual Conference of the International Speech Communication Association (was Eurospeech) 2020 - Shanghai, China
Duration: 25 Oct 2020 to 29 Oct 2020
Conference number: 21st


Conference: Annual Conference of the International Speech Communication Association (was Eurospeech) 2020
Abbreviated title: Interspeech 2020


Keywords

  • Caption alignment for videos
  • Low-resource audio-visual corpus
  • Multimodal models
