Vision-based active speaker detection in multiparty interaction

Kalin Stefanov, Jonas Beskow, Giampiero Salvi

Research output: Chapter in Book/Report/Conference proceeding › Conference Paper › Research › peer-review

Abstract

This paper presents a supervised learning method for automatic visual detection of the active speaker in multiparty interactions. The presented detectors are built using a multimodal multiparty interaction dataset previously recorded with the purpose of exploring patterns in the focus of visual attention of humans. Three different conditions are included: two humans involved in task-based interaction with a robot; the same two humans involved in task-based interaction where the robot is replaced by a third human; and a free three-party human interaction. The paper also presents an evaluation of the active speaker detection method in a speaker-dependent experiment, showing that the method achieves good accuracy in a fairly unconstrained scenario using only image data as input. The main goal of the presented method is to provide real-time detection of the active speaker within a broader framework implemented on a robot and used to generate natural focus-of-visual-attention behavior during multiparty human-robot interactions.
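
The record above does not specify the detector itself, so the following is only a minimal sketch of what a supervised, image-only, per-frame "speaking / not speaking" classifier of this kind might look like, written in PyTorch. The network architecture, the 64x64 grayscale face-crop input, and the toy training step are assumptions made for illustration; they are not the authors' actual model.

    # Illustrative sketch only: a frame-level, visual-only active speaker
    # classifier in the spirit of the abstract. Architecture, input size,
    # and training details are assumptions, not the published method.
    import torch
    import torch.nn as nn

    class FrameSpeakerDetector(nn.Module):
        """Binary classifier over single face crops (assumed 64x64 grayscale)."""
        def __init__(self):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            )
            self.classifier = nn.Sequential(
                nn.Flatten(),
                nn.Linear(64 * 8 * 8, 128), nn.ReLU(),
                nn.Linear(128, 1),  # logit for "this person is currently speaking"
            )

        def forward(self, x):
            return self.classifier(self.features(x))

    if __name__ == "__main__":
        model = FrameSpeakerDetector()
        loss_fn = nn.BCEWithLogitsLoss()
        optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

        # Dummy batch standing in for face crops and speech/non-speech labels.
        frames = torch.randn(8, 1, 64, 64)
        labels = torch.randint(0, 2, (8, 1)).float()

        optimizer.zero_grad()
        logits = model(frames)
        loss = loss_fn(logits, labels)
        loss.backward()
        optimizer.step()
        print(f"toy training-step loss: {loss.item():.3f}")

In a real-time, multiparty setting such as the one described in the abstract, one would presumably run a detector of this sort per participant on face crops extracted from the robot's camera stream; the exact pipeline used by the authors is described in the paper itself, not in this record.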
Original language: English
Title of host publication: GLU 2017 International Workshop on Grounding Language Understanding
Editors: Giampiero Salvi, Stéphane Dupont
Place of Publication: Baixas, France
Publisher: International Speech Communication Association
Pages: 5
Number of pages: 47
DOIs
Publication status: Published - 2017
Externally published: Yes
Event: International Workshop on Grounding Language Understanding 2017 - Stockholm, Sweden
Duration: 25 Aug 2017 - 25 Aug 2017
https://www.isca-speech.org/archive/GLU_2017/index.html

Conference

Conference: International Workshop on Grounding Language Understanding 2017
Country: Sweden
City: Stockholm
Period: 25/08/17 - 25/08/17
Internet address: https://www.isca-speech.org/archive/GLU_2017/index.html

Keywords

  • machine learning
  • active speaker detection
  • multiparty human-robot interaction
