Abstract
We propose detecting deepfake videos based on the dissimilarity between the audio and visual modalities, termed the Modality Dissonance Score (MDS). We hypothesize that manipulation of either modality leads to disharmony between the two, e.g., loss of lip-sync or unnatural facial and lip movements. MDS is computed as the mean aggregate of dissimilarity scores between audio and visual segments in a video. Discriminative features are learnt for the audio and visual channels in a chunk-wise manner, employing a cross-entropy loss for the individual modalities and a contrastive loss that models inter-modality similarity. Extensive experiments on the DFDC and DeepFake-TIMIT datasets show that our approach outperforms the state-of-the-art by up to 7%. We also demonstrate temporal forgery localization, showing how our technique identifies the manipulated video segments.
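The abstract's core quantities can be sketched in a few lines. The snippet below is a minimal illustration, not the authors' implementation: it assumes audio and visual chunk embeddings are already extracted into equal-shaped arrays, uses Euclidean distance as the per-chunk dissimilarity, and adopts a hypothetical label convention (0 = real, 1 = fake) for a standard margin-based contrastive loss.

```python
import numpy as np

def modality_dissonance_score(audio_feats, visual_feats):
    """MDS as the mean per-chunk dissimilarity between modalities.

    audio_feats, visual_feats: arrays of shape [num_chunks, dim]
    (hypothetical shapes; the paper's feature extractors are not shown here).
    """
    # Euclidean distance between the audio and visual embedding of each chunk.
    d = np.linalg.norm(audio_feats - visual_feats, axis=1)
    return d.mean()

def contrastive_loss(audio_feats, visual_feats, label, margin=1.0):
    """Margin-based contrastive loss over audio-visual chunk pairs.

    Assumed convention: label 0 (real) pulls the modalities together,
    label 1 (fake) pushes them at least `margin` apart.
    """
    d = np.linalg.norm(audio_feats - visual_feats, axis=1)
    loss = (1 - label) * d**2 + label * np.maximum(margin - d, 0.0)**2
    return loss.mean()
```

At test time, a video whose MDS exceeds a chosen threshold would be flagged as fake, and per-chunk distances indicate which segments were manipulated (the basis for the temporal localization the abstract mentions).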
Original language | English |
---|---|
Title of host publication | Proceedings of the 28th ACM International Conference on Multimedia |
Editors | Pradeep K. Atrey, Zhu Li |
Place of Publication | New York NY USA |
Publisher | Association for Computing Machinery (ACM) |
Pages | 439-447 |
Number of pages | 9 |
ISBN (Electronic) | 9781450379885 |
DOIs | |
Publication status | Published - 2020 |
Event | ACM International Conference on Multimedia 2020 - Online, United States of America; Duration: 12 Oct 2020 → 16 Oct 2020; Conference number: 28th; Proceedings: https://dl.acm.org/doi/proceedings/10.1145/3394171 |
Conference
Conference | ACM International Conference on Multimedia 2020 |
---|---|
Abbreviated title | MM 2020 |
Country/Territory | United States of America |
Period | 12/10/20 → 16/10/20 |
Keywords
- contrastive loss
- deepfake detection and localization
- modality dissonance
- neural networks