Abstract
Current multilingual vision-language models either require a large number of additional parameters for each supported language, or suffer performance degradation as languages are added. In this paper, we propose a Scalable Multilingual Aligned Language Representation (SMALR) that supports many languages with few model parameters without sacrificing downstream task performance. SMALR learns a fixed-size language-agnostic representation for most words in a multilingual vocabulary, keeping language-specific features for just a few. We use a masked cross-language modeling loss to align features with context from other languages. Additionally, we propose a cross-lingual consistency module that ensures predictions made for a query and its machine translation are comparable. The effectiveness of SMALR is demonstrated with ten diverse languages, over twice the number supported in vision-language tasks to date. We evaluate on multilingual image-sentence retrieval and outperform prior work by 3–4% with less than 1/5th the training parameters compared to other word embedding methods.
Original language | English |
---|---|
Title of host publication | 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part IV |
Editors | Andrea Vedaldi, Horst Bischof, Thomas Brox, Jan-Michael Frahm |
Place of Publication | Cham, Switzerland |
Publisher | Springer |
Pages | 197-213 |
Number of pages | 17 |
ISBN (Electronic) | 9783030585488 |
ISBN (Print) | 9783030585471 |
Publication status | Published - 2020 |
Externally published | Yes |
Event | European Conference on Computer Vision 2020, Glasgow, United Kingdom. Duration: 23 Aug 2020 → 28 Aug 2020. Conference number: 16th. Proceedings: https://link.springer.com/book/10.1007/978-3-030-58452-8. Website: https://eccv2020.eu |
Publication series
Name | Lecture Notes in Computer Science |
---|---|
Publisher | Springer |
Volume | 12349 |
ISSN (Print) | 0302-9743 |
ISSN (Electronic) | 1611-3349 |
Conference
Conference | European Conference on Computer Vision 2020 |
---|---|
Abbreviated title | ECCV 2020 |
Country/Territory | United Kingdom |
City | Glasgow |
Period | 23/08/20 → 28/08/20 |
Keywords
- Image-sentence retrieval
- Multilingual word embeddings
- Scalable vision-language models