Abstract
When humans describe an image in a long paragraph, we usually first compose an implicit mental "script" and then follow it to generate the paragraph. Inspired by this, we equip modern encoder-decoder image paragraph captioning models with this ability by proposing the Hierarchical Scene Graph Encoder-Decoder (HSGED) for generating coherent and distinctive paragraphs. In particular, we use the image scene graph as the "script" to incorporate rich semantic knowledge and, more importantly, hierarchical constraints into the model. Specifically, we design a sentence scene graph RNN (SSG-RNN) to generate sub-graph level topics, which constrain the word scene graph RNN (WSG-RNN) to generate the corresponding sentences. We propose irredundant attention in SSG-RNN to make it more likely that topics are abstracted from rarely described sub-graphs, and inheriting attention in WSG-RNN to generate more grounded sentences from the abstracted topics; both give rise to more distinctive paragraphs. An efficient sentence-level loss is also proposed to encourage the sequence of generated sentences to be similar to that of the ground-truth paragraphs. We validate HSGED on the Stanford image paragraph dataset and show that it not only achieves a new state-of-the-art 36.02 CIDEr-D, but also generates more coherent and distinctive paragraphs under various metrics.
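The abstract does not give the formulation of irredundant attention, so the following is only a minimal sketch of the general idea of steering attention toward sub-graphs that have been described less so far. It swaps in a generic coverage-style penalty, which is an assumption and not the paper's method; the function names, the `penalty` parameter, and the toy scores are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def coverage_penalized_attention(scores, coverage, penalty=1.0):
    """Attend over sub-graphs while discounting those already attended.

    scores:   raw relevance score per sub-graph for the current sentence
    coverage: accumulated attention each sub-graph has received so far
    penalty:  hypothetical strength of the discount (an assumption)
    """
    adjusted = [s - penalty * c for s, c in zip(scores, coverage)]
    return softmax(adjusted)


def attend_over_sentences(score_rows, penalty=1.0):
    """Run one attention step per sentence, updating coverage each time."""
    coverage = [0.0] * len(score_rows[0])
    history = []
    for scores in score_rows:
        weights = coverage_penalized_attention(scores, coverage, penalty)
        # Accumulate attention so later sentences are pushed elsewhere.
        coverage = [c + w for c, w in zip(coverage, weights)]
        history.append(weights)
    return history


# Toy input: three sub-graphs with the same raw scores for two sentences.
rows = [[2.0, 1.0, 0.5], [2.0, 1.0, 0.5]]
history = attend_over_sentences(rows, penalty=5.0)
```

In this toy run the first sentence attends mostly to the first sub-graph, and the penalty then shifts the second sentence's attention to a sub-graph that has received less attention, even though the raw scores are unchanged.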
Original language | English |
---|---|
Title of host publication | Proceedings of the 28th ACM International Conference on Multimedia |
Editors | Pradeep K. Atrey, Zhu Li |
Place of Publication | New York NY USA |
Publisher | Association for Computing Machinery (ACM) |
Pages | 4181-4189 |
Number of pages | 9 |
ISBN (Electronic) | 9781450379885 |
Publication status | Published - 2020 |
Event | ACM International Conference on Multimedia 2020 (28th), Online, United States of America, 12 Oct 2020 → 16 Oct 2020. Proceedings: https://dl.acm.org/doi/proceedings/10.1145/3394171 |
Conference
Conference | ACM International Conference on Multimedia 2020 |
---|---|
Abbreviated title | MM 2020 |
Country/Territory | United States of America |
Period | 12 Oct 2020 → 16 Oct 2020 |
Keywords
- hierarchical constraint
- hierarchical scene graph encoder decoder
- image paragraph generation
- scene graph