Hierarchical Scene Graph Encoder-Decoder for image paragraph captioning

Xu Yang, Chongyang Gao, Hanwang Zhang, Jianfei Cai

Research output: Chapter in Book/Report/Conference proceeding › Conference Paper › Research › peer-review

16 Citations (Scopus)


When we humans describe an image in a long paragraph, we usually first implicitly compose a mental "script" and then follow it to generate the paragraph. Inspired by this, we equip the modern encoder-decoder based image paragraph captioning model with this ability by proposing the Hierarchical Scene Graph Encoder-Decoder (HSGED) for generating coherent and distinctive paragraphs. In particular, we use the image scene graph as the "script" to incorporate rich semantic knowledge and, more importantly, the hierarchical constraints into the model. Specifically, we design a sentence scene graph RNN (SSG-RNN) to generate sub-graph level topics, which constrain the word scene graph RNN (WSG-RNN) to generate the corresponding sentences. We propose irredundant attention in SSG-RNN to increase the likelihood of abstracting topics from rarely described sub-graphs, and inheriting attention in WSG-RNN to generate sentences that are better grounded in the abstracted topics; both give rise to more distinctive paragraphs. An efficient sentence-level loss is also proposed to encourage the sequence of generated sentences to be similar to that of the ground-truth paragraphs. We validate HSGED on the Stanford image paragraph dataset and show that it not only achieves a new state-of-the-art 36.02 CIDEr-D, but also generates more coherent and distinctive paragraphs under various metrics.
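The abstract does not give the formulation of irredundant attention, so the following is only a minimal sketch of one plausible reading: attention over sub-graphs in which previously attended sub-graphs are down-weighted, so later sentence topics drift toward sub-graphs not yet described. The function names, the `penalty` parameter, and the subtractive coverage rule are all assumptions made for illustration and are not taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def irredundant_attention(scores, coverage, penalty=5.0):
    """Hypothetical rule (an assumption, not the paper's equation):
    subtract penalty * accumulated past attention from each raw
    sub-graph score before normalising."""
    adjusted = [s - penalty * c for s, c in zip(scores, coverage)]
    return softmax(adjusted)


def pick_topics(scores, num_sentences, penalty=5.0):
    """Pick one sub-graph index per sentence, tracking how much
    attention each sub-graph has already received."""
    coverage = [0.0] * len(scores)
    topics = []
    for _ in range(num_sentences):
        weights = irredundant_attention(scores, coverage, penalty)
        topics.append(max(range(len(weights)), key=weights.__getitem__))
        # Accumulate attention so attended sub-graphs are penalised later.
        coverage = [c + w for c, w in zip(coverage, weights)]
    return topics


if __name__ == "__main__":
    raw_scores = [2.0, 1.0, 0.5]  # made-up relevance scores for three sub-graphs
    print("no penalty:  ", pick_topics(raw_scores, 3, penalty=0.0))
    print("with penalty:", pick_topics(raw_scores, 3, penalty=5.0))
```

With the penalty set to zero, this toy version keeps selecting the highest-scoring sub-graph for every sentence; with a positive penalty, the selection spreads across sub-graphs, which is the behaviour the abstract attributes to irredundant attention.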

Original language: English
Title of host publication: Proceedings of the 28th ACM International Conference on Multimedia
Editors: Pradeep K. Atrey, Zhu Li
Place of Publication: New York NY USA
Publisher: Association for Computing Machinery (ACM)
Number of pages: 9
ISBN (Electronic): 9781450379885
Publication status: Published - 2020
Event: ACM International Conference on Multimedia 2020 - Online, United States of America
Duration: 12 Oct 2020 to 16 Oct 2020
Conference number: 28th
https://dl.acm.org/doi/proceedings/10.1145/3394171 (Proceedings)


Conference: ACM International Conference on Multimedia 2020
Abbreviated title: MM 2020
Country/Territory: United States of America


  • hierarchical constraint
  • hierarchical scene graph encoder decoder
  • image paragraph generation
  • scene graph
