Auto-encoding and distilling scene graphs for image captioning

Xu Yang, Hanwang Zhang, Jianfei Cai

Research output: Contribution to journal › Article › Research › peer-review


We propose the Scene Graph Auto-Encoder (SGAE), which incorporates a language inductive bias into the encoder-decoder image captioning framework to produce more human-like captions. Intuitively, we humans use this inductive bias to compose collocations and make contextual inferences in discourse. For example, when we see the relation "a person on a bike", it is natural to replace "on" with "ride" and infer "a person riding a bike on a road" even when the "road" is not evident. Exploiting such bias as a language prior is therefore expected to help conventional encoder-decoder models reason as humans do and generate more descriptive captions. Specifically, we use the scene graph, a directed graph (G) in which an object node is connected by adjective nodes and relationship nodes, to represent the complex structural layout of both the image (I) and the sentence (S). In the language domain, we use SGAE to learn a dictionary set (D) that helps reconstruct sentences in the S → G_S → D → S auto-encoding pipeline, where D encodes the desired language prior and the decoder learns to caption from that prior. In the vision-language domain, we share D in the I → G_I → D → S pipeline and distill the knowledge of the auto-encoder's language decoder into the decoder of the encoder-decoder image captioner, thereby transferring the language inductive bias. In this way, the shared D provides the encoder-decoder with hidden embeddings of descriptive collocations, and the distillation strategy teaches the encoder-decoder to transform these embeddings into human-like captions, as the auto-encoder does. Thanks to the scene graph representation, the shared dictionary set, and the knowledge distillation strategy, the inductive bias can in principle be transferred across domains.
We validate the effectiveness of SGAE on the challenging MS-COCO image captioning benchmark, where our SGAE-based single model achieves a new state-of-the-art 129.6 CIDEr-D on the Karpathy split and a competitive 126.6 CIDEr-D (c40) on the official server, which is comparable even to ensemble models. Furthermore, we validate the transferability of SGAE in two more challenging settings: transferring the inductive bias from other language corpora, and unpaired image captioning. In both settings, the results again confirm the superiority of SGAE.
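
The two mechanisms named in the abstract, re-encoding through a shared dictionary D and distilling a language decoder into the image captioner, can be illustrated with a minimal sketch. The abstract does not specify how D is queried or which distillation loss is used, so the soft-attention lookup and the KL-divergence loss below are assumptions made for illustration, and the function names `re_encode` and `distillation_loss` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def re_encode(x, dictionary):
    """Rewrite embedding x as an attention-weighted mix of dictionary entries.

    Assumption: D is queried by dot-product soft attention. The same D
    would be shared by the sentence and image pipelines.
    """
    scores = [sum(xi * di for xi, di in zip(x, d)) for d in dictionary]
    weights = softmax(scores)
    dim = len(x)
    return [sum(w * d[j] for w, d in zip(weights, dictionary)) for j in range(dim)]


def distillation_loss(teacher_probs, student_probs, eps=1e-12):
    """KL(teacher || student) for one next-word distribution.

    Assumption: the auto-encoder's decoder acts as the teacher and the
    image captioner's decoder as the student.
    """
    return sum(
        t * math.log((t + eps) / (s + eps))
        for t, s in zip(teacher_probs, student_probs)
    )


# Toy usage: a 2-entry dictionary and a scene-graph node embedding.
D = [[1.0, 0.0], [0.0, 1.0]]
x_hat = re_encode([2.0, 0.0], D)
loss = distillation_loss([0.9, 0.1], [0.5, 0.5])
```

In this toy setup the re-encoded vector leans toward the dictionary entry most similar to the input, and the loss is zero only when the student matches the teacher distribution.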

Original language: English
Number of pages: 14
Journal: IEEE Transactions on Pattern Analysis and Machine Intelligence
Publication status: Accepted/In press - 3 Dec 2020


  • Decoding
  • Dictionaries
  • Image Captioning
  • Knowledge Distillation
  • Memory Network
  • Pipelines
  • Roads
  • Scene Graph
  • Semantics
  • Training
  • Transfer Learning
  • Visualization
