Jointing cross-modality retrieval to reweight attributes for image caption generation

Yuxuan Ding, Wei Wang, Mengmeng Jiang, Heng Liu, Donghu Deng, Wei Wei, Chunna Tian

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-reviewed

Abstract

Automatically describing images in natural language is a key problem in image understanding. In this paper, we propose an image captioning framework that combines specific semantics with general semantics. For the specific semantics, we retrieve captions for the given image in a visual-semantic embedding space. For the general semantics, we first extract common attributes of the image with Multiple Instance Learning (MIL) detectors. We then use the specific semantics to re-rank the semantic attributes extracted by MIL, and map the re-ranked attributes into the visual feature layer of a CNN to extract a joint visual feature. Finally, we feed this visual feature to an LSTM and generate the image caption under the guidance of BLEU-4 similarity, incorporating the sentence-making priors of the reference captions. We evaluate our algorithm on the standard metrics BLEU, CIDEr, ROUGE-L and METEOR. Experimental results show that our approach outperforms state-of-the-art methods.
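The retrieve-then-re-rank idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not give the reweighting formula, so the rule below (boosting an attribute's detector score by how often it appears in the retrieved captions) is an assumption, as are the function names `retrieve_captions` and `reweight_attributes`, the parameter `alpha`, and all of the toy embeddings and probabilities.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve_captions(image_vec, caption_bank, k=2):
    """Return the k captions whose embeddings lie closest to the image embedding."""
    ranked = sorted(caption_bank, key=lambda item: cosine(image_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def reweight_attributes(attr_probs, retrieved, alpha=0.5):
    """Boost attributes that occur in the retrieved captions, then sort by score.

    Hypothetical rule (an assumption, not taken from the paper):
    score = p * (1 + alpha * fraction of retrieved captions containing the attribute).
    """
    counts = Counter(word for cap in retrieved for word in set(cap.lower().split()))
    n = max(len(retrieved), 1)
    scores = {a: p * (1.0 + alpha * counts[a] / n) for a, p in attr_probs.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy data: every value here is made up for illustration.
image_vec = [0.9, 0.1, 0.0]
caption_bank = [
    ("a dog runs on the grass", [1.0, 0.0, 0.0]),
    ("a dog catches a frisbee", [0.8, 0.2, 0.0]),
    ("a plate of food on a table", [0.0, 0.0, 1.0]),
]
# Stand-in for MIL detector outputs: attribute -> probability.
attr_probs = {"dog": 0.6, "grass": 0.3, "table": 0.65}

retrieved = retrieve_captions(image_vec, caption_bank, k=2)
ranked = reweight_attributes(attr_probs, retrieved)
print(retrieved)
print(ranked)
```

In this toy case the detector alone ranks "table" highest, while the two retrieved captions both mention "dog", so the reweighted list puts "dog" first. In the framework described above, the image and caption embeddings would come from a learned visual-semantic embedding model and the attribute scores from the MIL detectors rather than from hand-written values.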

Original language: English
Title of host publication: Pattern Recognition and Computer Vision - 2nd Chinese Conference, PRCV 2019, Proceedings, Part III
Editors: Zhouchen Lin, Liang Wang, Tieniu Tan, Jian Yang, Guangming Shi, Nanning Zheng, Xilin Chen, Yanning Zhang
Publisher: Springer
Pages: 62-74
Number of pages: 13
ISBN (Print): 9783030317256
State: Published - 2019
Event: 2nd Chinese Conference on Pattern Recognition and Computer Vision, PRCV 2019 - Xi’an, China
Duration: 8 Nov 2019 to 11 Nov 2019

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 11859 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 2nd Chinese Conference on Pattern Recognition and Computer Vision, PRCV 2019
Country/Territory: China
City: Xi’an
Period: 8/11/19 to 11/11/19

Keywords

  • Cross-modality retrieval
  • Image captioning
  • Semantic attribute
