Jointing cross-modality retrieval to reweight attributes for image caption generation

Yuxuan Ding; Wei Wang; Mengmeng Jiang; Heng Liu; Donghu Deng; Wei Wei; Chunna Tian

doi:10.1007/978-3-030-31726-3_6

School of Computer Science

Xidian University

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Automatically describing images in natural language is a key problem in image understanding. In this paper, we propose an image captioning framework that combines specific semantics with general semantics. For the specific semantics, we retrieve captions for the given image in a visual-semantic embedding space. For the general semantics, we first extract common attributes of the image with Multiple Instance Learning (MIL) detectors. We then use the specific semantics to re-rank the attributes extracted by MIL; the re-ranked attributes are mapped into the visual feature layer of a CNN to extract a joint visual feature. Finally, we feed this visual feature to an LSTM and generate the image caption under the guidance of BLEU-4 similarity, incorporating the sentence-making priors of the reference captions. We evaluate our algorithm with the standard metrics BLEU, CIDEr, ROUGE-L, and METEOR. Experimental results show that our approach outperforms state-of-the-art methods.

Original language	English
Title of host publication	Pattern Recognition and Computer Vision- 2nd Chinese Conference, PRCV 2019, Proceedings, Part III
Editors	Zhouchen Lin, Liang Wang, Tieniu Tan, Jian Yang, Guangming Shi, Nanning Zheng, Xilin Chen, Yanning Zhang
Publisher	Springer
Pages	62-74
Number of pages	13
ISBN (Print)	9783030317256
DOIs	https://doi.org/10.1007/978-3-030-31726-3_6
State	Published - 2019
Event	2nd Chinese Conference on Pattern Recognition and Computer Vision, PRCV 2019 - Xi’an, China
Duration	8 Nov 2019 → 11 Nov 2019

Publication series

Name	Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume	11859 LNCS
ISSN (Print)	0302-9743
ISSN (Electronic)	1611-3349

Conference

Conference	2nd Chinese Conference on Pattern Recognition and Computer Vision, PRCV 2019
Country/Territory	China
City	Xi’an
Period	8/11/19 → 11/11/19

Keywords

Cross-modality retrieval
Image captioning
Semantic attribute

Access to Document

10.1007/978-3-030-31726-3_6

Cite this

Ding, Y., Wang, W., Jiang, M., Liu, H., Deng, D., Wei, W., & Tian, C. (2019). Jointing cross-modality retrieval to reweight attributes for image caption generation. In Z. Lin, L. Wang, T. Tan, J. Yang, G. Shi, N. Zheng, X. Chen, & Y. Zhang (Eds.), Pattern Recognition and Computer Vision- 2nd Chinese Conference, PRCV 2019, Proceedings, Part III (pp. 62-74). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 11859 LNCS). Springer. https://doi.org/10.1007/978-3-030-31726-3_6

Ding, Yuxuan ; Wang, Wei ; Jiang, Mengmeng et al. / Jointing cross-modality retrieval to reweight attributes for image caption generation. Pattern Recognition and Computer Vision- 2nd Chinese Conference, PRCV 2019, Proceedings, Part III. editor / Zhouchen Lin ; Liang Wang ; Tieniu Tan ; Jian Yang ; Guangming Shi ; Nanning Zheng ; Xilin Chen ; Yanning Zhang. Springer, 2019. pp. 62-74 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)).

@inproceedings{513a4b5d36184532a680cc995d4e125f,

title = "Jointing cross-modality retrieval to reweight attributes for image caption generation",

abstract = "Automatic natural language description for images is one of the key issues towards image understanding. In this paper, we propose an image caption framework, which explores specific semantics jointing with general semantics. For specific semantics, we propose to retrieve captions of the given image in a visual-semantic embedding space. To explore the general semantics, we first extract the common attributes of the image by Multiple Instance Learning (MIL) detectors. Then, we use the specific semantics to re-rank the semantic attributes extracted by MIL, which are mapped into visual feature layer of CNN to extract the jointing visual feature. Finally, we feed the visual feature to LSTM and generate the caption of image under the guidance of BLEU 4 similarity, incorporating the sentence-making priors of reference captions. We evaluate our algorithm on standard metrics: BLEU, CIDEr, ROUGE L and METEOR. Experimental results show our approach outperforms the state-of-the-art methods.",

keywords = "Cross-modality retrieval, Image captioning, Semantic attribute",

author = "Yuxuan Ding and Wei Wang and Mengmeng Jiang and Heng Liu and Donghu Deng and Wei Wei and Chunna Tian",

note = "Publisher Copyright: {\textcopyright} Springer Nature Switzerland AG 2019.; 2nd Chinese Conference on Pattern Recognition and Computer Vision, PRCV 2019 ; Conference date: 08-11-2019 Through 11-11-2019",

year = "2019",

doi = "10.1007/978-3-030-31726-3_6",

language = "English",

isbn = "9783030317256",

series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",

publisher = "Springer",

pages = "62--74",

editor = "Zhouchen Lin and Liang Wang and Tieniu Tan and Jian Yang and Guangming Shi and Nanning Zheng and Xilin Chen and Yanning Zhang",

booktitle = "Pattern Recognition and Computer Vision- 2nd Chinese Conference, PRCV 2019, Proceedings, Part III",

}

Ding, Y, Wang, W, Jiang, M, Liu, H, Deng, D, Wei, W & Tian, C 2019, Jointing cross-modality retrieval to reweight attributes for image caption generation. in Z Lin, L Wang, T Tan, J Yang, G Shi, N Zheng, X Chen & Y Zhang (eds), Pattern Recognition and Computer Vision- 2nd Chinese Conference, PRCV 2019, Proceedings, Part III. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 11859 LNCS, Springer, pp. 62-74, 2nd Chinese Conference on Pattern Recognition and Computer Vision, PRCV 2019, Xi’an, China, 8/11/19. https://doi.org/10.1007/978-3-030-31726-3_6

Jointing cross-modality retrieval to reweight attributes for image caption generation. / Ding, Yuxuan; Wang, Wei; Jiang, Mengmeng et al.
Pattern Recognition and Computer Vision- 2nd Chinese Conference, PRCV 2019, Proceedings, Part III. ed. / Zhouchen Lin; Liang Wang; Tieniu Tan; Jian Yang; Guangming Shi; Nanning Zheng; Xilin Chen; Yanning Zhang. Springer, 2019. p. 62-74 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 11859 LNCS).

TY - GEN

T1 - Jointing cross-modality retrieval to reweight attributes for image caption generation

AU - Ding, Yuxuan

AU - Wang, Wei

AU - Jiang, Mengmeng

AU - Liu, Heng

AU - Deng, Donghu

AU - Wei, Wei

AU - Tian, Chunna

N1 - Publisher Copyright: © Springer Nature Switzerland AG 2019.

PY - 2019

Y1 - 2019

N2 - Automatic natural language description for images is one of the key issues towards image understanding. In this paper, we propose an image caption framework, which explores specific semantics jointing with general semantics. For specific semantics, we propose to retrieve captions of the given image in a visual-semantic embedding space. To explore the general semantics, we first extract the common attributes of the image by Multiple Instance Learning (MIL) detectors. Then, we use the specific semantics to re-rank the semantic attributes extracted by MIL, which are mapped into visual feature layer of CNN to extract the jointing visual feature. Finally, we feed the visual feature to LSTM and generate the caption of image under the guidance of BLEU 4 similarity, incorporating the sentence-making priors of reference captions. We evaluate our algorithm on standard metrics: BLEU, CIDEr, ROUGE L and METEOR. Experimental results show our approach outperforms the state-of-the-art methods.

AB - Automatic natural language description for images is one of the key issues towards image understanding. In this paper, we propose an image caption framework, which explores specific semantics jointing with general semantics. For specific semantics, we propose to retrieve captions of the given image in a visual-semantic embedding space. To explore the general semantics, we first extract the common attributes of the image by Multiple Instance Learning (MIL) detectors. Then, we use the specific semantics to re-rank the semantic attributes extracted by MIL, which are mapped into visual feature layer of CNN to extract the jointing visual feature. Finally, we feed the visual feature to LSTM and generate the caption of image under the guidance of BLEU 4 similarity, incorporating the sentence-making priors of reference captions. We evaluate our algorithm on standard metrics: BLEU, CIDEr, ROUGE L and METEOR. Experimental results show our approach outperforms the state-of-the-art methods.

KW - Cross-modality retrieval

KW - Image captioning

KW - Semantic attribute

UR - http://www.scopus.com/inward/record.url?scp=85084390539&partnerID=8YFLogxK

U2 - 10.1007/978-3-030-31726-3_6

DO - 10.1007/978-3-030-31726-3_6

M3 - Conference contribution

AN - SCOPUS:85084390539

SN - 9783030317256

T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

SP - 62

EP - 74

BT - Pattern Recognition and Computer Vision- 2nd Chinese Conference, PRCV 2019, Proceedings, Part III

A2 - Lin, Zhouchen

A2 - Wang, Liang

A2 - Tan, Tieniu

A2 - Yang, Jian

A2 - Shi, Guangming

A2 - Zheng, Nanning

A2 - Chen, Xilin

A2 - Zhang, Yanning

PB - Springer

T2 - 2nd Chinese Conference on Pattern Recognition and Computer Vision, PRCV 2019

Y2 - 8 November 2019 through 11 November 2019

ER -

Ding Y, Wang W, Jiang M, Liu H, Deng D, Wei W et al. Jointing cross-modality retrieval to reweight attributes for image caption generation. In Lin Z, Wang L, Tan T, Yang J, Shi G, Zheng N, Chen X, Zhang Y, editors, Pattern Recognition and Computer Vision- 2nd Chinese Conference, PRCV 2019, Proceedings, Part III. Springer. 2019. p. 62-74. (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)). doi: 10.1007/978-3-030-31726-3_6
