TY - GEN
T1 - Jointing cross-modality retrieval to reweight attributes for image caption generation
AU - Ding, Yuxuan
AU - Wang, Wei
AU - Jiang, Mengmeng
AU - Liu, Heng
AU - Deng, Donghu
AU - Wei, Wei
AU - Tian, Chunna
N1 - Publisher Copyright:
© Springer Nature Switzerland AG 2019.
PY - 2019
Y1 - 2019
N2 - Automatic natural language description of images is a key issue in image understanding. In this paper, we propose an image captioning framework that explores specific semantics jointly with general semantics. For specific semantics, we propose to retrieve captions of the given image in a visual-semantic embedding space. To explore the general semantics, we first extract the common attributes of the image with Multiple Instance Learning (MIL) detectors. Then, we use the specific semantics to re-rank the semantic attributes extracted by MIL, which are mapped into the visual feature layer of a CNN to extract the joint visual feature. Finally, we feed the visual feature to an LSTM and generate the image caption under the guidance of BLEU-4 similarity, incorporating the sentence-making priors of reference captions. We evaluate our algorithm on standard metrics: BLEU, CIDEr, ROUGE-L and METEOR. Experimental results show that our approach outperforms state-of-the-art methods.
AB - Automatic natural language description of images is a key issue in image understanding. In this paper, we propose an image captioning framework that explores specific semantics jointly with general semantics. For specific semantics, we propose to retrieve captions of the given image in a visual-semantic embedding space. To explore the general semantics, we first extract the common attributes of the image with Multiple Instance Learning (MIL) detectors. Then, we use the specific semantics to re-rank the semantic attributes extracted by MIL, which are mapped into the visual feature layer of a CNN to extract the joint visual feature. Finally, we feed the visual feature to an LSTM and generate the image caption under the guidance of BLEU-4 similarity, incorporating the sentence-making priors of reference captions. We evaluate our algorithm on standard metrics: BLEU, CIDEr, ROUGE-L and METEOR. Experimental results show that our approach outperforms state-of-the-art methods.
KW - Cross-modality retrieval
KW - Image captioning
KW - Semantic attribute
UR - http://www.scopus.com/inward/record.url?scp=85084390539&partnerID=8YFLogxK
U2 - 10.1007/978-3-030-31726-3_6
DO - 10.1007/978-3-030-31726-3_6
M3 - Conference contribution
AN - SCOPUS:85084390539
SN - 9783030317256
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 62
EP - 74
BT - Pattern Recognition and Computer Vision- 2nd Chinese Conference, PRCV 2019, Proceedings, Part III
A2 - Lin, Zhouchen
A2 - Wang, Liang
A2 - Tan, Tieniu
A2 - Yang, Jian
A2 - Shi, Guangming
A2 - Zheng, Nanning
A2 - Chen, Xilin
A2 - Zhang, Yanning
PB - Springer
T2 - 2nd Chinese Conference on Pattern Recognition and Computer Vision, PRCV 2019
Y2 - 8 November 2019 through 11 November 2019
ER -