TY - GEN
T1 - Image Captioning Algorithm Based on Sufficient Visual Information and Text Information
AU - Zhao, Yongqiang
AU - Rao, Yuan
AU - Wu, Lianwei
AU - Feng, Cong
N1 - Publisher Copyright:
© 2020, Springer Nature Switzerland AG.
PY - 2020
Y1 - 2020
N2 - Most existing attention-based methods for image captioning focus only on the current visual and text information at each step to generate the next word, without considering the coherence within the visual information and the text information themselves. We propose a sufficient visual information (SVI) module to supplement the existing visual information in the network, and a sufficient text information (STI) module that predicts more text words to supplement the text information in the network. The sufficient visual information module embeds the attention values from the past two steps into the current attention to mimic human visual coherence. The sufficient text information module predicts the next three words in a single step and jointly uses their probabilities for inference. Finally, this paper combines these two modules into an image captioning algorithm based on a sufficient visual information and text information model (SVITI) to further integrate existing visual information and future text information in the network, thereby improving the image captioning performance of the model. These three methods are applied to a classic image captioning algorithm and achieve significant performance improvements over the latest methods on the MS COCO dataset.
AB - Most existing attention-based methods for image captioning focus only on the current visual and text information at each step to generate the next word, without considering the coherence within the visual information and the text information themselves. We propose a sufficient visual information (SVI) module to supplement the existing visual information in the network, and a sufficient text information (STI) module that predicts more text words to supplement the text information in the network. The sufficient visual information module embeds the attention values from the past two steps into the current attention to mimic human visual coherence. The sufficient text information module predicts the next three words in a single step and jointly uses their probabilities for inference. Finally, this paper combines these two modules into an image captioning algorithm based on a sufficient visual information and text information model (SVITI) to further integrate existing visual information and future text information in the network, thereby improving the image captioning performance of the model. These three methods are applied to a classic image captioning algorithm and achieve significant performance improvements over the latest methods on the MS COCO dataset.
KW - Image captioning
KW - Sufficient text information
KW - Sufficient visual information
UR - http://www.scopus.com/inward/record.url?scp=85097101829&partnerID=8YFLogxK
U2 - 10.1007/978-3-030-63823-8_69
DO - 10.1007/978-3-030-63823-8_69
M3 - Conference contribution
AN - SCOPUS:85097101829
SN - 9783030638221
T3 - Communications in Computer and Information Science
SP - 607
EP - 615
BT - Neural Information Processing - 27th International Conference, ICONIP 2020, Proceedings
A2 - Yang, Haiqin
A2 - Pasupa, Kitsuchart
A2 - Leung, Andrew Chi-Sing
A2 - Kwok, James T.
A2 - Chan, Jonathan H.
A2 - King, Irwin
PB - Springer Science and Business Media Deutschland GmbH
T2 - 27th International Conference on Neural Information Processing, ICONIP 2020
Y2 - 18 November 2020 through 22 November 2020
ER -