TY - GEN
T1 - Deep cross-modal retrieval for remote sensing image and audio
AU - Mao, Guo
AU - Yuan, Yuan
AU - Lu, Xiaoqiang
N1 - Publisher Copyright:
© 2018 IEEE.
PY - 2018/10/8
Y1 - 2018/10/8
N2 - Remote sensing image retrieval has many important applications in civilian and military fields, such as disaster monitoring and target detection. However, existing research on image retrieval, which mainly follows two directions, text-based and content-based, cannot meet the demands for speed and convenience in some special applications and emergency scenarios. Text-based retrieval is limited by keyboard input, which is inefficient in urgent situations, while content-based retrieval requires an example image as a reference, which usually does not exist. Speech, as a direct, natural, and efficient mode of human-machine interaction, can make up for these shortcomings. Hence, a novel cross-modal retrieval method for remote sensing images and spoken audio is proposed in this paper. We first build a large-scale remote sensing image dataset with abundant manually annotated spoken audio captions for the cross-modal retrieval task. Then a Deep Visual-Audio Network is designed to directly learn the correspondence between image and audio; this model integrates feature extraction and multi-modal learning into the same network. Experiments on the proposed dataset verify the effectiveness of our approach and show that speech-to-image retrieval is feasible.
KW - Convolutional neural network
KW - Cross-modal retrieval
KW - Remote sensing image
KW - Spoken audio
UR - http://www.scopus.com/inward/record.url?scp=85056498790&partnerID=8YFLogxK
U2 - 10.1109/PRRS.2018.8486338
DO - 10.1109/PRRS.2018.8486338
M3 - Conference contribution
AN - SCOPUS:85056498790
T3 - 2018 10th IAPR Workshop on Pattern Recognition in Remote Sensing, PRRS 2018
BT - 2018 10th IAPR Workshop on Pattern Recognition in Remote Sensing, PRRS 2018
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 10th IAPR Workshop on Pattern Recognition in Remote Sensing, PRRS 2018
Y2 - 19 August 2018 through 20 August 2018
ER -