Semantics-Consistent Representation Learning for Remote Sensing Image-Voice Retrieval

Hailong Ning; Bin Zhao; Yuan Yuan

doi:10.1109/TGRS.2021.3060705

School of Artificial Intelligence, OPtics and Electronics

Research output: Contribution to journal › Article › peer-review

42 Scopus citations

Abstract

With the development of earth observation technology, massive amounts of remote sensing (RS) images are acquired. To find useful information in these images, cross-modal RS image-voice retrieval offers a new approach. This article studies the task of RS image-voice retrieval so as to search for useful information in massive amounts of RS data. Existing methods for RS image-voice retrieval rely primarily on the pairwise relationship to narrow the heterogeneous semantic gap between images and voices. However, apart from the pairwise relationship included in the data sets, the intramodality and nonpaired intermodality relationships should also be considered simultaneously, since the semantic consistency among nonpaired representations plays an important role in the RS image-voice retrieval task. Inspired by this, a semantics-consistent representation learning (SCRL) method is proposed for RS image-voice retrieval. The main novelty is that the proposed method takes the pairwise, intramodality, and nonpaired intermodality relationships into account simultaneously, thereby improving the semantic consistency of the learned representations for RS image-voice retrieval. The proposed SCRL method consists of two main steps: 1) semantics encoding and 2) SCRL. First, an image encoding network is adopted to extract high-level image features with a transfer learning strategy, and a voice encoding network with dilated convolution is devised to obtain high-level voice features. Second, a consistent representation space is constructed by modeling the three kinds of relationships to narrow the heterogeneous semantic gap and learn semantics-consistent representations across the two modalities. Extensive experimental results on three challenging RS image-voice data sets, namely the Sydney, UCM, and RSICD image-voice data sets, show the effectiveness of the proposed method.
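The abstract mentions a voice encoding network built on dilated convolution but gives no architectural details. As a rough illustration of the general technique only (not the authors' network), the sketch below applies a single 1-D dilated convolution to a toy signal with NumPy. The function name `dilated_conv1d`, the kernel values, and the dilation rate are all hypothetical choices made for this example.

```python
import numpy as np


def dilated_conv1d(signal, kernel, dilation=1):
    """Valid-mode 1-D dilated convolution.

    Computed as cross-correlation (no kernel flip), which is the usual
    convention in deep learning. Illustrative only; not taken from the paper.
    """
    signal = np.asarray(signal, dtype=float)
    kernel = np.asarray(kernel, dtype=float)

    # Number of input samples covered by one output sample.
    span = (len(kernel) - 1) * dilation + 1
    out_len = len(signal) - span + 1
    if out_len <= 0:
        raise ValueError("signal is shorter than the dilated kernel span")

    out = np.empty(out_len)
    for i in range(out_len):
        # Pick every `dilation`-th sample inside the span, then weight and sum.
        taps = signal[i : i + span : dilation]
        out[i] = np.dot(taps, kernel)
    return out


# Toy input: with dilation=2 a 3-tap kernel covers 5 input samples.
toy_signal = np.arange(8)
print(dilated_conv1d(toy_signal, [1, 1, 1], dilation=2))
```

With a dilation of 2, a three-tap kernel spans five input samples, so the receptive field grows without adding weights. That property is the usual reason dilated convolution is chosen for long audio sequences.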

Original language	English
Journal	IEEE Transactions on Geoscience and Remote Sensing
Volume	60
DOIs	https://doi.org/10.1109/TGRS.2021.3060705
State	Published - 2022

Keywords

Heterogeneous semantic gap
remote sensing (RS) image-voice retrieval
semantics-consistent representation

Cite this

@article{fa55f25e1e454937ac1b50e81946cb48,

title = "Semantics-Consistent Representation Learning for Remote Sensing Image-Voice Retrieval",

abstract = "With the development of earth observation technology, massive amounts of remote sensing (RS) images are acquired. To find useful information from these images, cross-modal RS image-voice retrieval provides a new insight. This article aims to study the task of RS image-voice retrieval so as to search effective information from massive amounts of RS data. Existing methods for RS image-voice retrieval rely primarily on the pairwise relationship to narrow the heterogeneous semantic gap between images and voices. However, apart from the pairwise relationship included in the data sets, the intramodality and nonpaired intermodality relationships should also be considered simultaneously since the semantic consistency among nonpaired representations plays an important role in the RS image-voice retrieval task. Inspired by this, a semantics-consistent representation learning (SCRL) method is proposed for RS image-voice retrieval. The main novelty is that the proposed method takes the pairwise, intramodality, and nonpaired intermodality relationships into account simultaneously, thereby improving the semantic consistency of the learned representations for the RS image-voice retrieval. The proposed SCRL method consists of two main steps: 1) semantics encoding and 2) SCRL. First, an image encoding network is adopted to extract high-level image features with a transfer learning strategy, and a voice encoding network with dilated convolution is devised to obtain high-level voice features. Second, a consistent representation space is conducted by modeling the three kinds of relationships to narrow the heterogeneous semantic gap and learn semantics-consistent representations across two modalities. Extensive experimental results on three challenging RS image-voice data sets, including Sydney, UCM, and RSICD image-voice data sets, show the effectiveness of the proposed method.",

keywords = "Heterogeneous semantic gap, remote sensing (RS) image-voice retrieval, semantics-consistent representation",

author = "Hailong Ning and Bin Zhao and Yuan Yuan",

note = "Publisher Copyright: {\textcopyright} 1980-2012 IEEE.",

year = "2022",

doi = "10.1109/TGRS.2021.3060705",

language = "English",

volume = "60",

journal = "IEEE Transactions on Geoscience and Remote Sensing",

issn = "0196-2892",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

}

TY - JOUR

T1 - Semantics-Consistent Representation Learning for Remote Sensing Image-Voice Retrieval

AU - Ning, Hailong

AU - Zhao, Bin

AU - Yuan, Yuan

PY - 2022

Y1 - 2022

N2 - With the development of earth observation technology, massive amounts of remote sensing (RS) images are acquired. To find useful information from these images, cross-modal RS image-voice retrieval provides a new insight. This article aims to study the task of RS image-voice retrieval so as to search effective information from massive amounts of RS data. Existing methods for RS image-voice retrieval rely primarily on the pairwise relationship to narrow the heterogeneous semantic gap between images and voices. However, apart from the pairwise relationship included in the data sets, the intramodality and nonpaired intermodality relationships should also be considered simultaneously since the semantic consistency among nonpaired representations plays an important role in the RS image-voice retrieval task. Inspired by this, a semantics-consistent representation learning (SCRL) method is proposed for RS image-voice retrieval. The main novelty is that the proposed method takes the pairwise, intramodality, and nonpaired intermodality relationships into account simultaneously, thereby improving the semantic consistency of the learned representations for the RS image-voice retrieval. The proposed SCRL method consists of two main steps: 1) semantics encoding and 2) SCRL. First, an image encoding network is adopted to extract high-level image features with a transfer learning strategy, and a voice encoding network with dilated convolution is devised to obtain high-level voice features. Second, a consistent representation space is conducted by modeling the three kinds of relationships to narrow the heterogeneous semantic gap and learn semantics-consistent representations across two modalities. Extensive experimental results on three challenging RS image-voice data sets, including Sydney, UCM, and RSICD image-voice data sets, show the effectiveness of the proposed method.

AB - With the development of earth observation technology, massive amounts of remote sensing (RS) images are acquired. To find useful information from these images, cross-modal RS image-voice retrieval provides a new insight. This article aims to study the task of RS image-voice retrieval so as to search effective information from massive amounts of RS data. Existing methods for RS image-voice retrieval rely primarily on the pairwise relationship to narrow the heterogeneous semantic gap between images and voices. However, apart from the pairwise relationship included in the data sets, the intramodality and nonpaired intermodality relationships should also be considered simultaneously since the semantic consistency among nonpaired representations plays an important role in the RS image-voice retrieval task. Inspired by this, a semantics-consistent representation learning (SCRL) method is proposed for RS image-voice retrieval. The main novelty is that the proposed method takes the pairwise, intramodality, and nonpaired intermodality relationships into account simultaneously, thereby improving the semantic consistency of the learned representations for the RS image-voice retrieval. The proposed SCRL method consists of two main steps: 1) semantics encoding and 2) SCRL. First, an image encoding network is adopted to extract high-level image features with a transfer learning strategy, and a voice encoding network with dilated convolution is devised to obtain high-level voice features. Second, a consistent representation space is conducted by modeling the three kinds of relationships to narrow the heterogeneous semantic gap and learn semantics-consistent representations across two modalities. Extensive experimental results on three challenging RS image-voice data sets, including Sydney, UCM, and RSICD image-voice data sets, show the effectiveness of the proposed method.

KW - Heterogeneous semantic gap

KW - remote sensing (RS) image-voice retrieval

KW - semantics-consistent representation

UR - http://www.scopus.com/inward/record.url?scp=85102295943&partnerID=8YFLogxK

U2 - 10.1109/TGRS.2021.3060705

DO - 10.1109/TGRS.2021.3060705

M3 - Article

AN - SCOPUS:85102295943

SN - 0196-2892

VL - 60

JO - IEEE Transactions on Geoscience and Remote Sensing

JF - IEEE Transactions on Geoscience and Remote Sensing

ER -
