Boosting Cross-Modal Retrieval with MVSE++ and Reciprocal Neighbors

Wei Wei; Mengmeng Jiang; Xiangnan Zhang; Heng Liu; Chunna Tian

doi:10.1109/ACCESS.2020.2992187

School of Computer Science

Xidian University

Research output: Contribution to journal › Article › peer-review

9 Citations (Scopus)

Abstract

In this paper, we propose to boost cross-modal retrieval by mutually aligning images and captions in terms of both features and relationships. First, we propose a multi-feature based visual-semantic embedding (MVSE++) space for retrieving candidates in the other modality, which provides a more comprehensive representation of the visual content of objects and the scene context in images. This gives us more potential to find a more accurate and detailed caption for an image. However, captioning condenses the image content into a semantic description. The cross-modal neighboring relationships starting from the visual and semantic sides are asymmetric. To retrieve a better cross-modal neighbor, we propose to re-rank the initially retrieved candidates according to the k nearest reciprocal neighbors in the MVSE++ space. The method is evaluated on the benchmark datasets MSCOCO and Flickr30K with standard metrics. We achieve higher accuracy in caption retrieval and image retrieval at both R@1 and R@10.

Original language	English
Article number	9085386
Pages (from-to)	84642-84651
Number of pages	10
Journal	IEEE Access
Volume	8
DOI	https://doi.org/10.1109/ACCESS.2020.2992187
Publication status	Published - 2020

Cite this

@article{2b75bb94a7fa4cc79a72ffa5130328d9,

title = "Boosting Cross-Modal Retrieval with MVSE++ and Reciprocal Neighbors",

abstract = "In this paper, we propose to boost cross-modal retrieval by mutually aligning images and captions in terms of both features and relationships. First, we propose a multi-feature based visual-semantic embedding (MVSE++) space for retrieving candidates in the other modality, which provides a more comprehensive representation of the visual content of objects and the scene context in images. This gives us more potential to find a more accurate and detailed caption for an image. However, captioning condenses the image content into a semantic description. The cross-modal neighboring relationships starting from the visual and semantic sides are asymmetric. To retrieve a better cross-modal neighbor, we propose to re-rank the initially retrieved candidates according to the {k} nearest reciprocal neighbors in the MVSE++ space. The method is evaluated on the benchmark datasets MSCOCO and Flickr30K with standard metrics. We achieve higher accuracy in caption retrieval and image retrieval at both R@1 and R@10.",

keywords = "Cross-modal retrieval, re-ranking method, reciprocal neighbors, scene context, visual-semantic embedding",

author = "Wei Wei and Mengmeng Jiang and Xiangnan Zhang and Heng Liu and Chunna Tian",

note = "Publisher Copyright: {\textcopyright} 2013 IEEE.",

year = "2020",

doi = "10.1109/ACCESS.2020.2992187",

language = "English",

volume = "8",

pages = "84642--84651",

journal = "IEEE Access",

issn = "2169-3536",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

}

TY - JOUR

T1 - Boosting Cross-Modal Retrieval with MVSE++ and Reciprocal Neighbors

AU - Wei, Wei

AU - Jiang, Mengmeng

AU - Zhang, Xiangnan

AU - Liu, Heng

AU - Tian, Chunna

PY - 2020

Y1 - 2020

N2 - In this paper, we propose to boost cross-modal retrieval by mutually aligning images and captions in terms of both features and relationships. First, we propose a multi-feature based visual-semantic embedding (MVSE++) space for retrieving candidates in the other modality, which provides a more comprehensive representation of the visual content of objects and the scene context in images. This gives us more potential to find a more accurate and detailed caption for an image. However, captioning condenses the image content into a semantic description. The cross-modal neighboring relationships starting from the visual and semantic sides are asymmetric. To retrieve a better cross-modal neighbor, we propose to re-rank the initially retrieved candidates according to the k nearest reciprocal neighbors in the MVSE++ space. The method is evaluated on the benchmark datasets MSCOCO and Flickr30K with standard metrics. We achieve higher accuracy in caption retrieval and image retrieval at both R@1 and R@10.

AB - In this paper, we propose to boost cross-modal retrieval by mutually aligning images and captions in terms of both features and relationships. First, we propose a multi-feature based visual-semantic embedding (MVSE++) space for retrieving candidates in the other modality, which provides a more comprehensive representation of the visual content of objects and the scene context in images. This gives us more potential to find a more accurate and detailed caption for an image. However, captioning condenses the image content into a semantic description. The cross-modal neighboring relationships starting from the visual and semantic sides are asymmetric. To retrieve a better cross-modal neighbor, we propose to re-rank the initially retrieved candidates according to the k nearest reciprocal neighbors in the MVSE++ space. The method is evaluated on the benchmark datasets MSCOCO and Flickr30K with standard metrics. We achieve higher accuracy in caption retrieval and image retrieval at both R@1 and R@10.

KW - Cross-modal retrieval

KW - re-ranking method

KW - reciprocal neighbors

KW - scene context

KW - visual-semantic embedding

UR - http://www.scopus.com/inward/record.url?scp=85085187774&partnerID=8YFLogxK

U2 - 10.1109/ACCESS.2020.2992187

DO - 10.1109/ACCESS.2020.2992187

M3 - Article

AN - SCOPUS:85085187774

SN - 2169-3536

VL - 8

SP - 84642

EP - 84651

JO - IEEE Access

JF - IEEE Access

M1 - 9085386

ER -
