Abstract
In this paper, we propose to boost cross-modal retrieval by mutually aligning images and captions in terms of both features and relationships. First, we propose a multi-feature-based visual-semantic embedding (MVSE++) space for retrieving candidates in the other modality, which provides a more comprehensive representation of the visual content of objects and the scene context in images. This increases the likelihood of finding an accurate and detailed caption for an image. However, a caption condenses the image content into a semantic description, so the cross-modal neighboring relationships starting from the visual side and from the semantic side are asymmetric. To retrieve better cross-modal neighbors, we propose to re-rank the initially retrieved candidates according to their k-reciprocal nearest neighbors in the MVSE++ space. The method is evaluated on the MSCOCO and Flickr30K benchmark datasets with standard metrics, and achieves higher accuracy in both caption retrieval and image retrieval at R@1 and R@10.
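The re-ranking step can be illustrated with a minimal sketch of k-reciprocal nearest-neighbor retrieval in a shared embedding space. The function names, the cosine-similarity scoring, and the simple rule of promoting reciprocal candidates to the front are assumptions made for illustration; they do not reproduce the paper's exact re-ranking criterion.

```python
import numpy as np

def knn_indices(query, gallery, k):
    """Indices of the k nearest gallery vectors to `query`
    (cosine similarity; embeddings assumed L2-normalized)."""
    sims = gallery @ query
    return np.argsort(-sims)[:k]

def k_reciprocal_rerank(query_idx, image_embs, caption_embs, k=10):
    """Re-rank caption candidates for image `query_idx`: captions that also
    rank the query image among their own k nearest images (reciprocal
    neighbors) are moved to the front of the candidate list."""
    query = image_embs[query_idx]
    candidates = knn_indices(query, caption_embs, k)  # initial retrieval
    reciprocal, non_reciprocal = [], []
    for c in candidates:
        # Reciprocity check: is the query image in this caption's k-NN image set?
        if query_idx in knn_indices(caption_embs[c], image_embs, k):
            reciprocal.append(c)
        else:
            non_reciprocal.append(c)
    # Reciprocal neighbors first; each group keeps its original similarity order.
    return reciprocal + non_reciprocal

# Toy usage with random, L2-normalized embeddings in a shared 256-D space.
rng = np.random.default_rng(0)
image_embs = rng.normal(size=(100, 256))
caption_embs = rng.normal(size=(500, 256))
image_embs /= np.linalg.norm(image_embs, axis=1, keepdims=True)
caption_embs /= np.linalg.norm(caption_embs, axis=1, keepdims=True)
print(k_reciprocal_rerank(0, image_embs, caption_embs, k=10))
```

The same routine applies symmetrically to image retrieval from a caption query by swapping the roles of the two embedding sets.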
| Original language | English |
|---|---|
| Article number | 9085386 |
| Pages (from-to) | 84642-84651 |
| Number of pages | 10 |
| Journal | IEEE Access |
| Volume | 8 |
| DOIs | |
| State | Published - 2020 |
Keywords
- Cross-modal retrieval
- re-ranking method
- reciprocal neighbors
- scene context
- visual-semantic embedding