Exploiting textual queries for dynamically visual disambiguation

Zeren Sun, Yazhou Yao, Jimin Xiao, Lei Zhang, Jian Zhang, Zhenmin Tang

Research output: Contribution to journal › Article › peer-review

13 Citations (Scopus)

Abstract

Due to the high cost of manual annotation, learning directly from the web has attracted broad attention. One issue that limits the performance of current webly supervised models is the problem of visual polysemy. In this work, we present a novel framework that resolves visual polysemy by dynamically matching candidate text queries with retrieved images. Specifically, our proposed framework includes three major steps: we first discover candidate text queries, then dynamically select text queries according to the keyword-based image search results, and finally employ the proposed saliency-guided deep multi-instance learning (MIL) network to remove outliers and learn classification models for visual disambiguation. Compared to existing methods, our proposed approach can identify the correct visual senses, adapt to dynamic changes in the search results, remove outliers, and jointly learn the classification models. Extensive experiments and ablation studies on the CMU-Poly-30 and MIT-ISD datasets demonstrate the effectiveness of our proposed approach.
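The abstract does not detail the query-selection criterion. As a loose, hypothetical illustration of the dynamic query-selection step, the sketch below scores each candidate text query by the visual coherence of its retrieved images (mean pairwise cosine similarity of image embeddings) and keeps the most coherent candidates. The function names, the coherence criterion, and the toy data are assumptions for illustration, not the paper's actual method.

```python
import numpy as np

def coherence_score(embeddings: np.ndarray) -> float:
    """Mean pairwise cosine similarity of image embeddings.

    Hypothetical proxy for how visually consistent a query's
    retrieved images are; a tight, single-sense result set scores
    close to 1, while a mixed-sense set scores near 0.
    """
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T
    n = len(normed)
    # Average over off-diagonal entries only (the diagonal is all 1s).
    return (sims.sum() - n) / (n * (n - 1))

def select_queries(results: dict, top_k: int = 2) -> list:
    """Keep the top_k candidate text queries whose retrieved-image
    embeddings form the most coherent clusters."""
    ranked = sorted(results, key=lambda q: coherence_score(results[q]), reverse=True)
    return ranked[:top_k]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy retrieval results for three candidate queries of the
    # polysemous word "apple"; tight clusters simulate unambiguous
    # visual senses, the spread-out set simulates mixed senses.
    results = {
        "apple fruit": rng.normal(0.0, 0.1, (20, 64)) + 1.0,
        "apple inc logo": rng.normal(0.0, 0.1, (20, 64)) - 1.0,
        "apple": rng.normal(0.0, 1.0, (20, 64)),
    }
    print(select_queries(results, top_k=2))
    # -> ['apple fruit', 'apple inc logo']
```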

Original language: English
Article number: 107620
Journal: Pattern Recognition
Volume: 110
DOI
Publication status: Published - Feb 2021
