TY - JOUR
T1 - A deep grouping fusion neural network for multimedia content understanding
AU - Song, Lingyun
AU - Yu, Mengzhen
AU - Shang, Xuequn
AU - Lu, Yu
AU - Liu, Jun
AU - Zhang, Ying
AU - Li, Zhanhuai
N1 - Publisher Copyright:
© 2022 The Authors. IET Image Processing published by John Wiley & Sons Ltd on behalf of The Institution of Engineering and Technology.
PY - 2022/7
Y1 - 2022/7
AB - How Deep Neural Networks (DNNs) can best cope with understanding multimedia content remains an open problem, mainly due to two factors. First, conventional DNNs cannot effectively learn representations of images with sparse visual information, such as the images describing knowledge concepts in textbooks. Second, existing DNNs cannot effectively capture the fine-grained interactions between images and their text descriptions. To address these issues, we propose a deep Cross-Media Grouping Fusion Network (CMGFN), which has two distinctive properties: 1) CMGFN can effectively learn visual features from images with sparse visual information. This is achieved by first progressively shifting the attention of convolution filters toward valuable visual regions, and then enhancing the use of key visual information in feature construction. 2) Through a cross-media grouping co-attention mechanism, CMGFN can effectively exploit the interactions between visual features of different semantics and textual descriptions, learning cross-media features that represent different fine-grained semantics in different groups. Empirical studies demonstrate that CMGFN not only achieves state-of-the-art performance on multimedia documents containing sparse visual information, but also shows superior general applicability to other multimedia data, e.g., multimedia fake news.
UR - http://www.scopus.com/inward/record.url?scp=85128225145&partnerID=8YFLogxK
U2 - 10.1049/ipr2.12496
DO - 10.1049/ipr2.12496
M3 - Article
AN - SCOPUS:85128225145
SN - 1751-9659
VL - 16
SP - 2398
EP - 2411
JO - IET Image Processing
JF - IET Image Processing
IS - 9
ER -