EM-Trans: Edge-Aware Multimodal Transformer for RGB-D Salient Object Detection

Geng Chen; Qingyue Wang; Bo Dong; Ruitao Ma; Nian Liu; Huazhu Fu; Yong Xia

doi:10.1109/TNNLS.2024.3358858

School of Computer Science

Research output: Contribution to journal › Article › peer-review

12 Citations (Scopus)

Abstract

RGB-D salient object detection (SOD) has gained tremendous attention in recent years. In particular, transformers have been employed and have shown great potential. However, existing transformer models usually overlook the vital edge information, which is a major issue restricting the further improvement of SOD accuracy. To this end, we propose a novel edge-aware RGB-D SOD transformer, called EM-Trans, which explicitly models the edge information in a dual-band decomposition framework. Specifically, we employ two parallel decoder networks to learn the high-frequency edge and low-frequency body features from the low- and high-level features extracted from a two-stream multimodal backbone network, respectively. Next, we propose a cross-attention complementarity exploration module to enrich the edge/body features by exploiting the multimodal complementarity information. The refined features are then fed into our proposed color-hint guided fusion module for enhancing the depth feature and fusing the multimodal features. Finally, the resulting features are fused using our deeply supervised progressive fusion module, which progressively integrates edge and body features for predicting saliency maps. Our model explicitly considers the edge information for accurate RGB-D SOD, overcoming the limitations of existing methods and effectively improving the performance. Extensive experiments on benchmark datasets demonstrate that EM-Trans is an effective RGB-D SOD framework that outperforms the current state-of-the-art models, both quantitatively and qualitatively. A further extension to RGB-T SOD demonstrates the promising potential of our model in various kinds of multimodal SOD tasks.
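The dual-band idea in the abstract (a low-frequency body plus a high-frequency edge) can be illustrated with a toy sketch. This is not the authors' method: EM-Trans learns edge and body features with decoder networks, whereas the snippet below only splits a small binary mask into a blurred body and a residual edge. The box filter, the function names `box_blur` and `decompose`, and the 5x5 mask are all assumptions made for illustration.

```python
# Illustrative only: a box filter stands in for a generic low-pass operator.
# EM-Trans itself learns edge/body features; nothing here comes from the paper's code.

def box_blur(img, radius=1):
    """Low-pass filter: mean over a (2*radius+1)^2 window, clipped at the borders."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total, count = 0.0, 0
            for yy in range(max(0, y - radius), min(h, y + radius + 1)):
                for xx in range(max(0, x - radius), min(w, x + radius + 1)):
                    total += img[yy][xx]
                    count += 1
            out[y][x] = total / count
    return out


def decompose(img, radius=1):
    """Split img into a low-frequency body and a high-frequency edge residual."""
    body = box_blur(img, radius)
    edge = [[img[y][x] - body[y][x] for x in range(len(img[0]))]
            for y in range(len(img))]
    return body, edge


# Hypothetical 5x5 saliency mask: a 3x3 square object on an empty background.
mask = [[1.0 if 1 <= y <= 3 and 1 <= x <= 3 else 0.0 for x in range(5)]
        for y in range(5)]
body, edge = decompose(mask)
```

By construction, `body` and `edge` sum back to `mask`. The residual vanishes at the centre of the square, where every neighbour is salient, and is nonzero around the object boundary, which is the intuition behind treating edges as the high-frequency band.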

Original language	English
Pages (from-to)	3175-3188
Number of pages	14
Journal	IEEE Transactions on Neural Networks and Learning Systems
Volume	36
Issue number	2
DOI	https://doi.org/10.1109/TNNLS.2024.3358858
Publication status	Published - 2025

Cite this

@article{20ff5ba221cf49deaccc4c6b9d8ebc58,

title = "EM-Trans: Edge-Aware Multimodal Transformer for RGB-D Salient Object Detection",

abstract = "RGB-D salient object detection (SOD) has gained tremendous attention in recent years. In particular, transformers have been employed and have shown great potential. However, existing transformer models usually overlook the vital edge information, which is a major issue restricting the further improvement of SOD accuracy. To this end, we propose a novel edge-aware RGB-D SOD transformer, called EM-Trans, which explicitly models the edge information in a dual-band decomposition framework. Specifically, we employ two parallel decoder networks to learn the high-frequency edge and low-frequency body features from the low- and high-level features extracted from a two-stream multimodal backbone network, respectively. Next, we propose a cross-attention complementarity exploration module to enrich the edge/body features by exploiting the multimodal complementarity information. The refined features are then fed into our proposed color-hint guided fusion module for enhancing the depth feature and fusing the multimodal features. Finally, the resulting features are fused using our deeply supervised progressive fusion module, which progressively integrates edge and body features for predicting saliency maps. Our model explicitly considers the edge information for accurate RGB-D SOD, overcoming the limitations of existing methods and effectively improving the performance. Extensive experiments on benchmark datasets demonstrate that EM-Trans is an effective RGB-D SOD framework that outperforms the current state-of-the-art models, both quantitatively and qualitatively. A further extension to RGB-T SOD demonstrates the promising potential of our model in various kinds of multimodal SOD tasks.",

keywords = "Edge-aware model, multimodal learning, salient object detection (SOD), transformer",

author = "Geng Chen and Qingyue Wang and Bo Dong and Ruitao Ma and Nian Liu and Huazhu Fu and Yong Xia",

note = "Publisher Copyright: {\textcopyright} 2012 IEEE.",

year = "2025",

doi = "10.1109/TNNLS.2024.3358858",

language = "English",

volume = "36",

pages = "3175--3188",

journal = "IEEE Transactions on Neural Networks and Learning Systems",

issn = "2162-237X",

publisher = "IEEE Computational Intelligence Society",

number = "2",

}

TY - JOUR

T1 - EM-Trans

T2 - Edge-Aware Multimodal Transformer for RGB-D Salient Object Detection

AU - Chen, Geng

AU - Wang, Qingyue

AU - Dong, Bo

AU - Ma, Ruitao

AU - Liu, Nian

AU - Fu, Huazhu

AU - Xia, Yong

PY - 2025

Y1 - 2025

N2 - RGB-D salient object detection (SOD) has gained tremendous attention in recent years. In particular, transformers have been employed and have shown great potential. However, existing transformer models usually overlook the vital edge information, which is a major issue restricting the further improvement of SOD accuracy. To this end, we propose a novel edge-aware RGB-D SOD transformer, called EM-Trans, which explicitly models the edge information in a dual-band decomposition framework. Specifically, we employ two parallel decoder networks to learn the high-frequency edge and low-frequency body features from the low- and high-level features extracted from a two-stream multimodal backbone network, respectively. Next, we propose a cross-attention complementarity exploration module to enrich the edge/body features by exploiting the multimodal complementarity information. The refined features are then fed into our proposed color-hint guided fusion module for enhancing the depth feature and fusing the multimodal features. Finally, the resulting features are fused using our deeply supervised progressive fusion module, which progressively integrates edge and body features for predicting saliency maps. Our model explicitly considers the edge information for accurate RGB-D SOD, overcoming the limitations of existing methods and effectively improving the performance. Extensive experiments on benchmark datasets demonstrate that EM-Trans is an effective RGB-D SOD framework that outperforms the current state-of-the-art models, both quantitatively and qualitatively. A further extension to RGB-T SOD demonstrates the promising potential of our model in various kinds of multimodal SOD tasks.

AB - RGB-D salient object detection (SOD) has gained tremendous attention in recent years. In particular, transformers have been employed and have shown great potential. However, existing transformer models usually overlook the vital edge information, which is a major issue restricting the further improvement of SOD accuracy. To this end, we propose a novel edge-aware RGB-D SOD transformer, called EM-Trans, which explicitly models the edge information in a dual-band decomposition framework. Specifically, we employ two parallel decoder networks to learn the high-frequency edge and low-frequency body features from the low- and high-level features extracted from a two-stream multimodal backbone network, respectively. Next, we propose a cross-attention complementarity exploration module to enrich the edge/body features by exploiting the multimodal complementarity information. The refined features are then fed into our proposed color-hint guided fusion module for enhancing the depth feature and fusing the multimodal features. Finally, the resulting features are fused using our deeply supervised progressive fusion module, which progressively integrates edge and body features for predicting saliency maps. Our model explicitly considers the edge information for accurate RGB-D SOD, overcoming the limitations of existing methods and effectively improving the performance. Extensive experiments on benchmark datasets demonstrate that EM-Trans is an effective RGB-D SOD framework that outperforms the current state-of-the-art models, both quantitatively and qualitatively. A further extension to RGB-T SOD demonstrates the promising potential of our model in various kinds of multimodal SOD tasks.

KW - Edge-aware model

KW - multimodal learning

KW - salient object detection (SOD)

KW - transformer

UR - http://www.scopus.com/inward/record.url?scp=85187260239&partnerID=8YFLogxK

U2 - 10.1109/TNNLS.2024.3358858

DO - 10.1109/TNNLS.2024.3358858

M3 - Article

AN - SCOPUS:85187260239

SN - 2162-237X

VL - 36

SP - 3175

EP - 3188

JO - IEEE Transactions on Neural Networks and Learning Systems

JF - IEEE Transactions on Neural Networks and Learning Systems

IS - 2

ER -
