TY - JOUR
T1 - Fusion-Embedding Siamese Network for Light Field Salient Object Detection
AU - Chen, Geng
AU - Fu, Huazhu
AU - Zhou, Tao
AU - Xiao, Guobao
AU - Fu, Keren
AU - Xia, Yong
AU - Zhang, Yanning
N1 - Publisher Copyright:
© 1999-2012 IEEE.
PY - 2024
Y1 - 2024
N2 - Light field salient object detection (SOD) has shown remarkable success and gained considerable attention from the computer vision community. Existing methods usually employ a single- or two-stream network to detect saliency. However, these methods can handle at most two modalities at a time, preventing them from fully exploring the rich information in multi-modal data derived from light fields. To address this, we propose the first joint multi-modal learning framework for light field SOD, called FES-Net, which can take rich inputs that are not limited to two modalities. Specifically, we propose an attention-aware adaptation module that first transforms the multi-modal inputs for use in our joint learning framework. The transformed inputs are then fed to a Siamese network equipped with multiple embedded feature fusion modules to extract informative multi-modal features. Finally, we predict saliency maps from the extracted high-level features using a saliency decoder module. Our joint multi-modal learning framework effectively resolves the limitations of existing methods, providing efficient and effective multi-modal learning that fully exploits the valuable information in light field data for accurate saliency detection. Furthermore, we improve performance by adopting a Transformer as our backbone network. To the best of our knowledge, the improved version of our model, called FES-Trans, is the first attempt to address challenging light field SOD with the powerful Transformer technique. Extensive experiments on benchmark datasets demonstrate that our models are superior light field SOD approaches and remarkably outperform cutting-edge models.
AB - Light field salient object detection (SOD) has shown remarkable success and gained considerable attention from the computer vision community. Existing methods usually employ a single- or two-stream network to detect saliency. However, these methods can handle at most two modalities at a time, preventing them from fully exploring the rich information in multi-modal data derived from light fields. To address this, we propose the first joint multi-modal learning framework for light field SOD, called FES-Net, which can take rich inputs that are not limited to two modalities. Specifically, we propose an attention-aware adaptation module that first transforms the multi-modal inputs for use in our joint learning framework. The transformed inputs are then fed to a Siamese network equipped with multiple embedded feature fusion modules to extract informative multi-modal features. Finally, we predict saliency maps from the extracted high-level features using a saliency decoder module. Our joint multi-modal learning framework effectively resolves the limitations of existing methods, providing efficient and effective multi-modal learning that fully exploits the valuable information in light field data for accurate saliency detection. Furthermore, we improve performance by adopting a Transformer as our backbone network. To the best of our knowledge, the improved version of our model, called FES-Trans, is the first attempt to address challenging light field SOD with the powerful Transformer technique. Extensive experiments on benchmark datasets demonstrate that our models are superior light field SOD approaches and remarkably outperform cutting-edge models.
KW - Light field
KW - multi-modal learning
KW - salient object detection
KW - siamese network
KW - transformer
UR - http://www.scopus.com/inward/record.url?scp=85159839633&partnerID=8YFLogxK
U2 - 10.1109/TMM.2023.3274933
DO - 10.1109/TMM.2023.3274933
M3 - Article
AN - SCOPUS:85159839633
SN - 1520-9210
VL - 26
SP - 984
EP - 994
JO - IEEE Transactions on Multimedia
JF - IEEE Transactions on Multimedia
ER -