TY - GEN
T1 - Visual-Audio-based Fusion Network via Enhanced Transformer for Depression Detection
AU - Bai, Yunfeng
AU - Fang, Aiqing
AU - Li, Ying
N1 - Publisher Copyright: © 2026 SPIE.
PY - 2026/3/9
Y1 - 2026/3/9
N2 - Depression, a common psychological disorder, poses a potential threat to public safety. Although many approaches have been proposed to address the long-tail distribution issue, these methods still face challenges in modeling long-term dependencies and in feature selection. To address these issues, this paper proposes a visual-audio fusion network built on an enhanced Transformer. Specifically, a learnable Multimodal Alignment Module (MAM) is designed to map video and audio features uniformly to a consistent spatiotemporal resolution. Then, a bidirectional Crossmodal Interaction Module (CIM) is introduced so that video and audio each serve as query and context for the other, achieving fine-grained, symmetrical modeling of semantic-acoustic coupling. Finally, we design an Enhanced Transformer Module (ETM), which combines a random-depth, regularized Transformer backbone with dynamic absolute position encoding, thereby improving generalization and adaptability to variable-length inputs in small-sample scenarios while enhancing the ability to model long-term dependencies. Extensive quantitative experiments on public datasets show that our method achieves higher accuracy and precision in depression classification than existing methods.
KW - Deep Learning
KW - Depression Recognition
KW - Information Fusion
UR - https://www.scopus.com/pages/publications/105034712882
U2 - 10.1117/12.3110663
DO - 10.1117/12.3110663
M3 - Conference contribution
AN - SCOPUS:105034712882
T3 - Proceedings of SPIE - The International Society for Optical Engineering
BT - International Conference on Machine Learning and Artificial Intelligence Applications, MLAIA 2025
A2 - Zhou, Jianhua
PB - SPIE
T2 - International Conference on Machine Learning and Artificial Intelligence Applications, MLAIA 2025
Y2 - 12 December 2025 through 14 December 2025
ER -