Visual-Audio-based Fusion Network via Enhanced Transformer for Depression Detection

  • Shaanxi University of Chinese Medicine
  • Chongqing Normal University

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

Abstract

Depression, a common psychological disorder, poses a potential threat to public safety. Although many approaches have been proposed to address the long-tail distribution issue, these methods still face challenges in modeling long-term dependencies and in feature selection. To address these issues, this paper proposes a visual-audio fusion network framework built on an enhanced transformer. Specifically, a learnable Multimodal Alignment Module (MAM) is designed to map video and audio features to a consistent spatiotemporal resolution. Then, a bidirectional Crossmodal Interaction Module (CIM) is introduced so that video and audio each serve as query and context for the other, achieving fine-grained, symmetrical modeling of semantic-acoustic coupling. Finally, we design an Enhanced Transformer Module (ETM), which combines a randomly deep, regularized Transformer backbone with dynamic absolute position encoding, improving generalization and adaptability to variable-length inputs in small-sample scenarios while strengthening the modeling of long-term dependencies. Extensive quantitative experiments on public datasets show that our method achieves higher accuracy and precision in depression classification than existing methods.
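
The bidirectional query/context idea behind the CIM can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes single-head scaled dot-product attention with no learned projections, a residual connection, and inputs already aligned to the same feature width (the role the abstract assigns to the MAM). The names `cross_attention` and `bidirectional_interaction`, and the array sizes, are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query_seq, context_seq):
    """query_seq (Tq, d) attends to context_seq (Tc, d); returns output and weights."""
    d = query_seq.shape[-1]
    scores = query_seq @ context_seq.T / np.sqrt(d)  # (Tq, Tc)
    weights = softmax(scores, axis=-1)               # each row sums to 1
    return weights @ context_seq, weights

def bidirectional_interaction(video, audio):
    # Video queries audio, and audio queries video (symmetric coupling).
    video_from_audio, _ = cross_attention(video, audio)
    audio_from_video, _ = cross_attention(audio, video)
    # Residual connections are an assumption made for this sketch.
    return video + video_from_audio, audio + audio_from_video

rng = np.random.default_rng(0)
video = rng.normal(size=(8, 16))  # 8 time steps, 16-dim features (illustrative)
audio = rng.normal(size=(8, 16))
v_out, a_out = bidirectional_interaction(video, audio)
```

Each modality keeps its own sequence length and feature width after the exchange, so the two outputs can be passed on to a shared backbone.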

Original language: English
Title of host publication: International Conference on Machine Learning and Artificial Intelligence Applications, MLAIA 2025
Editors: Jianhua Zhou
Publisher: SPIE
ISBN (Electronic): 9798902322276
DOIs
State: Published - 9 Mar 2026
Event: International Conference on Machine Learning and Artificial Intelligence Applications, MLAIA 2025 - Shaoyang, China
Duration: 12 Dec 2025 to 14 Dec 2025

Publication series

Name: Proceedings of SPIE - The International Society for Optical Engineering
Volume: 14134
ISSN (Print): 0277-786X
ISSN (Electronic): 1996-756X

Conference

Conference: International Conference on Machine Learning and Artificial Intelligence Applications, MLAIA 2025
Country/Territory: China
City: Shaoyang
Period: 12/12/25 to 14/12/25

Keywords

  • Deep Learning
  • Depression Recognition
  • Information Fusion
