SSHR: Leveraging Self-supervised Hierarchical Representations for Multilingual Automatic Speech Recognition

  • Hongfei Xue
  • Qijie Shao
  • Kaixun Huang
  • Peikun Chen
  • Jie Liu
  • Lei Xie

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

2 Scopus citations

Abstract

Multilingual automatic speech recognition (ASR) systems have garnered attention for their potential to extend language coverage globally. While self-supervised learning (SSL) models, like MMS, have demonstrated their effectiveness in multilingual ASR, it is worth noting that various layers' representations potentially contain distinct information that has not been fully leveraged. In this study, we propose a novel method that leverages self-supervised hierarchical representations (SSHR) to fine-tune the MMS model. We first analyze the different layers of MMS and show that the middle layers capture language-related information, and the high layers encode content-related information, which gradually decreases in the final layers. Then, we extract a language-related frame from correlated middle layers and guide specific language extraction through self-attention mechanisms. Additionally, we steer the model toward acquiring more content-related information in the final layers using our proposed Cross-CTC. We evaluate SSHR on two multilingual datasets, Common Voice and ML-SUPERB, and the experimental results demonstrate that our method achieves state-of-the-art performance to the best of our knowledge.
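The abstract names a CTC-based objective (Cross-CTC) but does not define it. As background only, the following is a minimal pure-Python sketch of the standard CTC negative log-likelihood, plus a generic weighted combination of CTC losses taken from two layers. The combination is an assumption in the style of intermediate-CTC training and is not the paper's Cross-CTC; the names `ctc_nll`, `multi_layer_ctc_loss`, and `weight` are hypothetical.

```python
import math


def log_sum_exp(*xs):
    """Numerically stable log(sum(exp(x))) that tolerates -inf entries."""
    m = max(xs)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(x - m) for x in xs))


def ctc_nll(log_probs, target, blank=0):
    """Standard CTC negative log-likelihood via the forward algorithm.

    log_probs: list of T frames, each a list of per-symbol log-probabilities.
    target:    list of symbol indices (without blanks).
    """
    # Extended label sequence: blank, l1, blank, l2, ..., blank
    ext = [blank]
    for symbol in target:
        ext += [symbol, blank]
    size = len(ext)

    # Initialisation: a path may start with a blank or the first label.
    alpha = [-math.inf] * size
    alpha[0] = log_probs[0][ext[0]]
    if size > 1:
        alpha[1] = log_probs[0][ext[1]]

    for t in range(1, len(log_probs)):
        new = [-math.inf] * size
        for s in range(size):
            terms = [alpha[s]]                      # stay on the same symbol
            if s >= 1:
                terms.append(alpha[s - 1])          # advance by one position
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                terms.append(alpha[s - 2])          # skip the blank between distinct labels
            new[s] = log_sum_exp(*terms) + log_probs[t][ext[s]]
        alpha = new

    # A valid path ends on the last label or on the trailing blank.
    total = log_sum_exp(alpha[-1], alpha[-2]) if size > 1 else alpha[-1]
    return -total


def multi_layer_ctc_loss(final_log_probs, mid_log_probs, target, weight=0.3):
    """ASSUMPTION: generic weighted sum of CTC losses from two layers.

    This illustrates supervising more than one layer with CTC; it is not
    the Cross-CTC objective proposed in the paper.
    """
    final_loss = ctc_nll(final_log_probs, target)
    mid_loss = ctc_nll(mid_log_probs, target)
    return (1.0 - weight) * final_loss + weight * mid_loss
```

With two frames of uniform probability over a blank and one label, three of the four alignments collapse to the single-label target, so `ctc_nll` returns `-log(0.75)`.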

Original language: English
Title of host publication: 2024 IEEE International Conference on Multimedia and Expo, ICME 2024
Publisher: IEEE Computer Society
ISBN (Electronic): 9798350390155
State: Published - 2024
Event: 2024 IEEE International Conference on Multimedia and Expo, ICME 2024 - Niagara Falls, Canada
Duration: 15 Jul 2024 to 19 Jul 2024

Publication series

Name: Proceedings - IEEE International Conference on Multimedia and Expo
ISSN (Print): 1945-7871
ISSN (Electronic): 1945-788X

Conference

Conference: 2024 IEEE International Conference on Multimedia and Expo, ICME 2024
Country/Territory: Canada
City: Niagara Falls
Period: 15/07/24 to 19/07/24

Keywords

  • Multilingual ASR
  • low-resource ASR
  • representation analysis
  • self-supervised learning
