TY - GEN
T1 - SSHR: Leveraging Self-Supervised Hierarchical Representations for Multilingual Automatic Speech Recognition
T2 - 2024 IEEE International Conference on Multimedia and Expo, ICME 2024
AU - Xue, Hongfei
AU - Shao, Qijie
AU - Huang, Kaixun
AU - Chen, Peikun
AU - Liu, Jie
AU - Xie, Lei
N1 - Publisher Copyright:
© 2024 IEEE.
PY - 2024
Y1 - 2024
AB - Multilingual automatic speech recognition (ASR) systems have garnered attention for their potential to extend language coverage globally. While self-supervised learning (SSL) models, like MMS, have demonstrated their effectiveness in multilingual ASR, it is worth noting that various layers' representations potentially contain distinct information that has not been fully leveraged. In this study, we propose a novel method that leverages self-supervised hierarchical representations (SSHR) to fine-tune the MMS model. We first analyze the different layers of MMS and show that the middle layers capture language-related information, and the high layers encode content-related information, which gradually decreases in the final layers. Then, we extract a language-related frame from correlated middle layers and guide specific language extraction through self-attention mechanisms. Additionally, we steer the model toward acquiring more content-related information in the final layers using our proposed Cross-CTC. We evaluate SSHR on two multilingual datasets, Common Voice and ML-SUPERB, and the experimental results demonstrate that our method achieves state-of-the-art performance to the best of our knowledge.
KW - low-resource ASR
KW - Multilingual ASR
KW - representation analysis
KW - self-supervised learning
UR - http://www.scopus.com/inward/record.url?scp=85206571685&partnerID=8YFLogxK
U2 - 10.1109/ICME57554.2024.10687681
DO - 10.1109/ICME57554.2024.10687681
M3 - Conference contribution
AN - SCOPUS:85206571685
T3 - Proceedings - IEEE International Conference on Multimedia and Expo
BT - 2024 IEEE International Conference on Multimedia and Expo, ICME 2024
PB - IEEE Computer Society
Y2 - 15 July 2024 through 19 July 2024
ER -