Selective Invocation for Multilingual ASR: A Cost-effective Approach Adapting to Speech Recognition Difficulty

  • Hongfei Xue
  • Yufeng Tang
  • Jun Zhang
  • Xuelong Geng
  • Lei Xie
  • Northwestern Polytechnical University Xian
  • ByteDance Ltd.

Research output: Contribution to journal › Conference article › peer-review

Abstract

Multilingual automatic speech recognition (ASR) systems have advanced significantly, enabling a single model to handle multiple languages, but inherent linguistic differences and data imbalances make it difficult to reach state-of-the-art (SOTA) performance in every language. Language identification (LID) models can route speech to the appropriate ASR model, but this approach incurs high costs from invoking SOTA commercial models and suffers from inaccuracies due to misclassification. To overcome these limitations, we propose SIMA, a selective invocation method for multilingual ASR that adapts to the difficulty level of the input speech. Built on a spoken large language model (SLLM), SIMA evaluates whether the input is simple enough for direct transcription or requires the invocation of a SOTA ASR model. Our approach reduces word error rates by 18.7% compared to the SLLM and halves invocation costs compared to LID-based methods. Tests on three datasets show that SIMA is a scalable, cost-effective solution for multilingual ASR applications.
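
The routing idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how the SLLM signals that an input is too hard, so the sketch assumes a local model that returns either a transcript or `None`. All names (`LocalResult`, `RoutedResult`, `selective_transcribe`, `invocation_rate`) are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class LocalResult:
    # Assumption: None means the local model judged the input too hard
    # and is asking for the external SOTA ASR model instead.
    transcript: Optional[str]


@dataclass
class RoutedResult:
    transcript: str
    used_external: bool  # True if the costly external model was invoked


def selective_transcribe(
    audio: bytes,
    local_model: Callable[[bytes], LocalResult],
    external_asr: Callable[[bytes], str],
) -> RoutedResult:
    """Transcribe locally when possible; otherwise fall back to the external model."""
    local = local_model(audio)
    if local.transcript is not None:
        return RoutedResult(local.transcript, used_external=False)
    return RoutedResult(external_asr(audio), used_external=True)


def invocation_rate(results: Sequence[RoutedResult]) -> float:
    """Fraction of inputs sent to the external model (a rough proxy for cost)."""
    if not results:
        return 0.0
    return sum(r.used_external for r in results) / len(results)
```

In this sketch the cost saving comes from `invocation_rate` staying below 1.0: an LID-based router would call an external model for every utterance, whereas a selective router calls it only for inputs the local model declines to transcribe.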

Original language: English
Pages (from-to): 2580-2584
Number of pages: 5
Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
State: Published - 2025
Event: 26th Interspeech Conference 2025 - Rotterdam, Netherlands
Duration: 17 Aug 2025 to 21 Aug 2025

Keywords

  • multilingual ASR
  • selective invocation model
  • spoken large language models
