Ideal-LLM: Integrating Dual Encoders and Language-Adapted LLM for Multilingual Speech-to-Text

  • Hongfei Xue
  • Wei Ren
  • Xuelong Geng
  • Kun Wei
  • Longhao Li
  • Qijie Shao
  • Linju Yang
  • Kai Diao
  • Lei Xie

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Integrating audio encoders with LLMs has enabled models to process audio, improving speech-to-text tasks such as automatic speech recognition (ASR) and automatic speech translation (AST). However, these methods often overlook language adaptation in multilingual settings: they rely on multilingual data without adequately addressing differences between languages. To address this gap, we propose Ideal-LLM, which employs dual multilingual encoders to enrich language features and a language-adapted connector to target each language. By leveraging the complementary strengths of the Whisper and MMS encoders, our approach yields richer multilingual representations. In addition, the connector improves modality transformation through a weight selector tailored to each language. Experimental results show that Ideal-LLM improves ASR performance, achieving a 32.6% relative reduction in word error rate compared with the standard speech encoder integrated with LLMs, and yields an average BLEU score of 36.78 for AST.
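
To make the idea of a per-language weight selector concrete, the following is a minimal sketch and not the authors' implementation. It assumes the two encoder outputs are already frame-aligned and of equal dimension, and that the selector reduces to one softmax-normalized weight per encoder for each language. The names `SELECTOR_LOGITS` and `fuse`, and all numeric values, are hypothetical and chosen only for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a short list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical per-language selector logits, one per encoder.
# In a trained model these would be learned; the values here are made up.
SELECTOR_LOGITS = {
    "en": [0.0, 0.0],   # equal trust in both encoders
    "zh": [1.0, -1.0],  # favors the first encoder
}


def fuse(feats_a, feats_b, lang):
    """Blend two frame-aligned feature sequences with language-specific weights."""
    if len(feats_a) != len(feats_b):
        raise ValueError("feature sequences must be frame-aligned")
    w_a, w_b = softmax(SELECTOR_LOGITS[lang])
    return [
        [w_a * a + w_b * b for a, b in zip(frame_a, frame_b)]
        for frame_a, frame_b in zip(feats_a, feats_b)
    ]


# Toy stand-ins for the outputs of two speech encoders (2 frames, 2 dims).
feats_a = [[1.0, 2.0], [3.0, 4.0]]
feats_b = [[3.0, 0.0], [1.0, 2.0]]

fused_en = fuse(feats_a, feats_b, "en")
fused_zh = fuse(feats_a, feats_b, "zh")
```

With equal logits the fused features are the plain average of the two inputs, while the `"zh"` entry shifts the blend toward the first encoder. A real connector would also project the fused features into the LLM's embedding space, which this sketch omits.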

Original language: English
Title of host publication: Man-Machine Speech Communication - 20th National Conference, NCMMSC 2025, Proceedings
Editors: Jia Jia, Zhiyong Wu, Lijian Gao, Gongping Huang, Ya Li
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 47-58
Number of pages: 12
ISBN (Print): 9789819553815
State: Published - 2026
Event: 20th National Conference on Man-Machine Speech Communication, NCMMSC 2025 - Zhenjiang, China
Duration: 16 Oct 2025 to 19 Oct 2025

Publication series

Name: Communications in Computer and Information Science
Volume: 2662 CCIS
ISSN (Print): 1865-0929
ISSN (Electronic): 1865-0937

Conference

Conference: 20th National Conference on Man-Machine Speech Communication, NCMMSC 2025
Country/Territory: China
City: Zhenjiang
Period: 16/10/25 to 19/10/25

Keywords

  • Dual Encoders
  • Large Language Models
  • Multilingual Speech-to-Text
