A Survey of Multimodal Learning: Methods, Applications, and Future

Yuan Yuan, Zhaojian Li, Bin Zhao

Research output: Contribution to journal › Article › peer-review

2 Citations (Scopus)

Abstract

The multimodal interplay of the five fundamental senses (sight, hearing, smell, taste, and touch) provides humans with superior environmental perception and learning abilities. Inspired by the human perceptual system, multimodal machine learning aims to incorporate different forms of input, such as image, audio, and text, and to uncover their underlying connections through joint modeling. As one of the future development directions of artificial intelligence, multimodal machine learning warrants a systematic summary of its progress. In this article, we start from the forms of multimodal combination and provide a comprehensive survey of the emerging subject of multimodal machine learning, covering representative research approaches, the most recent advances, and their applications. Specifically, this article analyzes the relationships between different modalities in detail and identifies the key issues in multimodal research from the perspective of application scenarios. In addition, we thoroughly review state-of-the-art methods and datasets covered in multimodal learning research. We then identify the substantial challenges and promising development directions in this field. Finally, given its comprehensive nature, this survey can benefit both modality-specific and task-specific researchers and help advance the field.

Original language: English
Article number: 167
Journal: ACM Computing Surveys
Volume: 57
Issue number: 7
DOI
Publication status: Published - 20 Feb 2025
