TY - GEN
T1 - TAS: Personalized Text-guided Audio Spatialization
T2 - 32nd ACM International Conference on Multimedia, MM 2024
AU - Li, Zhaojian
AU - Zhao, Bin
AU - Yuan, Yuan
N1 - Publisher Copyright:
© 2024 ACM.
PY - 2024/10/28
Y1 - 2024/10/28
N2 - Synthesizing binaural audio according to personalized requirements is crucial for building immersive artificial spaces. Previous methods employ the visual modality to guide audio spatialization since it provides spatial information about objects. However, this paradigm depends on object visibility and strict audiovisual correspondence, which makes it difficult to satisfy personalized requirements. In addition, the visual counterpart to the audio may be degraded or even non-existent, which greatly limits the development of the field. To this end, we advocate exploring a novel task, Text-guided Audio Spatialization (TAS), whose goal is to convert mono audio into spatial audio based on text prompts. This approach circumvents harsh audiovisual conditions and allows for more flexible personalization. To facilitate this research, we construct the first TASBench dataset, which provides dense frame-level descriptions of the spatial locations of sounding objects in audio, enabling fine-grained spatial control. Since text prompts contain multiple sounding objects and spatial locations, the core issue of TAS is to establish the mapping between text semantic information and audio objects. To tackle this issue, we design a Semantic-Aware Fusion (SAF) module to capture text-aware audio features and propose a text-guided diffusion model to learn audio spatialization, generating spatial audio consistent with text prompts. Extensive experiments on TASBench compare the proposed method with several methods from related tasks, demonstrating that our method is a promising way to achieve personalized generation of the spatial sense of audio under text prompts.
KW - audio spatialization
KW - audio synthesis
KW - multimodal learning
KW - text-guided generation
UR - http://www.scopus.com/inward/record.url?scp=85209803789&partnerID=8YFLogxK
U2 - 10.1145/3664647.3681626
DO - 10.1145/3664647.3681626
M3 - Conference contribution
AN - SCOPUS:85209803789
T3 - MM 2024 - Proceedings of the 32nd ACM International Conference on Multimedia
SP - 9029
EP - 9037
BT - MM 2024 - Proceedings of the 32nd ACM International Conference on Multimedia
PB - Association for Computing Machinery, Inc
Y2 - 28 October 2024 through 1 November 2024
ER -