TAS: Personalized Text-guided Audio Spatialization

Zhaojian Li, Bin Zhao, Yuan Yuan

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Synthesizing binaural audio according to personalized requirements is crucial for building immersive artificial spaces. Previous methods employ the visual modality to guide audio spatialization, since it provides spatial information about sounding objects. However, this paradigm depends on object visibility and strict audiovisual correspondence, which makes it difficult to satisfy personalized requirements. In addition, the visual counterpart to the audio may be degraded or even absent, which greatly limits the development of the field. To this end, we advocate exploring a novel task, Text-guided Audio Spatialization (TAS), whose goal is to convert mono audio into spatial audio based on text prompts. This approach circumvents harsh audiovisual conditions and allows for more flexible personalization. To facilitate this research, we construct the first TASBench dataset. The dataset provides dense frame-level descriptions of the spatial locations of sounding objects in audio, enabling fine-grained spatial control. Since text prompts contain multiple sounding objects and spatial locations, the core issue of TAS is to establish the mapping between text semantics and audio objects. To tackle this issue, we design a Semantic-Aware Fusion (SAF) module to capture text-aware audio features and propose a text-guided diffusion model that learns audio spatialization and can generate spatial audio consistent with text prompts. Extensive experiments on TASBench compare the proposed method with several methods from related tasks, demonstrating that our method shows promise for the personalized generation of spatially immersive audio under text prompts.
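The abstract describes the SAF module only at a high level, and the paper's exact architecture is not given here. As a hedged illustration, the sketch below shows one common way such text-audio fusion is realized: cross-attention in which per-frame audio features query text-prompt token embeddings, yielding text-aware features that could then condition a diffusion denoiser. All names, dimensions, and structural choices (SemanticAwareFusion, audio_dim, text_dim, the residual layout) are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SemanticAwareFusion(nn.Module):
    """Illustrative sketch only: cross-attention fusion of text and audio.

    The paper's SAF module is not specified in this abstract; this is one
    common realization, with hypothetical names and dimensions.
    """

    def __init__(self, audio_dim: int = 512, text_dim: int = 768, n_heads: int = 8):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, audio_dim)  # map prompt tokens into audio space
        self.cross_attn = nn.MultiheadAttention(audio_dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(audio_dim)

    def forward(self, audio_feats: torch.Tensor, text_feats: torch.Tensor) -> torch.Tensor:
        # audio_feats: (B, T_audio, audio_dim) per-frame mono-audio features
        # text_feats:  (B, T_text, text_dim)   token embeddings of the text prompt
        text = self.text_proj(text_feats)
        # Each audio frame attends over the prompt tokens, so frames belonging to
        # a given sounding object can pick up that object's spatial description.
        fused, _ = self.cross_attn(query=audio_feats, key=text, value=text)
        # Residual connection preserves the original audio content.
        return self.norm(audio_feats + fused)

# Minimal shape check with random tensors.
saf = SemanticAwareFusion()
audio = torch.randn(2, 200, 512)  # 2 clips, 200 frames of audio features
text = torch.randn(2, 16, 768)    # 2 prompts, 16 tokens each
out = saf(audio, text)            # (2, 200, 512) text-aware audio features
```

In a full pipeline, features like `out` would condition a binaural diffusion model at each denoising step; that conditioning path is likewise an assumption, as the abstract does not detail it.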

Original language: English
Title of host publication: MM 2024 - Proceedings of the 32nd ACM International Conference on Multimedia
Publisher: Association for Computing Machinery, Inc
Pages: 9029-9037
Number of pages: 9
ISBN (Electronic): 9798400706868
DOIs
State: Published - 28 Oct 2024
Event: 32nd ACM International Conference on Multimedia, MM 2024 - Melbourne, Australia
Duration: 28 Oct 2024 - 1 Nov 2024

Publication series

Name: MM 2024 - Proceedings of the 32nd ACM International Conference on Multimedia

Conference

Conference: 32nd ACM International Conference on Multimedia, MM 2024
Country/Territory: Australia
City: Melbourne
Period: 28/10/24 - 1/11/24

Keywords

  • audio spatialization
  • audio synthesis
  • multimodal learning
  • text-guided generation
