TY - JOUR
T1 - Drop the Beat! Freestyler for Accompaniment Conditioned Rapping Voice Generation
AU - Ning, Ziqian
AU - Wang, Shuai
AU - Jiang, Yuepeng
AU - Yao, Jixun
AU - He, Lei
AU - Pan, Shifeng
AU - Ding, Jie
AU - Xie, Lei
N1 - Publisher Copyright:
Copyright © 2025, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
PY - 2025/4/11
Y1 - 2025/4/11
N2 - Rap, a prominent genre of vocal performance, remains underexplored in vocal generation. General vocal synthesis depends on precise note and duration inputs, requiring users to have related musical knowledge, which limits flexibility. In contrast, rap typically features simpler melodies, with a core focus on a strong rhythmic sense that harmonizes with accompanying beats. In this paper, we propose Freestyler, the first system that generates rapping vocals directly from lyrics and accompaniment inputs. Freestyler utilizes language model-based token generation, followed by a conditional flow matching model to produce spectrograms and a neural vocoder to restore audio. It allows a 3-second prompt to enable zero-shot timbre control. Due to the scarcity of publicly available rap datasets, we also present RapBank, a rap song dataset collected from the internet, alongside a meticulously designed processing pipeline. Experimental results show that Freestyler produces high-quality rapping voice generation with enhanced naturalness and strong alignment with accompanying beats, both stylistically and rhythmically.
AB - Rap, a prominent genre of vocal performance, remains underexplored in vocal generation. General vocal synthesis depends on precise note and duration inputs, requiring users to have related musical knowledge, which limits flexibility. In contrast, rap typically features simpler melodies, with a core focus on a strong rhythmic sense that harmonizes with accompanying beats. In this paper, we propose Freestyler, the first system that generates rapping vocals directly from lyrics and accompaniment inputs. Freestyler utilizes language model-based token generation, followed by a conditional flow matching model to produce spectrograms and a neural vocoder to restore audio. It allows a 3-second prompt to enable zero-shot timbre control. Due to the scarcity of publicly available rap datasets, we also present RapBank, a rap song dataset collected from the internet, alongside a meticulously designed processing pipeline. Experimental results show that Freestyler produces high-quality rapping voice generation with enhanced naturalness and strong alignment with accompanying beats, both stylistically and rhythmically.
UR - http://www.scopus.com/inward/record.url?scp=105004204362&partnerID=8YFLogxK
U2 - 10.1609/aaai.v39i23.34680
DO - 10.1609/aaai.v39i23.34680
M3 - Conference article
AN - SCOPUS:105004204362
SN - 2159-5399
VL - 39
SP - 24966
EP - 24974
JO - Proceedings of the AAAI Conference on Artificial Intelligence
JF - Proceedings of the AAAI Conference on Artificial Intelligence
IS - 23
T2 - 39th Annual AAAI Conference on Artificial Intelligence, AAAI 2025
Y2 - 25 February 2025 through 4 March 2025
ER -