Self-validated Story Segmentation of Chinese Broadcast News

Wei Feng, Lei Xie, Jin Zhang, Yujun Zhang, Yanning Zhang

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-reviewed

Abstract

Automatic story segmentation is an important prerequisite for semantic-level applications. The normalized cuts (NCuts) method has recently shown great promise for segmenting English spoken lectures. However, its assumption that the exact story number of each file is known in advance significantly limits its ability to handle a large number of transcripts. Moreover, it is still unclear how to apply such a method to Chinese in the presence of speech recognition errors. Addressing these two problems, we propose a self-validated NCuts (SNCuts) algorithm for segmenting Chinese broadcast news via inaccurate lexical cues generated by a Chinese large vocabulary continuous speech recognizer (LVCSR). Owing to the particular characteristics of the Chinese language, we present a subword-level graph embedding for the erroneous LVCSR transcripts. We regularize the NCuts criterion by a general exponential prior on story numbers, respecting the principle of Occam's razor. Given the maximum story number as a general parameter, we can automatically obtain reasonable segmentations for a large number of news transcripts, with the story number automatically determined for each file and with complexity comparable to that of alternative non-self-validated methods. Extensive experiments on a benchmark corpus show that: (i) the proposed SNCuts algorithm efficiently produces segmentation quality comparable to, or even better than, that of other state-of-the-art methods that take the true story number as an input parameter; and (ii) the subword-level embedding always helps to recover lexical cohesion in erroneous Chinese transcripts, thus improving both segmentation accuracy and robustness to LVCSR errors.

Original language: English
Title of host publication: Advances in Brain Inspired Cognitive Systems - 9th International Conference, BICS 2018, Proceedings
Editors: Amir Hussain, Bin Luo, Jiangbin Zheng, Xinbo Zhao, Cheng-Lin Liu, Jinchang Ren, Huimin Zhao
Publisher: Springer Verlag
Pages: 568-578
Number of pages: 11
ISBN (Print): 9783030005627
DOI
Publication status: Published - 2018
Event: 9th International Conference on Brain-Inspired Cognitive Systems, BICS 2018 - Xi'an, China
Duration: 7 Jul 2018 to 8 Jul 2018

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 10989 LNAI
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 9th International Conference on Brain-Inspired Cognitive Systems, BICS 2018
Country/Territory: China
City: Xi'an
Period: 7 Jul 2018 to 8 Jul 2018
