Self-validated Story Segmentation of Chinese Broadcast News

Wei Feng, Lei Xie, Jin Zhang, Yujun Zhang, Yanning Zhang

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Automatic story segmentation is an important prerequisite for semantic-level applications. The normalized cuts (NCuts) method has recently shown great promise for segmenting English spoken lectures. However, its assumption that the exact number of stories in each file is known in advance significantly limits its ability to handle a large number of transcripts. Moreover, it remains unclear how to apply the method to Chinese in the presence of speech recognition errors. Addressing these two problems, we propose a self-validated NCuts (SNCuts) algorithm for segmenting Chinese broadcast news via inaccurate lexical cues generated by a Chinese large vocabulary continuous speech recognizer (LVCSR). Because of the particular characteristics of the Chinese language, we present a subword-level graph embedding for the erroneous LVCSR transcripts. We regularize the NCuts criterion by a general exponential prior on the number of stories, respecting the principle of Occam’s razor. Given the maximum story number as a general parameter, we can automatically obtain reasonable segmentations for a large number of news transcripts, with the story number determined automatically for each file, and with complexity comparable to that of alternative non-self-validated methods. Extensive experiments on a benchmark corpus show that: (i) the proposed SNCuts algorithm efficiently produces segmentation quality comparable to, or even better than, that of other state-of-the-art methods that take the true story number as an input parameter; and (ii) the subword-level embedding always helps to recover lexical cohesion in erroneous Chinese transcripts, thus improving both segmentation accuracy and robustness to LVCSR errors.
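The two ideas in the abstract, a subword-level similarity graph and a story count chosen under an exponential prior, can be illustrated with a minimal Python sketch. It is not the authors' SNCuts implementation: the abstract does not give the regularized objective, so the scoring rule used here (normalized association minus `lam` per story), the character-bigram subword units, and the names `char_ngrams`, `cosine`, `similarity_matrix`, `segment`, and `lam` are all assumptions made for illustration.

```python
import math
from collections import Counter

def char_ngrams(text, n=2):
    """Overlapping character n-grams, a simple stand-in for subword units."""
    chars = [c for c in text if not c.isspace()]
    return Counter("".join(chars[i:i + n]) for i in range(len(chars) - n + 1))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(v * b[k] for k, v in a.items() if k in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def similarity_matrix(sentences, n=2):
    """Dense sentence-by-sentence similarity graph over n-gram vectors."""
    vecs = [char_ngrams(s, n) for s in sentences]
    return [[cosine(u, v) for v in vecs] for u in vecs]

def segment(W, max_stories, lam=0.5):
    """Return sorted boundary indices for a linear segmentation of W.

    For each K up to max_stories, dynamic programming finds the contiguous
    K-way split with minimum normalized-cut cost. Ncut alone is minimized
    trivially by K = 1, so (as an assumption of this sketch) each K is scored
    by normalized association, K - Ncut(K), minus the penalty lam * K that
    corresponds to a prior proportional to exp(-lam * K).
    """
    N = len(W)
    K = min(max_stories, N)
    # 2-D prefix sums so any rectangular block sum is O(1).
    P = [[0.0] * (N + 1) for _ in range(N + 1)]
    for i in range(N):
        for j in range(N):
            P[i + 1][j + 1] = W[i][j] + P[i][j + 1] + P[i + 1][j] - P[i][j]

    def block(r0, r1, c0, c1):
        return P[r1][c1] - P[r0][c1] - P[r1][c0] + P[r0][c0]

    def cost(i, j):
        """cut(A, V - A) / vol(A) for the segment A = [i, j)."""
        vol = block(i, j, 0, N)
        return (vol - block(i, j, i, j)) / vol if vol > 0 else 1.0

    INF = float("inf")
    # best[k][j]: minimum Ncut when the first j sentences form k segments.
    best = [[INF] * (N + 1) for _ in range(K + 1)]
    back = [[0] * (N + 1) for _ in range(K + 1)]
    best[0][0] = 0.0
    for k in range(1, K + 1):
        for j in range(k, N + 1):
            for i in range(k - 1, j):
                c = best[k - 1][i] + cost(i, j)
                if c < best[k][j]:
                    best[k][j], back[k][j] = c, i

    # Self-validation step: pick the story count with the best penalized score.
    k_star = max(range(1, K + 1), key=lambda k: k - best[k][N] - lam * k)

    bounds, j = [], N
    for k in range(k_star, 0, -1):
        j = back[k][j]
        if j > 0:
            bounds.append(j)
    return sorted(bounds)
```

In this sketch `max_stories` plays the role of the maximum story number mentioned in the abstract, and `segment(similarity_matrix(sentences), max_stories)` returns the sentence indices at which new stories are taken to begin. Larger values of `lam` favor fewer stories.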

Original language: English
Title of host publication: Advances in Brain Inspired Cognitive Systems - 9th International Conference, BICS 2018, Proceedings
Editors: Amir Hussain, Bin Luo, Jiangbin Zheng, Xinbo Zhao, Cheng-Lin Liu, Jinchang Ren, Huimin Zhao
Publisher: Springer Verlag
Pages: 568-578
Number of pages: 11
ISBN (Print): 9783030005627
State: Published - 2018
Event: 9th International Conference on Brain-Inspired Cognitive Systems, BICS 2018 - Xi'an, China
Duration: 7 Jul 2018 to 8 Jul 2018

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 10989 LNAI
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 9th International Conference on Brain-Inspired Cognitive Systems, BICS 2018
Country/Territory: China
City: Xi'an
Period: 7/07/18 to 8/07/18

Keywords

  • Chinese broadcast news
  • Normalized cuts
  • Self-validation
  • Story segmentation
  • Subwords
  • Topic detection
