On the effectiveness of subwords for lexical cohesion based story segmentation of Chinese broadcast news

L. Xie; Y. L. Yang; Z. Q. Liu

doi:10.1016/j.ins.2011.02.013

School of Computer Science

Research output: Contribution to journal › Article › Peer-reviewed

26 citations (Scopus)

Abstract

Story segmentation divides a multimedia stream into homogeneous regions, each addressing a central topic. Lexical cohesion is a reasonable indicator for story boundaries. However, for story segmentation of Chinese broadcast news, directly measuring word-level lexical cohesion is not applicable, because the text transcribed from audio is highly unreliable and the inevitable speech recognition errors may significantly break word cohesion, thus heavily degrading the segmentation performance. To address the problem, we propose to use subword-level cohesion in story segmentation of Chinese broadcast news, because Chinese subwords play important semantic roles and are robust to speech recognition errors. We provide a comprehensive study on the effectiveness of subword units in story segmentation of Chinese speech recognition transcripts, and analyze the influence of recognition errors on the segmentation performance. Specifically, we study subword-based TextTiling and lexical chaining approaches to story segmentation, in which lexical cohesion is measured using either character or syllable n-grams (n = 1, 2, 3, 4). Our extensive experiments demonstrate that subword unigrams and bigrams improve on word-based methods. For instance, tested on the CCTV corpus, character unigram lexical chaining obtains a relative F1-measure gain of 12% over words on erroneous brief news transcripts (with a word error rate of 40.9%). Generally, we find that subword-based methods can often obtain better segmentation than word-based ones for both error-free and erroneous transcripts.
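To make the idea of subword-level cohesion concrete, the following is a minimal sketch of a TextTiling-style cohesion score computed over character n-grams. It is not the authors' implementation: the function names (`char_ngrams`, `gap_similarities`, `weakest_gap`), the window size, and the use of plain cosine similarity are assumptions chosen for illustration, and the sketch omits the depth scoring and smoothing that full TextTiling uses.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n):
    """Return overlapping character n-grams of text, ignoring whitespace."""
    chars = [c for c in text if not c.isspace()]
    return ["".join(chars[i:i + n]) for i in range(len(chars) - n + 1)]


def cosine(a, b):
    """Cosine similarity between two Counter term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def gap_similarities(sentences, n=1, window=2):
    """Cohesion score at each gap between adjacent sentences.

    Each gap compares the n-gram counts of up to `window` sentences on
    its left with up to `window` sentences on its right.
    """
    grams = [Counter(char_ngrams(s, n)) for s in sentences]
    scores = []
    for gap in range(1, len(sentences)):
        left = sum(grams[max(0, gap - window):gap], Counter())
        right = sum(grams[gap:gap + window], Counter())
        scores.append(cosine(left, right))
    return scores


def weakest_gap(sentences, n=1, window=2):
    """Index of the sentence that starts after the least cohesive gap."""
    scores = gap_similarities(sentences, n, window)
    return min(range(len(scores)), key=scores.__getitem__) + 1
```

With `n=1` the units are single characters, so a recognition error that corrupts one character of a word still leaves the remaining characters available for matching, which is the intuition behind the robustness claim. A low score at a gap suggests a candidate story boundary.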

Original language	English
Pages (from-to)	2873-2891
Number of pages	19
Journal	Information Sciences
Volume	181
Issue number	13
DOI	https://doi.org/10.1016/j.ins.2011.02.013
Publication status	Published - 1 Jul 2011

Cite this

@article{f3a49071999e461c856fb87c7626edf2,

title = "On the effectiveness of subwords for lexical cohesion based story segmentation of Chinese broadcast news",

keywords = "Lexical cohesion, Spoken document retrieval, Story segmentation, Subwords, Topic detection and tracking, Topic segmentation",

author = "Xie, L. and Yang, {Y. L.} and Liu, {Z. Q.}",

year = "2011",

month = jul,

day = "1",

doi = "10.1016/j.ins.2011.02.013",

language = "English",

volume = "181",

pages = "2873--2891",

journal = "Information Sciences",

issn = "0020-0255",

publisher = "Elsevier Inc.",

number = "13",

}

TY - JOUR

T1 - On the effectiveness of subwords for lexical cohesion based story segmentation of Chinese broadcast news

AU - Xie, L.

AU - Yang, Y. L.

AU - Liu, Z. Q.

PY - 2011/7/1

Y1 - 2011/7/1

KW - Lexical cohesion

KW - Spoken document retrieval

KW - Story segmentation

KW - Subwords

KW - Topic detection and tracking

KW - Topic segmentation

UR - http://www.scopus.com/inward/record.url?scp=79953862205&partnerID=8YFLogxK

U2 - 10.1016/j.ins.2011.02.013

DO - 10.1016/j.ins.2011.02.013

M3 - Article

AN - SCOPUS:79953862205

SN - 0020-0255

VL - 181

SP - 2873

EP - 2891

JO - Information Sciences

JF - Information Sciences

IS - 13

ER -
