Probabilistic latent semantic analysis for broadcast news story segmentation

Mimi Lu; Cheung Chi Leung; Lei Xie; Bin Ma; Haizhou Li

School of Computer Science

Research output: Contribution to journal › Conference article › peer-review

13 Scopus citations

Abstract

This paper proposes using probabilistic latent semantic analysis (PLSA) for broadcast news (BN) story segmentation. PLSA exploits deeper underlying relations among terms beyond their occurrences, so conceptual matching can replace literal term matching. Unlike text segmentation, lexical-based BN story segmentation has to be carried out over LVCSR transcripts, where the incorrect recognition of out-of-vocabulary words inevitably degrades the semantic relations. We use phoneme subwords as the basic term units to address this problem. We integrate a cross-entropy measure with PLSA to capture lexical cohesion and compare its performance with the widely used cosine similarity metric. Furthermore, we evaluate two approaches to story boundary identification: TextTiling and dynamic programming (DP). Experimental results show that the PLSA-based methods bring a significant performance boost to story segmentation and that the cross-entropy-based DP approach performs best.
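
To make the two cohesion measures named in the abstract concrete, here is a minimal Python sketch. It is not the authors' implementation: the topic mixtures are hypothetical stand-ins for PLSA outputs P(z | block), the direction of the cross entropy (left block against right block) is an assumption, and the boundary picker is a simplified local-minimum rule in the spirit of TextTiling rather than the paper's TextTiling or DP procedure.

```python
import math


def cosine_similarity(p, q):
    """Cosine of the angle between two non-negative vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0


def cross_entropy(p, q, eps=1e-12):
    """H(p, q) = -sum_i p_i * log(q_i); lower means q explains p better."""
    return -sum(a * math.log(max(b, eps)) for a, b in zip(p, q))


def boundary_candidates(cohesion):
    """Return gap indices where cohesion is a strict local minimum."""
    return [
        i
        for i in range(1, len(cohesion) - 1)
        if cohesion[i] < cohesion[i - 1] and cohesion[i] < cohesion[i + 1]
    ]


# Hypothetical topic mixtures for six consecutive transcript blocks
# (assumed values, not data from the paper).
blocks = [
    [0.80, 0.10, 0.10],
    [0.70, 0.20, 0.10],
    [0.75, 0.15, 0.10],
    [0.10, 0.80, 0.10],
    [0.10, 0.70, 0.20],
    [0.15, 0.75, 0.10],
]

# Cohesion across each gap between adjacent blocks.
cos_scores = [cosine_similarity(a, b) for a, b in zip(blocks, blocks[1:])]
# Negate cross entropy so that, as with cosine, higher means more cohesive.
ce_cohesion = [-cross_entropy(a, b) for a, b in zip(blocks, blocks[1:])]

print(boundary_candidates(cos_scores))
print(boundary_candidates(ce_cohesion))
```

On this toy input both measures flag the gap between the third and fourth blocks, where the dominant topic changes. Note that cross entropy is asymmetric, so swapping the two arguments can change the scores.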

Original language	English
Pages (from-to)	1109-1112
Number of pages	4
Journal	Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
State	Published - 2011
Event	12th Annual Conference of the International Speech Communication Association, INTERSPEECH 2011, Florence, Italy (27 Aug 2011 to 31 Aug 2011)

Keywords

Cross entropy
Dynamic
Probabilistic latent semantic analysis
Spoken document retrieval
Story segmentation

Cite this

@article{c9eb01dc1cdc4aee8bf912d0d5edd221,

title = "Probabilistic latent semantic analysis for broadcast news story segmentation",

abstract = "This paper proposes to perform probabilistic latent semantic analysis (PLSA) for broadcast news (BN) story segmentation. PLSA exploits a deeper underlying relation among terms beyond their occurrences thus conceptual matching can be employed to replace literal term matching. Different from text segmentation, lexical based BN story segmentation has to be carried out over LVCSR transcripts, where the incorrect recognition of out-of-vocabulary words inevitably impacts the semantic relation. We use phoneme subwords as the basic term units to address this problem. We integrate a cross entropy measurement with PLSA to depict lexical cohesion and compare its performance with the widely used cosine similarity metric. Furthermore, we evaluate two approaches, namely TextTiling and dynamic programming (DP), for story boundary identification. Experimental results show that the PLSA based methods bring a significant performance boost to story segmentation and the cross entropy based DP approach provides the best performance.",

keywords = "Cross entropy, Dynamic, Probabilistic latent semantic analysis, Spoken document retrieval, Story segmentation",

author = "Mimi Lu and Leung, {Cheung Chi} and Lei Xie and Bin Ma and Haizhou Li",

year = "2011",

language = "English",

pages = "1109--1112",

journal = "Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH",

issn = "2308-457X",

note = "12th Annual Conference of the International Speech Communication Association, INTERSPEECH 2011 ; Conference date: 27-08-2011 Through 31-08-2011",

}

TY - JOUR

T1 - Probabilistic latent semantic analysis for broadcast news story segmentation

AU - Lu, Mimi

AU - Leung, Cheung Chi

AU - Xie, Lei

AU - Ma, Bin

AU - Li, Haizhou

PY - 2011

Y1 - 2011

N2 - This paper proposes to perform probabilistic latent semantic analysis (PLSA) for broadcast news (BN) story segmentation. PLSA exploits a deeper underlying relation among terms beyond their occurrences thus conceptual matching can be employed to replace literal term matching. Different from text segmentation, lexical based BN story segmentation has to be carried out over LVCSR transcripts, where the incorrect recognition of out-of-vocabulary words inevitably impacts the semantic relation. We use phoneme subwords as the basic term units to address this problem. We integrate a cross entropy measurement with PLSA to depict lexical cohesion and compare its performance with the widely used cosine similarity metric. Furthermore, we evaluate two approaches, namely TextTiling and dynamic programming (DP), for story boundary identification. Experimental results show that the PLSA based methods bring a significant performance boost to story segmentation and the cross entropy based DP approach provides the best performance.

AB - This paper proposes to perform probabilistic latent semantic analysis (PLSA) for broadcast news (BN) story segmentation. PLSA exploits a deeper underlying relation among terms beyond their occurrences thus conceptual matching can be employed to replace literal term matching. Different from text segmentation, lexical based BN story segmentation has to be carried out over LVCSR transcripts, where the incorrect recognition of out-of-vocabulary words inevitably impacts the semantic relation. We use phoneme subwords as the basic term units to address this problem. We integrate a cross entropy measurement with PLSA to depict lexical cohesion and compare its performance with the widely used cosine similarity metric. Furthermore, we evaluate two approaches, namely TextTiling and dynamic programming (DP), for story boundary identification. Experimental results show that the PLSA based methods bring a significant performance boost to story segmentation and the cross entropy based DP approach provides the best performance.

KW - Cross entropy

KW - Dynamic

KW - Probabilistic latent semantic analysis

KW - Spoken document retrieval

KW - Story segmentation

UR - http://www.scopus.com/inward/record.url?scp=84865703778&partnerID=8YFLogxK

M3 - Conference article

AN - SCOPUS:84865703778

SN - 2308-457X

SP - 1109

EP - 1112

JO - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH

JF - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH

T2 - 12th Annual Conference of the International Speech Communication Association, INTERSPEECH 2011

Y2 - 27 August 2011 through 31 August 2011

ER -
