Probabilistic latent semantic analysis for broadcast news story segmentation

Mimi Lu, Cheung Chi Leung, Lei Xie, Bin Ma, Haizhou Li

Research output: Contribution to journal, Conference article, peer-review

13 Scopus citations

Abstract

This paper proposes probabilistic latent semantic analysis (PLSA) for broadcast news (BN) story segmentation. PLSA exploits deeper underlying relations among terms beyond their occurrences, so that conceptual matching can replace literal term matching. Unlike text segmentation, lexical-based BN story segmentation has to be carried out over large vocabulary continuous speech recognition (LVCSR) transcripts, where the incorrect recognition of out-of-vocabulary words inevitably distorts the semantic relations. To address this problem, we use phoneme subwords as the basic term units. We integrate a cross entropy measure with PLSA to capture lexical cohesion and compare its performance with the widely used cosine similarity metric. Furthermore, we evaluate two approaches to story boundary identification: TextTiling and dynamic programming (DP). Experimental results show that the PLSA-based methods bring a significant performance boost to story segmentation and that the cross entropy based DP approach performs best.
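The two lexical cohesion measures named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the per-block topic distributions are made-up values standing in for the output of a trained PLSA model, the function names are hypothetical, and a plain local-minimum rule stands in for the depth scoring that TextTiling normally applies. The DP search is not shown.

```python
import math

def cosine_similarity(p, q):
    """Cosine of the angle between two topic vectors (higher = more cohesive)."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return dot / (norm_p * norm_q)

def cross_entropy(p, q, eps=1e-12):
    """H(p, q) = -sum_k p_k * log(q_k); lower means q matches p more closely."""
    return -sum(a * math.log(max(b, eps)) for a, b in zip(p, q))

def local_minima(scores):
    """Indices of interior points that are strictly lower than both neighbours."""
    return [
        i for i in range(1, len(scores) - 1)
        if scores[i] < scores[i - 1] and scores[i] < scores[i + 1]
    ]

# Hypothetical topic distributions P(z | block) for four consecutive blocks
# of a transcript (assumed values, not taken from the paper).
blocks = [
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.1, 0.7, 0.2],
]

# Cohesion across each gap between adjacent blocks.
pairs = list(zip(blocks, blocks[1:]))
cos_scores = [cosine_similarity(p, q) for p, q in pairs]
# Negate cross entropy so that, like cosine, a higher score means more cohesion.
ce_scores = [-cross_entropy(p, q) for p, q in pairs]

# A dip in cohesion is a candidate story boundary.
cos_boundaries = local_minima(cos_scores)
ce_boundaries = local_minima(ce_scores)
```

With these toy values both measures dip at the gap between the second and third block, where the dominant topic changes.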

Original language: English
Pages (from-to): 1109-1112
Number of pages: 4
Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
State: Published - 2011
Event: 12th Annual Conference of the International Speech Communication Association, INTERSPEECH 2011 - Florence, Italy
Duration: 27 Aug 2011 - 31 Aug 2011

Keywords

  • Cross entropy
  • Dynamic programming
  • Probabilistic latent semantic analysis
  • Spoken document retrieval
  • Story segmentation
