TY - JOUR
T1 - Modeling latent topics and temporal distance for story segmentation of broadcast news
AU - Chen, Hongjie
AU - Xie, Lei
AU - Leung, Cheung Chi
AU - Lu, Xiaoming
AU - Ma, Bin
AU - Li, Haizhou
N1 - Publisher Copyright:
© 2016 IEEE.
PY - 2017/1
Y1 - 2017/1
N2 - This paper studies a strategy for modeling the latent topics and temporal distance of text blocks for story segmentation, which we call graph regularization in topic modeling (GRTM). We propose two novel approaches that consider both the temporal distance and the lexical similarity of text blocks, collectively referred to as data proximity, when learning latent topic representations: a graph regularizer is used to derive a latent topic representation that preserves data proximity. In the first approach, we extend Laplacian probabilistic latent semantic analysis (LapPLSA) by introducing a distance penalty function into the affinity matrix of the graph used for latent topic estimation. The estimated latent topic distributions replace traditional term-frequency vectors as the data representation of the text blocks and are used to measure the cohesive strength between them. In the second approach, we apply Laplacian eigenmaps, which uses the graph regularizer for dimensionality reduction, to latent topic distributions estimated by conventional topic modeling. We conduct experiments on automatic speech recognition transcripts of the TDT2 English broadcast news corpus. The experiments show that the proposed strategy outperforms conventional techniques; LapPLSA performs best, with the highest F1-measure of 0.816. We also study the effects of the penalty constant in the distance penalty function, the number of latent topics, and the size of the training data on segmentation performance.
KW - Graph regularization
KW - Laplacian eigenmaps
KW - Laplacian probabilistic latent semantic analysis
KW - topic modeling
KW - topic segmentation
UR - http://www.scopus.com/inward/record.url?scp=85002616454&partnerID=8YFLogxK
U2 - 10.1109/TASLP.2016.2626965
DO - 10.1109/TASLP.2016.2626965
M3 - Article
AN - SCOPUS:85002616454
SN - 2329-9290
VL - 25
SP - 108
EP - 119
JO - IEEE/ACM Transactions on Audio, Speech, and Language Processing
JF - IEEE/ACM Transactions on Audio, Speech, and Language Processing
IS - 1
ER -
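
The abstract describes two algorithmic ingredients: an affinity matrix over text blocks whose lexical similarity is down-weighted by a temporal-distance penalty, and Laplacian eigenmaps applied to per-block latent topic distributions. The Python sketch below is a minimal illustration of those two ideas under stated assumptions, not the authors' implementation: it assumes cosine similarity for the lexical term and an exponential form for the distance penalty, and the names affinity_matrix, laplacian_eigenmaps, penalty, and n_components are all illustrative.

import numpy as np
from scipy.linalg import eigh

def affinity_matrix(features, positions, penalty=0.1):
    """Cosine similarity between block features, down-weighted by an
    exponential penalty on the temporal distance between blocks.
    The exponential form of the penalty is an assumption of this sketch."""
    normed = features / np.linalg.norm(features, axis=1, keepdims=True)
    cosine = normed @ normed.T                               # lexical similarity
    dist = np.abs(positions[:, None] - positions[None, :])   # temporal distance
    return cosine * np.exp(-penalty * dist)                  # penalized affinity

def laplacian_eigenmaps(W, n_components=2):
    """Embed nodes of the affinity graph W by solving the generalized
    eigenproblem L v = lambda D v with L = D - W, dropping the trivial
    constant eigenvector that belongs to eigenvalue zero."""
    D = np.diag(W.sum(axis=1))
    L = D - W
    _, vecs = eigh(L, D)                        # eigenvalues in ascending order
    return vecs[:, 1:n_components + 1]

# Toy usage: five text blocks at positions 0..4 with 3-topic distributions
# (random Dirichlet draws stand in for topic-model posteriors).
topics = np.random.dirichlet(np.ones(3), size=5)
W = affinity_matrix(topics, np.arange(5), penalty=0.1)
embedding = laplacian_eigenmaps(W, n_components=2)
print(embedding.shape)  # (5, 2)

In this sketch, blocks that are both lexically similar and temporally close end up near each other in the embedding, which is the property a segmentation step can exploit; the paper's actual penalty function, topic estimation, and boundary detection differ in detail.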