TY - JOUR
T1 - Modeling latent topics and temporal distance for story segmentation of broadcast news
AU - Chen, Hongjie
AU - Xie, Lei
AU - Leung, Cheung Chi
AU - Lu, Xiaoming
AU - Ma, Bin
AU - Li, Haizhou
N1 - Publisher Copyright:
© 2016 IEEE.
PY - 2017/1
Y1 - 2017/1
N2 - This paper studies a strategy for modeling the latent topics and temporal distance of text blocks for story segmentation, which we call graph regularization in topic modeling (GRTM). We propose two novel approaches that consider both the temporal distance and the lexical similarity of text blocks, collectively referred to as data proximity, when learning latent topic representations: a graph regularizer is used to derive a latent topic representation that preserves data proximity. In the first approach, we extend Laplacian probabilistic latent semantic analysis (LapPLSA) by introducing a distance penalty function into the affinity matrix of the graph used for latent topic estimation. The estimated latent topic distributions replace traditional term-frequency vectors as the data representation of the text blocks and are used to measure the cohesive strength between them. In the second approach, we apply Laplacian eigenmaps, which uses the graph regularizer for dimensionality reduction, to latent topic distributions estimated by conventional topic modeling. We conduct experiments on automatic speech recognition transcripts of the TDT2 English broadcast news corpus. The experiments show that the proposed strategy outperforms conventional techniques; LapPLSA performs best, with the highest F1-measure of 0.816. We also study the effects of the penalty constant in the distance penalty function, the number of latent topics, and the size of the training data on segmentation performance.
KW - Graph regularization
KW - Laplacian eigenmaps
KW - Laplacian probabilistic latent semantic analysis
KW - topic modeling
KW - topic segmentation
UR - http://www.scopus.com/inward/record.url?scp=85002616454&partnerID=8YFLogxK
U2 - 10.1109/TASLP.2016.2626965
DO - 10.1109/TASLP.2016.2626965
M3 - Article
AN - SCOPUS:85002616454
SN - 2329-9290
VL - 25
SP - 108
EP - 119
JO - IEEE/ACM Transactions on Audio, Speech, and Language Processing
JF - IEEE/ACM Transactions on Audio, Speech, and Language Processing
IS - 1
ER -
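
The abstract describes two algorithmic ingredients: an affinity matrix over text blocks whose lexical similarity is down-weighted by a temporal-distance penalty, and Laplacian eigenmaps applied to per-block latent topic distributions. The Python sketch below is a minimal illustration of those two ideas under stated assumptions, not the authors' implementation: it assumes cosine similarity for the lexical term and an exponential form for the distance penalty, and the names affinity_matrix, laplacian_eigenmaps, penalty, and n_components are all illustrative.

import numpy as np
from scipy.linalg import eigh

def affinity_matrix(features, positions, penalty=0.1):
    """Cosine similarity between block features, down-weighted by an
    exponential penalty on the temporal distance between blocks.
    The exponential form of the penalty is an assumption of this sketch."""
    normed = features / np.linalg.norm(features, axis=1, keepdims=True)
    cosine = normed @ normed.T                               # lexical similarity
    dist = np.abs(positions[:, None] - positions[None, :])   # temporal distance
    return cosine * np.exp(-penalty * dist)                  # penalized affinity

def laplacian_eigenmaps(W, n_components=2):
    """Embed nodes of the affinity graph W by solving the generalized
    eigenproblem L v = lambda D v with L = D - W, dropping the trivial
    constant eigenvector that belongs to eigenvalue zero."""
    D = np.diag(W.sum(axis=1))
    L = D - W
    _, vecs = eigh(L, D)                        # eigenvalues in ascending order
    return vecs[:, 1:n_components + 1]

# Toy usage: five text blocks at positions 0..4 with 3-topic distributions
# (random Dirichlet draws stand in for topic-model posteriors).
topics = np.random.dirichlet(np.ones(3), size=5)
W = affinity_matrix(topics, np.arange(5), penalty=0.1)
embedding = laplacian_eigenmaps(W, n_components=2)
print(embedding.shape)  # (5, 2)

In this sketch, blocks that are both lexically similar and temporally close end up near each other in the embedding, which is the property a segmentation step can exploit; the paper's actual penalty function, topic estimation, and boundary detection differ in detail.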