TY - JOUR
T1 - Unsupervised measure of Chinese lexical semantic similarity using correlated graph model for news story segmentation
AU - Feng, Wei
AU - Nie, Xuecheng
AU - Zhang, Yujun
AU - Xie, Lei
AU - Dang, Jianwu
N1 - Publisher Copyright: © 2018 Elsevier B.V.
PY - 2018/11/27
Y1 - 2018/11/27
N2 - This paper presents a simple yet effective unsupervised approach to measuring Chinese lexical semantic similarity and demonstrates its promising performance in automatic story segmentation of Mandarin broadcast news. Our approach centers on the unsupervised correlated affinity graph (UCAG) model, which is initialized as a hybrid sparse graph encoding both explicit word-to-word contextual correlations and latent word-to-character correlations within the given corpus. The UCAG model then diffuses the initial sparse correlations throughout the graph by parallel affinity propagation. This yields a dense, reliable, and corpus-specific lexical semantic similarity measure derived purely from unlabeled data. We then generalize the classical cosine similarity metric to take soft similarities into account for story segmentation. Extensive experiments on benchmark datasets validate the superiority of the proposed similarity measure over previous measures. Specifically, our similarity measure improves the F1-score of state-of-the-art normalized cuts (NCuts) based story segmentation by a relative 7.7% on average on two holistic benchmark Mandarin broadcast news corpora, TDT2 and CCTV, and by a relative 10.8% on the detailed broadcast news subsets.
KW - Common character correlation
KW - Contextual correlation
KW - Generalized cosine similarity
KW - Parallel affinity propagation
KW - Story segmentation
KW - Unsupervised correlated affinity graph (UCAG) model
UR - http://www.scopus.com/inward/record.url?scp=85053163510&partnerID=8YFLogxK
U2 - 10.1016/j.neucom.2018.08.061
DO - 10.1016/j.neucom.2018.08.061
M3 - Article
AN - SCOPUS:85053163510
SN - 0925-2312
VL - 318
SP - 236
EP - 247
JO - Neurocomputing
JF - Neurocomputing
ER -