TY - GEN
T1 - Measuring semantic similarity by contextual word connections in Chinese news story segmentation
AU - Nie, Xuecheng
AU - Feng, Wei
AU - Wan, Liang
AU - Xie, Lei
PY - 2013/10/18
Y1 - 2013/10/18
AB - A lot of recent work in story segmentation focuses on developing better partitioning criteria to segment news transcripts into sequences of topically coherent stories, while simply relying on repetition-based hard word-level similarities and ignoring the semantic correlations between different words. In this paper, we propose a purely data-driven approach to measuring soft semantic word- and sentence-level similarity from a given corpus, without the guidance of linguistic knowledge, ground-truth topic labeling or story boundaries. We show that contextual word connections can help to produce semantically meaningful similarity measurements between any pair of Chinese words. Based on this, we further use a parallel all-pair SimRank algorithm to propagate such contextual similarities throughout the whole vocabulary. The resultant word semantic similarity matrix is then used to refine the classical cosine similarity measurement of sentences. Experiments on benchmark Chinese news corpora show that story segmentation using the proposed soft semantic similarity measurement can always produce better segmentation accuracy than using the hard similarity. Specifically, we can achieve 3%-10% average F1-measure improvement over state-of-the-art NCuts-based story segmentation.
KW - contextual word connections
KW - semantic similarity
KW - similarity propagation
KW - story segmentation
UR - http://www.scopus.com/inward/record.url?scp=84890497890&partnerID=8YFLogxK
U2 - 10.1109/ICASSP.2013.6639286
DO - 10.1109/ICASSP.2013.6639286
M3 - Conference contribution
AN - SCOPUS:84890497890
SN - 9781479903566
T3 - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
SP - 8312
EP - 8316
BT - 2013 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2013 - Proceedings
T2 - 2013 38th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2013
Y2 - 26 May 2013 through 31 May 2013
ER -