TY - JOUR
T1 - Multidocument Arabic text summarization based on clustering and word2vec to reduce redundancy
AU - Abdulateef, Samer
AU - Khan, Naseer Ahmed
AU - Chen, Bolin
AU - Shang, Xuequn
N1 - Publisher Copyright:
© 2020 by the authors.
PY - 2020/2/1
Y1 - 2020/2/1
N2 - Arabic is one of the most semantically and syntactically complex languages in the world. Text summarization is a key challenge in text mining, so we propose an unsupervised score-based method that combines the vector space model, the continuous bag-of-words (CBOW) model, clustering, and a statistical method. The main problems in multidocument text summarization are noisy data, redundancy, diminished readability, and sentence incoherence. In this study, we adopt a preprocessing strategy to address the noise problem and use the word2vec model for two purposes: first, to map words to fixed-length vectors and, second, to capture the semantic relationships between these vectors based on their dimensions. Similarly, we use the k-means algorithm for two purposes: (1) selecting the distinctive documents and tokenizing them into sentences, and (2) running a second iteration of k-means to select the key sentences based on a similarity metric, which overcomes the redundancy problem and generates the initial summary. Lastly, we use weighted principal component analysis (W-PCA) to map the sentences' encoded weights based on a list of features and select the highest-weighted, most important sentences, addressing the incoherence and readability problems. We adopted Recall-Oriented Understudy for Gisting Evaluation (ROUGE) as the evaluation measure to assess our proposed technique and compare it with state-of-the-art methods. Finally, an experiment on the Essex Arabic Summaries Corpus (EASC) using the ROUGE-1 and ROUGE-2 metrics showed promising results compared with existing methods.
KW - Arabic text summarization
KW - Multidocument text summarization
KW - Text clustering
KW - Word2vec
UR - http://www.scopus.com/inward/record.url?scp=85081131087&partnerID=8YFLogxK
U2 - 10.3390/info11020059
DO - 10.3390/info11020059
M3 - Article
AN - SCOPUS:85081131087
SN - 2078-2489
VL - 11
JO - Information (Switzerland)
JF - Information (Switzerland)
IS - 2
M1 - 59
ER -