TY - JOUR
T1 - Sensitive information recognition based on short text sentiment analysis
AU - Li, Yang
AU - Pan, Quan
AU - Yang, Tao
N1 - Publisher Copyright:
© 2016, Editorial Office of Journal of Xi'an Jiaotong University. All rights reserved.
PY - 2016/9/10
Y1 - 2016/9/10
N2 - Existing sensitive information recognition relies on sensitive keyword matching, so its accuracy is low and its miss rate is high. We present a collaborative method that uses both sensitive keywords and sentiment polarities to identify sensitive information. On a real dataset, we used a supervised method to measure the sentiment polarity of each blog and divided the blogs into two categories: those with positive and those with negative sentiment polarity. A total of 2,639 sensitive keywords in five categories (pornography, violence, illegality, cult and reactionary) were defined, and the Zipf distribution of these words in the dataset showed that blogs with negative sentiment polarity were highly sensitive. We then studied the contribution of the sensitive keywords to the sentiment polarity and constructed a sensitivity degree model that includes a sentiment polarity factor. Based on this model, we propose a new way to identify sensitive information, which improves the accuracy from 31.25% to 58.75% and the miss rate from 95% to 96%, and raises the F-measure from 47.0% to 72.3%.
AB - Existing sensitive information recognition relies on sensitive keyword matching, so its accuracy is low and its miss rate is high. We present a collaborative method that uses both sensitive keywords and sentiment polarities to identify sensitive information. On a real dataset, we used a supervised method to measure the sentiment polarity of each blog and divided the blogs into two categories: those with positive and those with negative sentiment polarity. A total of 2,639 sensitive keywords in five categories (pornography, violence, illegality, cult and reactionary) were defined, and the Zipf distribution of these words in the dataset showed that blogs with negative sentiment polarity were highly sensitive. We then studied the contribution of the sensitive keywords to the sentiment polarity and constructed a sensitivity degree model that includes a sentiment polarity factor. Based on this model, we propose a new way to identify sensitive information, which improves the accuracy from 31.25% to 58.75% and the miss rate from 95% to 96%, and raises the F-measure from 47.0% to 72.3%.
KW - Sensitive information
KW - Sentiment analysis
KW - Social networks
UR - http://www.scopus.com/inward/record.url?scp=84987881627&partnerID=8YFLogxK
U2 - 10.7652/xjtuxb201609013
DO - 10.7652/xjtuxb201609013
M3 - Article
AN - SCOPUS:84987881627
SN - 0253-987X
VL - 50
SP - 80
EP - 84
JO - Hsi-An Chiao Tung Ta Hsueh/Journal of Xi'an Jiaotong University
JF - Hsi-An Chiao Tung Ta Hsueh/Journal of Xi'an Jiaotong University
IS - 9
ER -