Cross-language web page classification via dual knowledge transfer using Nonnegative Matrix Tri-factorization

Hua Wang, Heng Huang, Feiping Nie, Chris Ding

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

57 Citations (Scopus)

Abstract

The lack of sufficient labeled Web pages in many languages, especially less commonly used ones, makes it difficult for traditional supervised classification methods to achieve satisfactory Web page classification performance. To address this, we propose a novel Nonnegative Matrix Tri-factorization (NMTF) based Dual Knowledge Transfer (DKT) approach for cross-language Web page classification, which is based on two important observations. First, Web pages on the same topic in different languages usually share some common semantic patterns, though in different representation forms. Second, the associations between word clusters and Web page classes are a more reliable carrier than raw words for transferring knowledge across languages. Based on these observations, we transfer knowledge from the auxiliary language, in which abundant labeled Web pages are available, to target languages, in which we want to classify Web pages, through two different paths: word cluster approximations and the associations between word clusters and Web page classes. Because these two knowledge transfer paths reinforce each other, our approach can achieve better classification accuracy. We evaluate the proposed approach in extensive experiments using a real-world cross-language Web page data set. Promising results demonstrate the effectiveness of our approach and are consistent with our theoretical analyses.
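The abstract does not state the factorization or its update rules, so the following is only a minimal sketch of generic NMTF, not the DKT method itself. It assumes a word-by-page matrix X approximated as F S Gᵀ, where F is read as word-cluster memberships, G as page-cluster memberships, and S as the associations between the two; these roles, the standard multiplicative updates, the toy matrix, and all function names (`nmtf`, `matmul`, `mult_update`, `frob_error`) are illustrative assumptions.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(p * q for p, q in zip(row, col)) for col in cols_b] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def mult_update(m, numer, denom, eps=1e-9):
    """Elementwise m * numer / denom; nonnegative inputs stay nonnegative."""
    return [[v * n / (d + eps) for v, n, d in zip(rv, rn, rd)]
            for rv, rn, rd in zip(m, numer, denom)]

def frob_error(x, f, s, g):
    """Squared Frobenius norm of X - F S G^T."""
    approx = matmul(matmul(f, s), transpose(g))
    return sum((xv - av) ** 2 for rx, ra in zip(x, approx) for xv, av in zip(rx, ra))

def rand_matrix(rng, rows, cols):
    return [[rng.random() + 0.1 for _ in range(cols)] for _ in range(rows)]

def nmtf(x, k_words, k_pages, iters=200, seed=0):
    """Generic NMTF by multiplicative updates (hypothetical helper, not DKT)."""
    rng = random.Random(seed)
    n_words, n_pages = len(x), len(x[0])
    f = rand_matrix(rng, n_words, k_words)   # word-cluster memberships
    s = rand_matrix(rng, k_words, k_pages)   # cluster-to-cluster associations
    g = rand_matrix(rng, n_pages, k_pages)   # page-cluster memberships
    errors = [frob_error(x, f, s, g)]
    for _ in range(iters):
        # F <- F * (X G S^T) / (F S G^T G S^T)
        gs_t = matmul(g, transpose(s))
        f = mult_update(f, matmul(x, gs_t),
                        matmul(f, matmul(transpose(gs_t), gs_t)))
        # G <- G * (X^T F S) / (G S^T F^T F S)
        fs = matmul(f, s)
        g = mult_update(g, matmul(transpose(x), fs),
                        matmul(g, matmul(transpose(fs), fs)))
        # S <- S * (F^T X G) / (F^T F S G^T G)
        ft = transpose(f)
        numer = matmul(matmul(ft, x), g)
        denom = matmul(matmul(matmul(ft, f), s), matmul(transpose(g), g))
        s = mult_update(s, numer, denom)
        errors.append(frob_error(x, f, s, g))
    return f, s, g, errors

# Toy word-by-page counts with two obvious blocks (made-up data).
x = [[5, 4, 0, 0],
     [4, 5, 0, 0],
     [0, 0, 3, 4],
     [0, 0, 4, 3]]
f, s, g, errors = nmtf(x, k_words=2, k_pages=2)
```

In a transfer setting along the lines the abstract describes, one would presumably fit such a model on the auxiliary language and carry information about F and S over to the target language; how the paper couples the two factorizations is not given here, so this sketch stops at the single-language case.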

Original language: English
Title of host publication: SIGIR'11 - Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval
Publisher: Association for Computing Machinery
Pages: 933-942
Number of pages: 10
ISBN (Print): 9781450309349
Publication status: Published - 2011
Externally published
Event: 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2011 - Beijing, China
Duration: 24 Jul 2011 to 28 Jul 2011

Publication series

Name: SIGIR'11 - Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval

Conference

Conference: 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2011
Country/Territory: China
City: Beijing
Period: 24/07/11 to 28/07/11
