Cross-language web page classification via dual knowledge transfer using Nonnegative Matrix Tri-factorization

Hua Wang, Heng Huang, Feiping Nie, Chris Ding

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

57 Scopus citations

Abstract

The lack of sufficient labeled Web pages in many languages, especially less commonly used ones, makes it difficult for traditional supervised classification methods to achieve satisfactory Web page classification performance. To address this, we propose a novel Nonnegative Matrix Tri-factorization (NMTF) based Dual Knowledge Transfer (DKT) approach for cross-language Web page classification, which is based on two important observations. First, Web pages on the same topic in different languages usually share some common semantic patterns, though in different representation forms. Second, the associations between word clusters and Web page classes are a more reliable carrier than raw words for transferring knowledge across languages. Based on these observations, we transfer knowledge from the auxiliary language, in which abundant labeled Web pages are available, to the target languages, in which we want to classify Web pages, through two different paths: word cluster approximations and the associations between word clusters and Web page classes. Because these two knowledge transfer paths reinforce each other, our approach can achieve better classification accuracy. We evaluate the proposed approach in extensive experiments using a real-world cross-language Web page data set. The promising results demonstrate the effectiveness of our approach and are consistent with our theoretical analyses.
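As a rough illustration of the factorization the abstract names, the sketch below runs a generic nonnegative matrix tri-factorization X ≈ F S Gᵀ with standard multiplicative updates. It is not the authors' DKT method, whose transfer terms are not described in this record. Reading X as a word-by-page matrix, F as word-cluster memberships, G as page-class memberships, and S as the word-cluster/page-class association matrix is an assumption made for illustration, and all function names (`nmtf`, `frob_error`, and the helpers) are hypothetical.

```python
import random

def matmul(A, B):
    # Plain-Python matrix product of two lists of lists.
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def mult_update(M, numer, denom, eps=1e-9):
    # Element-wise multiplicative update; keeps entries nonnegative.
    return [[m * n / (d + eps) for m, n, d in zip(mr, nr, dr)]
            for mr, nr, dr in zip(M, numer, denom)]

def frob_error(X, F, S, G):
    # Squared Frobenius norm of X - F S G^T.
    R = matmul(matmul(F, S), transpose(G))
    return sum((x - r) ** 2 for xr, rr in zip(X, R) for x, r in zip(xr, rr))

def nmtf(X, k_words, k_docs, iters=200, seed=0):
    """Generic NMTF (assumed setup): X is n x m, F is n x k_words,
    S is k_words x k_docs, G is m x k_docs, all nonnegative."""
    rng = random.Random(seed)
    n, m = len(X), len(X[0])

    def rand(r, c):
        return [[rng.random() + 0.1 for _ in range(c)] for _ in range(r)]

    F, S, G = rand(n, k_words), rand(k_words, k_docs), rand(m, k_docs)
    Xt = transpose(X)
    for _ in range(iters):
        # F <- F * (X G S^T) / (F S G^T G S^T)
        SGt = matmul(S, transpose(G))
        F = mult_update(F, matmul(X, transpose(SGt)),
                        matmul(F, matmul(SGt, transpose(SGt))))
        # G <- G * (X^T F S) / (G S^T F^T F S)
        FS = matmul(F, S)
        G = mult_update(G, matmul(Xt, FS),
                        matmul(G, matmul(transpose(FS), FS)))
        # S <- S * (F^T X G) / (F^T F S G^T G)
        Ft = transpose(F)
        S = mult_update(S, matmul(matmul(Ft, X), G),
                        matmul(matmul(matmul(Ft, F), S), matmul(transpose(G), G)))
    return F, S, G

# Toy word-by-page counts with two obvious blocks (made-up data).
X = [[5, 4, 0, 0],
     [4, 5, 0, 0],
     [0, 0, 3, 4],
     [0, 0, 4, 3]]
F, S, G = nmtf(X, k_words=2, k_docs=2)
print("reconstruction error:", frob_error(X, F, S, G))
```

In a transfer setting of the kind the abstract describes, one would presumably fit such a factorization on the auxiliary language and reuse or constrain S when factorizing the target language; that step is omitted here because the record does not specify it.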

Original language: English
Title of host publication: SIGIR'11 - Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval
Publisher: Association for Computing Machinery
Pages: 933-942
Number of pages: 10
ISBN (Print): 9781450309349
State: Published - 2011
Externally published: Yes
Event: 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2011 - Beijing, China
Duration: 24 Jul 2011 to 28 Jul 2011

Publication series

Name: SIGIR'11 - Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval

Conference

Conference: 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2011
Country/Territory: China
City: Beijing
Period: 24/07/11 to 28/07/11

Keywords

  • Cross-language classification
  • Knowledge transfer
  • Nonnegative matrix factorization
