Max margin learning on domain-independent web information extraction

Bin Zhao, Xiaoxin Yin, Eric P. Xing

Research output: Chapter in Book/Report/Conference proceedingConference contributionpeer-review

2 Scopus citations

Abstract

Domain-independent web information extraction can be addressed as a structured prediction problem where we learn a mapping function from an input web page to the structured and interdependent output variables, labeling each block on the page. In this paper, built upon an HTML parser of Internet Explorer that parses and renders a web page based on HTML tags and visual appearance, we propose a max margin learning approach for web information extraction. Specifically, the output of the parser is a vision tree, which is similar to a DOM tree but with visual information, i.e., how each node is displayed. Based on this hierarchical structure, we develop a max margin learning method for labeling each of its nodes. Due to the rich connections between blocks on the web page, we further introduce edges that connect spatially adjacent nodes on the vision tree, complicating the problem into a cyclic graph labeling task. A max margin learning method on cyclic graphs is developed for this problem, where loopy belief propagation is used for approximate inference. Experimental results on web data extraction show the feasibility and promise of our approach.

Original languageEnglish
Title of host publicationCIKM'11 - Proceedings of the 2011 ACM International Conference on Information and Knowledge Management
Pages1305-1310
Number of pages6
DOIs
StatePublished - 2011
Externally publishedYes
Event20th ACM Conference on Information and Knowledge Management, CIKM'11 - Glasgow, United Kingdom
Duration: 24 Oct 201128 Oct 2011

Publication series

NameInternational Conference on Information and Knowledge Management, Proceedings

Conference

Conference20th ACM Conference on Information and Knowledge Management, CIKM'11
Country/TerritoryUnited Kingdom
CityGlasgow
Period24/10/1128/10/11

Keywords

  • belief propagation
  • cutting plane
  • max margin learning
  • web information extraction

Fingerprint

Dive into the research topics of 'Max margin learning on domain-independent web information extraction'. Together they form a unique fingerprint.

Cite this