Classifying web pages using information extraction patterns - Preliminary results and findings

Lay Ki Soon, Sang Ho Lee

Research output: Chapter in Book/Report/Conference proceeding, Conference Paper, Research, peer-review

Abstract

Web page classification plays an essential role in making information retrieval and information processing more efficient. Conventionally, web text documents are represented by a term frequency matrix for classification purposes. However, considering the limitations of representing documents using terms or keywords, we propose to represent each web page using the information extraction patterns identified within it. In this paper, we present the results and findings of our preliminary experiments. Our experimental results indicate that the occurrence of a word in different contexts has a different impact on the classification task. Thus, the extraction patterns used to represent each document are more semantically meaningful and give better insight into web classification than keywords do.
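
The contrast between keyword features and extraction-pattern features can be illustrated with information gain, the measure named in the keywords below. The following is a minimal sketch under stated assumptions, not the authors' implementation: the four toy pages, the class labels, the word "professor", and the pattern "taught by <person>" are all invented for illustration and do not come from the paper.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """Reduction in label entropy obtained by splitting on a feature."""
    total = len(labels)
    remainder = 0.0
    for value in set(feature_values):
        subset = [lab for f, lab in zip(feature_values, labels) if f == value]
        remainder += (len(subset) / total) * entropy(subset)
    return entropy(labels) - remainder


# Hypothetical toy data: two "course" pages and two "faculty" pages.
labels = ["course", "course", "faculty", "faculty"]

# Keyword feature: the word "professor" occurs on every page,
# so its presence alone cannot separate the classes.
has_word_professor = [1, 1, 1, 1]

# Pattern feature: the same word inside the (invented) context
# "taught by <person>" occurs only on the course pages.
has_pattern_taught_by = [1, 1, 0, 0]

keyword_gain = information_gain(has_word_professor, labels)
pattern_gain = information_gain(has_pattern_taught_by, labels)
print(keyword_gain, pattern_gain)
```

In this contrived case the keyword feature yields no information gain while the pattern feature separates the two classes completely, which is the kind of difference a decision tree learner would exploit when choosing split attributes. Real pages would of course give less clear-cut values.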

Original language: English
Title of host publication: Proceedings of the 6th International Conference on Signal Image Technology and Internet Based Systems, SITIS 2010
Pages: 195-202
Number of pages: 8
Publication status: Published - 2010
Externally published: Yes
Event: International Conference on Signal Image Technology & Internet Based Systems 2010 - Kuala Lumpur, Malaysia
Duration: 15 Dec 2010 to 18 Dec 2010
Conference number: 6th
https://ieeexplore.ieee.org/xpl/conhome/5714190/proceeding (Proceedings)

Conference

Conference: International Conference on Signal Image Technology & Internet Based Systems 2010
Abbreviated title: SITIS 2010
Country/Territory: Malaysia
City: Kuala Lumpur
Period: 15/12/10 to 18/12/10

Keywords

  • Decision tree
  • Information extraction
  • Information gain
  • Web classification
  • Web mining
