Classifying web pages using information extraction patterns - Preliminary results and findings

Lay Ki Soon, Sang Ho Lee

Research output: Chapter in Book/Report/Conference proceeding, Conference Paper, Research, peer-review


Web page classification plays an essential role in making information retrieval and information processing more efficient. Conventionally, web text documents are represented by a term frequency matrix for classification purposes. However, considering the limitations of representing documents using terms or keywords, we propose to represent each web page by the information extraction patterns identified within it. In this paper, we present the results and findings of our preliminary experiments. Our experimental results indicate that the occurrence of a word in different contexts has a different impact on the classification task. Thus, the extraction patterns used to represent each document are more semantically meaningful and give better insight into web classification than keywords do.
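As a rough illustration of the claim that a word's context matters, the sketch below computes information gain (one of the listed keywords) for a bare keyword feature and for a context-bearing pattern feature on a four-document toy corpus. Everything in it is an assumption made for illustration: the documents, the class labels, the pattern notation such as `teaches <course>`, and the helper names `entropy` and `information_gain` are hypothetical and are not taken from the paper, whose actual pattern format and experimental setup are not given in this record.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(docs, labels, feature):
    """Information gain of a binary feature (present / absent in a document)."""
    present = [y for d, y in zip(docs, labels) if feature in d]
    absent = [y for d, y in zip(docs, labels) if feature not in d]
    total = len(labels)
    # Weighted entropy of the two partitions; empty partitions contribute nothing.
    remainder = sum(
        len(part) / total * entropy(part) for part in (present, absent) if part
    )
    return entropy(labels) - remainder


# Hypothetical toy corpus: two course pages and two faculty pages.
labels = ["course", "course", "faculty", "faculty"]

# Keyword representation: the word "course" occurs on every page.
keyword_docs = [
    {"course", "syllabus"},
    {"course", "exam"},
    {"course", "professor"},
    {"course", "research"},
]

# Pattern representation (hypothetical notation): the same word in its context.
pattern_docs = [
    {"<course> offered by"},
    {"<course> offered by"},
    {"teaches <course>"},
    {"teaches <course>"},
]

keyword_gain = information_gain(keyword_docs, labels, "course")
pattern_gain = information_gain(pattern_docs, labels, "teaches <course>")

print("gain of keyword 'course':", keyword_gain)
print("gain of pattern 'teaches <course>':", pattern_gain)
```

In this toy setting the keyword appears in every document and therefore carries no information about the class, while the pattern separates the two classes completely. Real corpora will be far less clean, so this shows only the intuition and does not reproduce the reported results.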

Original language: English
Title of host publication: Proceedings of the 6th International Conference on Signal Image Technology and Internet Based Systems, SITIS 2010
Number of pages: 8
Publication status: Published - 2010
Externally published: Yes
Event: International Conference on Signal Image Technology & Internet Based Systems 2010 - Kuala Lumpur, Malaysia
Duration: 15 Dec 2010 to 18 Dec 2010
Conference number: 6th (Proceedings)


Conference: International Conference on Signal Image Technology & Internet Based Systems 2010
Abbreviated title: SITIS 2010
City: Kuala Lumpur


  • Decision tree
  • Information extraction
  • Information gain
  • Web classification
  • Web mining
