Joint-feature (JFEAT) web page classification

Lim Wern Han, Saadat M. Alhashmi

Research output: Chapter in Book/Report/Conference proceeding > Conference Paper > Research > peer-review

Abstract

With the growing number of web pages on the internet, obtaining information accurately, at reasonable cost and with feasible performance, has become a major concern. Web page classification is a potential solution. Effective classification of web pages benefits applications such as web mining and search engines. Unlike plain-text documents, web pages have characteristics that limit the performance of otherwise successful pure-text classification methods. Noise in the form of HTML tags, multimedia content, dynamic content and the network structure of web pages calls for a closer look at effective feature selection for web pages. Often, these features are filtered out, and classification relies only on the displayed text of the web page. In this paper, by contrast, web page features are taken into account during classification because of the potentially valuable information they may hold. To that end, the paper explores the potential of the Uniform Resource Locator (URL), the web page title and the metadata for classification into categories defined by users. The framework applies suitable machine learning algorithms to classify each of these features individually; the individual results then vote jointly, by weight, towards the final classification of the web page. This approach showed improvements over pure-text as well as virtual-webpage classification approaches.
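The weighted joint vote described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not name the classifiers or the weighting scheme, so the function name `weighted_vote`, the feature names, the class labels and the weights below are all hypothetical. The sketch assumes each per-feature classifier has already produced one class label.

```python
from collections import defaultdict


def weighted_vote(predictions, weights):
    """Combine per-feature class predictions by weighted voting.

    predictions: dict mapping a feature name (e.g. "url") to the class
                 label predicted from that feature alone.
    weights:     dict mapping a feature name to its non-negative vote weight.
                 Features without a weight contribute nothing.
    """
    if not predictions:
        raise ValueError("at least one per-feature prediction is required")

    # Accumulate the weight of every feature that voted for each label.
    scores = defaultdict(float)
    for feature, label in predictions.items():
        scores[label] += weights.get(feature, 0.0)

    # Iterate labels in sorted order so ties resolve deterministically
    # (the alphabetically first label wins a tie).
    return max(sorted(scores), key=lambda label: scores[label])


# Hypothetical outputs of three per-feature classifiers for one page.
predictions = {"url": "sports", "title": "news", "metadata": "news"}
# Hypothetical weights, e.g. derived from each classifier's validation accuracy.
weights = {"url": 0.5, "title": 0.3, "metadata": 0.3}

print(weighted_vote(predictions, weights))
```

With these illustrative numbers, the title and metadata votes together (0.6) outweigh the URL vote (0.5). Raising the URL weight above 0.6 would flip the result, which is the sense in which a strong single feature can override a weaker majority.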

Original language: English
Title of host publication: Business Transformation through Innovation and Knowledge Management
Subtitle of host publication: An Academic Perspective - Proceedings of the 14th International Business Information Management Association Conference, IBIMA 2010
Publisher: IBIMA Publishing
Pages: 819-828
Number of pages: 10
ISBN (Print): 9780982148938
Publication status: Published - 2010
Event: International Business Information Management 2010 - Istanbul, Türkiye
Duration: 23 Jun 2010 to 24 Jun 2010
Conference number: 14th
https://ibima.org/conference/14th-ibima-conference/

Publication series

Name: Business Transformation through Innovation and Knowledge Management: An Academic Perspective - Proceedings of the 14th International Business Information Management Association Conference, IBIMA 2010
Volume: 2

Conference

Conference: International Business Information Management 2010
Abbreviated title: IBIMA 2010
Country/Territory: Türkiye
City: Istanbul
Period: 23/06/10 to 24/06/10
Other: IBIMA held two conferences in 2010 with the same name.

Keywords

  • Feature selection
  • Machine learning
  • Page classification
