Joint-feature (JFEAT) web page classification

Lim Wern Han, Saadat M. Alhashmi

    Research output: Chapter in Book/Report/Conference proceeding › Conference Paper › Research › peer-review

    Abstract

    With the growing number of web pages on the internet, obtaining information accurately, at reasonable cost and with feasible performance, has become a major concern. Web page classification is a potential solution. Effective classification of web pages benefits applications such as web mining and search engines. Unlike plain text documents, web pages have characteristics that limit the performance of otherwise successful traditional pure-text classification methods. Noise in the form of HTML tags, multimedia content, dynamic content and the network structure of web pages calls for a closer look at effective feature selection for web pages. Often, these features are filtered out and classification relies only on the displayed text of the web page. In this paper, web page features are instead taken into account during classification because of the potentially valuable information they may hold. Specifically, this paper explores the potential of the Uniform Resource Locator (URL), the web page title and the metadata for classification into categories defined by users. The framework applies suitable machine learning algorithms to classify each of these web features individually; the individual classifications then jointly vote, by weight, towards the final classification of the web page. This approach showed improvements over pure-text as well as virtual-webpage classification approaches.
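
    The weighted joint vote described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name weighted_vote, the feature names, the category labels and the weight values are all hypothetical, and the abstract does not state how the weights are derived. The sketch assumes each feature (URL, title, metadata) already has its own trained classifier that outputs one category label per page.

```python
from collections import defaultdict


def weighted_vote(predictions, weights):
    """Combine per-feature category predictions by weighted vote.

    predictions: dict mapping feature name -> predicted category label
    weights:     dict mapping feature name -> non-negative vote weight
    Feature names and weights here are illustrative assumptions only.
    """
    scores = defaultdict(float)
    for feature, label in predictions.items():
        # A feature with no configured weight contributes nothing.
        scores[label] += weights.get(feature, 0.0)
    if not scores:
        raise ValueError("no predictions to combine")
    # Iterate labels in sorted order so ties resolve alphabetically.
    return max(sorted(scores), key=scores.__getitem__)


# Hypothetical outputs of three separately trained classifiers for one page.
predictions = {"url": "sports", "title": "news", "metadata": "sports"}
# Hypothetical weights, for example each classifier's validation accuracy.
weights = {"url": 0.6, "title": 0.9, "metadata": 0.5}

result = weighted_vote(predictions, weights)
print(result)
```

    In this example the URL and metadata classifiers together outweigh the title classifier, so the page is assigned to the category they agree on even though the title classifier carries the largest single weight.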

    Original language: English
    Title of host publication: Business Transformation through Innovation and Knowledge Management
    Subtitle of host publication: An Academic Perspective - Proceedings of the 14th International Business Information Management Association Conference, IBIMA 2010
    Publisher: International Business Information Management Association (IBIMA)
    Pages: 819-828
    Number of pages: 10
    ISBN (Print): 9780982148938
    Publication status: Published - 2010
    Event: International Business Information Management 2010 - Istanbul, Turkey
    Duration: 23 Jun 2010 → 24 Jun 2010
    Conference number: 14th
    https://ibima.org/conference/14th-ibima-conference/

    Publication series

    Name: Business Transformation through Innovation and Knowledge Management: An Academic Perspective - Proceedings of the 14th International Business Information Management Association Conference, IBIMA 2010
    Volume: 2

    Conference

    Conference: International Business Information Management 2010
    Abbreviated title: IBIMA 2010
    Country: Turkey
    City: Istanbul
    Period: 23/06/10 → 24/06/10
    Other: IBIMA held two conferences in 2010 with the same name.

    Keywords

    • Feature selection
    • Machine learning
    • Page classification
