Abstract
Web page classification plays an essential role in making information retrieval and information processing more efficient. Conventionally, web text documents are represented by a term-frequency matrix for classification purposes. However, given the limitations of representing documents by terms or keywords, we propose to represent each web page by the information extraction patterns identified within it. In this paper, we present the results and findings of our preliminary experiments. Our experimental results indicate that a word occurring in different contexts has a different impact on the classification task. Thus, the extraction patterns used to represent each document are more semantically meaningful and give better insight into web classification than keywords.
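
The claim that the same word can matter differently depending on its context can be illustrated with information gain, one of the keywords listed for this paper. The following is a minimal sketch under stated assumptions, not the authors' implementation: the four-document corpus, the class labels, and the feature names (`kw:professor`, `pat:professor_of_<dept>`, `pat:taught_by_professor`) are all hypothetical, since the abstract does not specify the pattern format or the data used.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(docs, feature):
    """Information gain of a binary feature (present/absent) over labeled docs."""
    labels = [label for _, label in docs]
    gain = entropy(labels)
    for present in (True, False):
        subset = [label for feats, label in docs if (feature in feats) == present]
        if subset:
            gain -= len(subset) / len(docs) * entropy(subset)
    return gain


# Hypothetical toy corpus: each document is (set of features, class label).
# "kw:" marks a plain keyword, "pat:" marks an invented extraction pattern.
docs = [
    ({"kw:professor", "pat:professor_of_<dept>"}, "faculty"),
    ({"kw:professor", "pat:professor_of_<dept>"}, "faculty"),
    ({"kw:professor", "pat:taught_by_professor"}, "course"),
    ({"kw:professor", "pat:taught_by_professor"}, "course"),
]

# The keyword occurs in every document, so it cannot separate the classes.
keyword_gain = information_gain(docs, "kw:professor")
# The pattern occurs only in one class, so it separates them completely.
pattern_gain = information_gain(docs, "pat:professor_of_<dept>")

print(f"keyword gain: {keyword_gain:.2f} bits")
print(f"pattern gain: {pattern_gain:.2f} bits")
```

In this toy setting the keyword carries no information about the class while the context-bearing pattern carries one full bit. Real corpora would fall somewhere in between, and the features a decision tree selects would depend on the actual data.
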
Original language | English |
---|---|
Title of host publication | Proceedings of the 6th International Conference on Signal Image Technology and Internet Based Systems, SITIS 2010 |
Pages | 195-202 |
Number of pages | 8 |
Publication status | Published - 2010 |
Externally published | Yes |
Event | International Conference on Signal Image Technology & Internet Based Systems 2010 - Kuala Lumpur, Malaysia; Duration: 15 Dec 2010 → 18 Dec 2010; Conference number: 6th; https://ieeexplore.ieee.org/xpl/conhome/5714190/proceeding (Proceedings) |
Conference
Conference | International Conference on Signal Image Technology & Internet Based Systems 2010 |
---|---|
Abbreviated title | SITIS 2010 |
Country/Territory | Malaysia |
City | Kuala Lumpur |
Period | 15/12/10 → 18/12/10 |
Keywords
- Decision tree
- Information extraction
- Information gain
- Web classification
- Web mining