Abstract
This paper presents a method that supplements standard URL normalization with Web page metadata in order to identify equivalent URLs. The metadata considered are the page size and the body text of each Web page. Both can be obtained during HTML parsing while crawling, without incurring additional cost. In our experiment, using the body text of Web pages identified equivalent URLs with an accuracy of up to 95.38%.
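The idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the normalization steps are only a small subset of common syntactic rules, and the names `PageMeta`, `normalize_url`, `body_fingerprint`, and `likely_equivalent`, as well as the exact-match comparison of size and body text, are assumptions made for illustration.

```python
import hashlib
from dataclasses import dataclass
from urllib.parse import urlsplit, urlunsplit

# Default ports that can be dropped during normalization.
DEFAULT_PORTS = {"http": 80, "https": 443}


@dataclass
class PageMeta:
    """Hypothetical record of what a crawler already has after HTML parsing."""
    url: str
    size: int        # page size in bytes
    body_text: str   # text extracted from the <body> element


def normalize_url(url: str) -> str:
    """Apply a few common syntactic normalizations (illustrative subset)."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()  # note: any userinfo is dropped
    port = parts.port
    # Keep the port only when it differs from the scheme's default.
    if port is None or DEFAULT_PORTS.get(scheme) == port:
        netloc = host
    else:
        netloc = f"{host}:{port}"
    path = parts.path or "/"               # an empty path becomes "/"
    # The fragment is never sent to the server, so it is removed.
    return urlunsplit((scheme, netloc, path, parts.query, ""))


def body_fingerprint(body_text: str) -> str:
    """Hash the body text after collapsing runs of whitespace."""
    collapsed = " ".join(body_text.split())
    return hashlib.sha1(collapsed.encode("utf-8")).hexdigest()


def likely_equivalent(a: PageMeta, b: PageMeta, signal: str = "body") -> bool:
    """Treat two URLs as equivalent if they normalize to the same string,
    or if the chosen metadata signal ("size" or "body") matches exactly."""
    if normalize_url(a.url) == normalize_url(b.url):
        return True
    if signal == "size":
        return a.size == b.size
    if signal == "body":
        return body_fingerprint(a.body_text) == body_fingerprint(b.body_text)
    raise ValueError(f"unknown signal: {signal!r}")


if __name__ == "__main__":
    root = PageMeta("http://example.com/", 1200, "Hello   world")
    index = PageMeta("http://example.com/index.html", 1200, "Hello world\n")
    print(likely_equivalent(root, index, signal="body"))
```

In this sketch the two example URLs do not normalize to the same string, so the decision falls to the metadata comparison. Exact matching is the simplest possible choice; a tolerance on page size or a similarity measure on body text would be alternative assumptions.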
Original language | English |
---|---|
Title of host publication | Proceedings of the 2008 International Conference on Computer and Electrical Engineering, ICCEE 2008 |
Pages | 331-335 |
Number of pages | 5 |
Publication status | Published - 2008 |
Externally published | Yes |
Event | International Conference on Computer and Electrical Engineering 2008 - Phuket, Thailand; Duration: 20 Dec 2008 → 22 Dec 2008; https://ieeexplore.ieee.org/xpl/conhome/4740925/proceeding (Proceedings) |
Conference
Conference | International Conference on Computer and Electrical Engineering 2008 |
---|---|
Abbreviated title | ICCEE 2008 |
Country/Territory | Thailand |
City | Phuket |
Period | 20/12/08 → 22/12/08 |