Abstract
URL signature has been proposed for web crawling as a way to avoid processing duplicated web pages during further crawling. In this paper, we present a performance study of WebSPHINX, an open source web crawler, in which we have embedded URL signature. The experimental results indicate that, compared with the same crawler without URL signature, URL signature significantly reduces the processing of duplicated web pages during further crawling, at negligible cost.
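The abstract does not define how a URL signature is computed, so the following is only a minimal sketch of the general idea, not the paper's method or WebSPHINX code. It assumes a signature is a hash of a normalized URL, and that a crawler skips any URL whose signature it has already seen. The function names `normalize_url`, `url_signature`, and `filter_unseen`, the choice of SHA-256, and the specific normalization rules are all illustrative assumptions.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit

# Assumption: only these default ports are treated as redundant.
DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_url(url: str) -> str:
    """Apply a few common normalization rules (illustrative, not exhaustive)."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()  # note: any userinfo is dropped
    port = parts.port
    # Keep the port only when it differs from the scheme's default.
    if port is None or DEFAULT_PORTS.get(scheme) == port:
        netloc = host
    else:
        netloc = f"{host}:{port}"
    path = parts.path or "/"  # an empty path is treated as the root path
    # The fragment is discarded because it does not change the fetched page.
    return urlunsplit((scheme, netloc, path, parts.query, ""))


def url_signature(url: str) -> str:
    """Hypothetical signature: SHA-256 hex digest of the normalized URL."""
    return hashlib.sha256(normalize_url(url).encode("utf-8")).hexdigest()


def filter_unseen(urls, seen=None):
    """Return URLs whose signature is new, recording each signature in `seen`."""
    seen = set() if seen is None else seen
    fresh = []
    for url in urls:
        sig = url_signature(url)
        if sig not in seen:
            seen.add(sig)
            fresh.append(url)
    return fresh
```

In this sketch, a crawler would call `filter_unseen` on the links extracted from each page and enqueue only the returned URLs, so that syntactically different URLs pointing at the same page are fetched once.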
Original language | English |
---|---|
Title of host publication | Proceedings - 2012 4th Conference on Data Mining and Optimization, DMO 2012 |
Pages | 127-130 |
Number of pages | 4 |
Publication status | Published - 2012 |
Externally published | Yes |
Event | Conference on Data Mining and Optimization 2012 - Langkawi, Malaysia; Duration: 2 Sept 2012 → 4 Sept 2012; Conference number: 4th; https://ieeexplore.ieee.org/xpl/conhome/6322848/proceeding (Proceedings) |
Publication series
Name | Conference on Data Mining and Optimization |
---|---|
ISSN (Print) | 2155-6938 |
ISSN (Electronic) | 2155-6946 |
Conference
Conference | Conference on Data Mining and Optimization 2012 |
---|---|
Abbreviated title | DMO 2012 |
Country/Territory | Malaysia |
City | Langkawi |
Period | 2/09/12 → 4/09/12 |
Keywords
- URL normalization
- URL signature
- web crawling