Identifying equivalent URLs using URL signatures

Lay Ki Soon, Sang Ho Lee

Research output: Chapter in Book/Report/Conference proceeding > Conference Paper > Research > peer-review

6 Citations (Scopus)

Abstract

In the standard URL normalization mechanism, URLs are normalized syntactically by a set of predefined steps. In this paper, we propose to enhance standard URL normalization by incorporating semantically meaningful metadata of the Web pages. The metadata taken into account are the body texts of the Web pages, which can be extracted during HTML parsing. Given a URL that has undergone the standard normalization mechanism, we construct its URL signature by hashing (fingerprinting) the body text of the associated Web page using the Message-Digest algorithm 5 (MD5). URLs that share identical signatures are considered equivalent in our scheme. The experimental results show that our proposed method reduces redundant Web information retrieval by a further 34.57% in comparison with the standard URL normalization mechanism.
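
The signature scheme described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify how body text is extracted or cleaned, so the `BodyTextExtractor` class, the whitespace collapsing, and the function names `body_text`, `url_signature`, and `group_equivalent` are all assumptions made for illustration. The sketch also assumes the URLs passed in have already undergone standard syntactic normalization and that their HTML has already been fetched.

```python
import hashlib
from collections import defaultdict
from html.parser import HTMLParser


class BodyTextExtractor(HTMLParser):
    """Hypothetical helper: collects text inside <body>, skipping <script>/<style>."""

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif tag in ("script", "style") and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.in_body and self.skip_depth == 0:
            self.chunks.append(data)


def body_text(html):
    """Return the body text of an HTML document as a single string."""
    parser = BodyTextExtractor()
    parser.feed(html)
    parser.close()
    # Assumption (not from the paper): collapse whitespace so that
    # layout-only differences do not change the signature.
    return " ".join(" ".join(parser.chunks).split())


def url_signature(html):
    """MD5 hex digest of the page's body text, used as the URL signature."""
    return hashlib.md5(body_text(html).encode("utf-8")).hexdigest()


def group_equivalent(pages):
    """Group already-normalized URLs whose pages share an identical signature.

    `pages` maps each URL to the HTML fetched for it.
    """
    groups = defaultdict(list)
    for url, html in pages.items():
        groups[url_signature(html)].append(url)
    return list(groups.values())
```

Under these assumptions, two syntactically distinct URLs such as `http://example.com/` and `http://example.com/index.html` land in the same group whenever their pages yield identical body text, which is the kind of redundancy that syntactic normalization alone cannot detect.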

Original language: English
Title of host publication: SITIS 2008 - Proceedings of the 4th International Conference on Signal Image Technology and Internet Based Systems
Pages: 203-210
Number of pages: 8
Publication status: Published - 2008
Externally published: Yes
Event: International Conference on Signal Image Technology & Internet Based Systems 2008 - Bali, Indonesia
Duration: 30 Nov 2008 to 3 Dec 2008
Conference number: 4th
https://ieeexplore.ieee.org/xpl/conhome/4725760/proceeding (Proceedings)

Conference

Conference: International Conference on Signal Image Technology & Internet Based Systems 2008
Abbreviated title: SITIS 2008
Country/Territory: Indonesia
City: Bali
Period: 30/11/08 to 3/12/08
