TY - JOUR
T1 - Clustering web documents using cocitation, coupling, incoming, and outgoing hyperlinks
T2 - a comparative performance analysis of algorithms
AU - Tanti Wijaya, Derry
AU - Bressan, Stéphane
PY - 2006/5/1
Y1 - 2006/5/1
N2 - Querying search engines with the keyword “jaguars” returns results as diverse as web sites about cars, computer games, attack planes, American football, and animals. More and more search engines offer options to organize query results by category or, given a document, to return a list of links to topically related documents. While information retrieval traditionally defines the similarity of documents in terms of their content, it seems natural to expect that the structure of the Web itself carries important information about the topical similarity of documents. Here we study the role of a matrix constructed from weighted cocitations (documents referenced by the same document), weighted couplings (documents referencing the same document), incoming links, and outgoing links in the clustering of documents on the Web. We present and discuss three clustering methods based on this matrix, using the K-means, Markov, and Maximum Spanning Tree clustering algorithms, respectively. Our main contribution is a clustering technique based on the Maximum Spanning Tree and an evaluation of its effectiveness compared with the two most robust alternatives, K-means and Markov clustering.
AB - Querying search engines with the keyword “jaguars” returns results as diverse as web sites about cars, computer games, attack planes, American football, and animals. More and more search engines offer options to organize query results by category or, given a document, to return a list of links to topically related documents. While information retrieval traditionally defines the similarity of documents in terms of their content, it seems natural to expect that the structure of the Web itself carries important information about the topical similarity of documents. Here we study the role of a matrix constructed from weighted cocitations (documents referenced by the same document), weighted couplings (documents referencing the same document), incoming links, and outgoing links in the clustering of documents on the Web. We present and discuss three clustering methods based on this matrix, using the K-means, Markov, and Maximum Spanning Tree clustering algorithms, respectively. Our main contribution is a clustering technique based on the Maximum Spanning Tree and an evaluation of its effectiveness compared with the two most robust alternatives, K-means and Markov clustering.
KW - Clustering
KW - Coupling
KW - Cocitation
KW - Hyperlinks
KW - Search engines
UR - http://www.scopus.com/inward/record.url?scp=77954277914&partnerID=8YFLogxK
U2 - 10.1108/17440080680000102
DO - 10.1108/17440080680000102
M3 - Article
AN - SCOPUS:77954277914
SN - 1744-0084
VL - 2
SP - 69
EP - 76
JO - International Journal of Web Information Systems
JF - International Journal of Web Information Systems
IS - 2
ER -