SimUSF: An efficient and effective similarity measure that is invariant to violations of the interval scale assumption

Thilak L. Fernando, Geoffrey I. Webb

    Research output: Contribution to journal, Article, Research, peer-review

    3 Citations (Scopus)

    Abstract

    Similarity measures are central to many machine learning algorithms. There are many different similarity measures, each catering for different applications and data requirements. Most similarity measures used with numerical data assume that the attributes are on an interval scale, that is, that a unit difference has the same meaning irrespective of the magnitudes of the values it separates. When this assumption is violated, accuracy may be reduced. Our experiments show that removing the interval scale assumption by transforming data to ranks can improve the accuracy of distance-based similarity measures on some tasks. However, the rank transform has high time and storage overheads. In this paper, we introduce an efficient similarity measure which does not consider the magnitudes of inter-instance distances. We compare the new similarity measure with popular similarity measures in two applications, DBSCAN clustering and content-based multimedia information retrieval, using real-world datasets and different transform functions. The results show that the proposed similarity measure provides good performance on a range of tasks and is invariant to violations of the interval scale assumption.
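The rank-transform idea mentioned in the abstract can be illustrated with a small sketch. This is not the SimUSF measure itself, which the abstract does not specify. It only shows, on made-up one-dimensional data, how a monotone rescaling (here `log10`) can change which point is nearest under a plain distance, while the ranks of the values stay the same. The helper names `rank_transform` and `nearest` and the sample values are assumptions chosen for illustration.

```python
import math

def rank_transform(column):
    """Replace each value by its 1-based rank; tied values share the average rank."""
    order = sorted(range(len(column)), key=lambda i: column[i])
    ranks = [0.0] * len(column)
    i = 0
    while i < len(order):
        # Extend j over the run of values tied with the one at position i.
        j = i
        while j + 1 < len(order) and column[order[j + 1]] == column[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks

def nearest(values, query_index):
    """Index of the value closest to values[query_index] by absolute difference.
    Ties are broken in favour of the lowest index."""
    q = values[query_index]
    candidates = [i for i in range(len(values)) if i != query_index]
    return min(candidates, key=lambda i: abs(values[i] - q))

# Hypothetical attribute values and a monotone rescaling of them.
raw = [1.0, 5.0, 10.0, 16.0, 100.0]
logged = [math.log10(v) for v in raw]

# The nearest neighbour of the value 10.0 depends on the scale used ...
print(nearest(raw, 2), nearest(logged, 2))

# ... but the ranks are identical under any strictly increasing transform.
print(rank_transform(raw) == rank_transform(logged))
```

Sorting every attribute, as `rank_transform` does, costs O(n log n) time per attribute plus storage for the ranks, which gives a rough sense of the overhead the abstract attributes to the rank transform.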

    Original language: English
    Pages (from-to): 264-286
    Number of pages: 23
    Journal: Data Mining and Knowledge Discovery
    Volume: 31
    Issue number: 1
    DOI: 10.1007/s10618-016-0463-0
    Publication status: Published - Jan 2017

    Keywords

    • Similarity measure
    • Interval scale
    • Clustering
    • CBMIR

    Cite this

    @article{13b59318faec402eacb909e1b72d1dac,
    title = "SimUSF: An efficient and effective similarity measure that is invariant to violations of the interval scale assumption",
    keywords = "Similarity measure, Interval scale, Clustering, CBMIR",
    author = "Fernando, {Thilak L.} and Webb, {Geoffrey I.}",
    year = "2017",
    month = "1",
    doi = "10.1007/s10618-016-0463-0",
    language = "English",
    volume = "31",
    pages = "264--286",
    journal = "Data Mining and Knowledge Discovery",
    issn = "1384-5810",
    publisher = "Springer",
    number = "1",

    }

    SimUSF: An efficient and effective similarity measure that is invariant to violations of the interval scale assumption. / Fernando, Thilak L.; Webb, Geoffrey I.

    In: Data Mining and Knowledge Discovery, Vol. 31, No. 1, 01.2017, p. 264-286.

    TY - JOUR

    T1 - SimUSF

    T2 - An efficient and effective similarity measure that is invariant to violations of the interval scale assumption

    AU - Fernando, Thilak L.

    AU - Webb, Geoffrey I.

    PY - 2017/1

    Y1 - 2017/1

    AB - Similarity measures are central to many machine learning algorithms. There are many different similarity measures, each catering for different applications and data requirements. Most similarity measures used with numerical data assume that the attributes are interval scale. In the interval scale, it is assumed that a unit difference has the same meaning irrespective of the magnitudes of the values separated. When this assumption is violated, accuracy may be reduced. Our experiments show that removing the interval scale assumption by transforming data to ranks can improve the accuracy of distance-based similarity measures on some tasks. However the rank transform has high time and storage overheads. In this paper, we introduce an efficient similarity measure which does not consider the magnitudes of inter-instance distances. We compare the new similarity measure with popular similarity measures in two applications: DBScan clustering and content based multimedia information retrieval with real world datasets and different transform functions. The results show that the proposed similarity measure provides good performance on a range of tasks and is invariant to violations of the interval scale assumption.

    KW - Similarity measure

    KW - Interval scale

    KW - Clustering

    KW - CBMIR

    UR - http://www.scopus.com/inward/record.url?scp=84966699249&partnerID=8YFLogxK

    U2 - 10.1007/s10618-016-0463-0

    DO - 10.1007/s10618-016-0463-0

    M3 - Article

    AN - SCOPUS:84966699249

    VL - 31

    SP - 264

    EP - 286

    JO - Data Mining and Knowledge Discovery

    JF - Data Mining and Knowledge Discovery

    SN - 1384-5810

    IS - 1

    ER -