A comparative study of data-dependent approaches without learning in measuring similarities of data objects

Sunil Aryal, Kai Ming Ting, Takashi Washio, Gholamreza Haffari

Research output: Contribution to journal › Article › Research › peer-review

Abstract

Conventional general-purpose distance-based similarity measures, such as Minkowski distance (also known as ℓp-norm with p > 0), are data-independent and sensitive to units or scales of measurement. There are existing general-purpose data-dependent measures, such as rank difference, Lin’s probabilistic measure and mp-dissimilarity (p > 0), which are not sensitive to units or scales of measurement. Although they have been shown to be more effective than the traditional distance measures, their characteristics and relative performances have not been investigated. In this paper, we study the characteristics and relationships of different general-purpose data-dependent measures. We generalise mp-dissimilarity where p ≥ 0 by introducing m-dissimilarity and show that it is a generic data-dependent measure with data-dependent self-similarity, of which rank difference and Lin’s measure are special cases with data-independent self-similarity. We evaluate the effectiveness of a wide range of general-purpose data-dependent and data-independent measures in the content-based information retrieval and kNN classification tasks. Our findings show that the fully data-dependent measure of mp-dissimilarity is a more effective alternative to other data-dependent and commonly used distance-based similarity measures as its task-specific performance is more consistent across a wide range of datasets.
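The contrast the abstract draws between scale-sensitive and scale-insensitive measures can be illustrated with a minimal sketch. It assumes that "rank difference" means an ℓp distance computed on rank-transformed attributes, which is an interpretation and not a definition taken from the paper. The toy data, the function names and the choice p = 2 are all hypothetical, and the sketch does not implement mp-dissimilarity or Lin's measure.

```python
def minkowski(x, y, p=2.0):
    """l_p-norm distance between two equal-length vectors (p > 0)."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def rank_transform(data):
    """Replace each value by its 1-based rank within its attribute (column).

    Assumption: no ties within an attribute; tie handling is omitted.
    """
    n_attrs = len(data[0])
    ranked = [[0] * n_attrs for _ in data]
    for j in range(n_attrs):
        order = sorted(range(len(data)), key=lambda i: data[i][j])
        for rank, i in enumerate(order, start=1):
            ranked[i][j] = rank
    return ranked


def rank_difference(data, i, j, p=2.0):
    """l_p distance between objects i and j after rank transformation."""
    ranked = rank_transform(data)
    return minkowski(ranked[i], ranked[j], p)


# Hypothetical toy data: 5 objects described by 2 attributes.
data = [[1.0, 10.0], [2.0, 40.0], [3.0, 20.0], [10.0, 30.0], [4.0, 50.0]]

# Same objects with the first attribute expressed in a different unit.
scaled = [[row[0] * 1000.0, row[1]] for row in data]

d_orig = minkowski(data[0], data[1])        # data-independent distance
d_scaled = minkowski(scaled[0], scaled[1])  # changes with the unit
r_orig = rank_difference(data, 0, 1)        # depends on the data distribution
r_scaled = rank_difference(scaled, 0, 1)    # unchanged by the rescaling

print(d_orig, d_scaled)
print(r_orig, r_scaled)
```

In this sketch the Minkowski distance between the first two objects changes when one attribute is rescaled, while the rank-based value does not, because a change of unit preserves the ordering of values within each attribute.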

Original language: English
Pages (from-to): 124-162
Number of pages: 39
Journal: Data Mining and Knowledge Discovery
Volume: 34
Issue number: 1
Publication status: Published - 2020

Keywords

  • Data-dependent similarity measures
  • Distance measures
  • Lin’s probabilistic similarity
  • m-dissimilarity
  • Rank transformation
  • ℓp-norm
