Conventional general-purpose distance-based similarity measures, such as Minkowski distance (also known as ℓp-norm with p> 0), are data-independent and sensitive to units or scales of measurement. There are existing general-purpose data-dependent measures, such as rank difference, Lin’s probabilistic measure and mp-dissimilarity (p> 0), which are not sensitive to units or scales of measurement. Although they have been shown to be more effective than the traditional distance measures, their characteristics and relative performances have not been investigated. In this paper, we study the characteristics and relationships of different general-purpose data-dependent measures. We generalise mp-dissimilarity where p≥ 0 by introducing m-dissimilarity and show that it is a generic data-dependent measure with data-dependent self-similarity, of which rank difference and Lin’s measure are special cases with data-independent self-similarity. We evaluate the effectiveness of a wide range of general-purpose data-dependent and data-independent measures in the content-based information retrieval and kNN classification tasks. Our findings show that the fully data-dependent measure of mp-dissimilarity is a more effective alternative to other data-dependent and commonly-used distance-based similarity measures as its task-specific performance is more consistent across a wide range of datasets.
- Data-dependent similarity measures
- Distance measures
- Lin’s probabilistic similarity
- Rank transformation