A "Roziah" by any other name: a simple Bayesian method for determining ethnicity from names

Kridaraan Komahan, Daniel Reidpath

    Research output: Contribution to journal > Article > Research > peer-review

    1 Citation (Scopus)

    Abstract

    Correct identification of ethnicity is central to many epidemiologic analyses. Unfortunately, ethnicity data are often missing. Successful classification typically relies on large databases (n > 500,000 names) of known name-ethnicity associations. We propose an alternative naïve Bayesian strategy that uses substrings of full names. Name and ethnicity data for Malays, Indians, and Chinese were provided by a health and demographic surveillance site operating in Malaysia from 2011-2013. The data comprised a training data set (n = 10,104) and a test data set (n = 9,992). Names were spliced into contiguous 3-letter substrings, and these were used as the basis for the Bayesian analysis. Performance was evaluated on both data sets using Cohen's κ and measures of sensitivity and specificity. There was little difference between the classification performance in the training and test data (κ = 0.93 and 0.94, respectively). For the test data, the sensitivity values for the Malay, Indian, and Chinese names were 0.997, 0.855, and 0.932, respectively, and the specificity values were 0.907, 0.998, and 0.997, respectively. A naïve Bayesian strategy for the classification of ethnicity is promising. It performs at least as well as more sophisticated approaches. The possible application to smaller data sets is particularly appealing. Further research examining other substring lengths and other ethnic groups is warranted.
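
The substring approach described in the abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the handling of spaces and letter case, the add-one (Laplace) smoothing, the class name `TrigramNaiveBayes`, and the six toy names are all assumptions made for the example. The abstract specifies only that names were split into contiguous 3-letter substrings and classified with a naïve Bayesian method.

```python
import math
from collections import Counter, defaultdict


def trigrams(name):
    """Contiguous 3-letter substrings of a name.

    Assumption: letters are lowercased and spaces/punctuation are dropped
    before splitting; the abstract does not say how these were handled.
    """
    letters = "".join(ch for ch in name.lower() if ch.isalpha())
    return [letters[i:i + 3] for i in range(len(letters) - 2)]


class TrigramNaiveBayes:
    """Hypothetical multinomial naive Bayes over name trigrams."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                       # smoothing constant (assumed)
        self.class_counts = Counter()            # names seen per class
        self.gram_counts = defaultdict(Counter)  # trigram counts per class
        self.vocab = set()                       # all trigrams seen in training

    def fit(self, names, labels):
        for name, label in zip(names, labels):
            self.class_counts[label] += 1
            for gram in trigrams(name):
                self.gram_counts[label][gram] += 1
                self.vocab.add(gram)
        return self

    def predict(self, name):
        total = sum(self.class_counts.values())
        # Reserve one extra vocabulary slot for trigrams never seen in training.
        vocab_size = len(self.vocab) + 1
        best_label, best_score = None, -math.inf
        for label, n_names in self.class_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n_names / total)
            denom = sum(self.gram_counts[label].values()) + self.alpha * vocab_size
            for gram in trigrams(name):
                score += math.log((self.gram_counts[label][gram] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data invented for illustration; not taken from the study.
train = [
    ("Roziah binti Ahmad", "Malay"),
    ("Nurul Huda binti Ismail", "Malay"),
    ("Tan Ah Kow", "Chinese"),
    ("Lim Mei Ling", "Chinese"),
    ("Rajesh Kumar", "Indian"),
    ("Subramaniam Muthusamy", "Indian"),
]
model = TrigramNaiveBayes().fit(*zip(*train))

print(trigrams("Roziah"))  # ['roz', 'ozi', 'zia', 'iah']
print(model.predict("Lim Ah Kow"))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, and smoothing keeps a single unseen substring from driving a class probability to zero. With only six training names the predictions are purely illustrative; the study itself trained on 10,104 names.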
    Original language: English
    Pages (from-to): 325-329
    Number of pages: 5
    Journal: American Journal of Epidemiology
    Volume: 180
    Issue number: 3
    Publication status: Published - 2014
