Improving scalability and performance of random forest based learning-to-rank algorithms by aggressive subsampling

Muhammad Ibrahim, Mark Carman

    Research output: Chapter in Book/Report/Conference proceeding › Conference Paper › Research › peer-review

    2 Citations (Scopus)

    Abstract

    Random forest based learning-to-rank (LtR) algorithms exhibit performance competitive with other state-of-the-art algorithms. Traditionally, each tree of the forest is learnt from a bootstrapped copy (sampled with replacement) of the training set, in which approximately 63% of the examples are unique, although some studies show that sampling without replacement also works well. The goal of using a bootstrapped copy instead of the original training set is to reduce correlation among individual trees, thereby making the prediction of the ensemble more accurate. In this study, we investigate whether we can decrease the correlation of the trees even further without compromising accuracy. Among several potential options, we focus on the sub-sample used for learning individual trees, and examine the performance of a random forest based LtR algorithm as we reduce the size of these sub-samples. Experiments on the LETOR data sets reveal that a substantial reduction in training time can be achieved using only a small amount of training data. Moreover, accuracy is likely to increase while maintaining the same level of performance stability as the baseline. Thus, in addition to the existing benefit of being completely parallelizable, this study empirically identifies yet another ingredient that makes random forest based LtR algorithms one of the top contenders for large scale LtR.
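
    To make the idea concrete, here is a minimal sketch (not the authors' code) of per-tree subsampling, using scikit-learn's RandomForestRegressor as a stand-in for a pointwise random forest ranker; the max_samples parameter (scikit-learn 0.22+) plays the role of the sub-sample size studied in the paper, and the data is synthetic rather than a LETOR set.

        import time
        from sklearn.datasets import make_regression
        from sklearn.ensemble import RandomForestRegressor

        # Synthetic stand-in for a LETOR-style feature matrix (46 features),
        # with regression targets in place of relevance labels.
        X, y = make_regression(n_samples=50_000, n_features=46, random_state=0)

        # None = standard full-size bootstrap; smaller fractions give
        # increasingly aggressive per-tree subsampling.
        for frac in (None, 0.1, 0.01):
            forest = RandomForestRegressor(
                n_estimators=100,
                bootstrap=True,
                max_samples=frac,   # per-tree sub-sample size
                n_jobs=-1,          # trees are learnt in parallel
                random_state=0,
            )
            start = time.time()
            forest.fit(X, y)
            print(f"max_samples={frac}: fit in {time.time() - start:.1f}s")

    With a 1% sub-sample each tree sees only 500 examples, so training time drops sharply; the paper's finding is that ranking accuracy on the LETOR sets need not suffer.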

    Original language: English
    Title of host publication: Data Mining and Analytics 2014
    Subtitle of host publication: Proceedings of the Twelfth Australasian Data Mining Conference (AusDM’14), Brisbane, Australia, 27-28 November 2014
    Editors: Xue Li, Lin Liu, Kok-Leong Ong, Yanchang Zhao
    Place of Publication: New South Wales, Australia
    Publisher: Australian Computer Society Inc
    Pages: 91-99
    Number of pages: 9
    ISBN (Electronic): 9781921770173
    Publication status: Published - 2014
    Event: Australasian Data Mining Conference 2014 - Queensland University of Technology, Brisbane, Australia
    Duration: 27 Nov 2014 - 28 Nov 2014
    Conference number: 12th
    https://dblp.org/db/conf/ausdm/ausdm2014.html (Proceedings)

    Publication series

    Name: Conferences in Research and Practice in Information Technology (CRPIT)
    Publisher: Australian Computer Society Inc.
    Volume: 158
    ISSN (Print): 1445-1336

    Conference

    Conference: Australasian Data Mining Conference 2014
    Abbreviated title: AusDM 2014
    Country/Territory: Australia
    City: Brisbane
    Period: 27/11/14 - 28/11/14

    Keywords

    • Bootstrapping
    • Correlation
    • Learning-to-rank
    • Random forest
    • Scalability
    • Sub-sampling
