Characterization and prediction of popular projects on GitHub

Junxiao Han, Shuiguang Deng, Xin Xia, Dongjing Wang, Jianwei Yin

Research output: Chapter in Book/Report/Conference proceeding › Conference Paper › Research › peer-review

Abstract

GitHub is a large and popular platform that hosts a wide variety of open source projects. Despite the prevalence of the GitHub platform, not every project gains high popularity. Identifying popular projects on GitHub can help developers choose suitable projects to follow or contribute to, and can provide guidance on building a popular project. In this paper, we propose an approach to predict the popularity of GitHub projects. We first conduct online surveys with GitHub users to determine the threshold (the number of stars of a project) that separates popular from unpopular projects. Next, we extract 35 features from both GitHub and Stack Overflow, which are divided into three dimensions: project, evolutionary, and project owner. A random forest classifier is built on these features to identify popular GitHub projects. To evaluate the performance of our approach, we collect a large-scale dataset from GitHub that contains a total of 409,784 GitHub projects and 174,784 GitHub users. Our model achieves an average AUC of 0.76, a statistically significant and substantial improvement over the state of the art. We also study which features are most important in distinguishing popular projects from unpopular ones. Experimental results show that the number of branches, the number of open issues, and the number of contributors play the most important roles in identifying popular projects, and all of them have large effect sizes.
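As a rough illustration of the evaluation described in the abstract, the sketch below labels projects as popular by a star threshold and computes AUC as the fraction of (popular, unpopular) pairs that a classifier ranks correctly. It does not reproduce the paper's random forest or its 35 features. The threshold value, the "at or above" comparison, the toy data, and the function names `label_popular` and `auc` are all assumptions made for illustration; the abstract does not state the threshold chosen from the surveys.

```python
from typing import List, Sequence

# Hypothetical star threshold; the abstract does not give the surveyed value.
STAR_THRESHOLD = 100


def label_popular(stars: Sequence[int], threshold: int = STAR_THRESHOLD) -> List[int]:
    """Return 1 for projects at or above the threshold (an assumption), else 0."""
    return [1 if s >= threshold else 0 for s in stars]


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the share of (popular, unpopular) pairs ranked correctly.

    A tied pair counts as half a correct ranking.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one example of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data, invented for illustration only.
stars = [5, 250, 40, 1200, 90, 300]
scores = [0.10, 0.80, 0.35, 0.90, 0.60, 0.55]  # stand-in classifier outputs

labels = label_popular(stars)
print(labels, auc(labels, scores))
```

The pairwise definition is quadratic in the number of projects, so it suits small examples only; on a dataset of the size reported in the abstract, a rank-based computation would be the more practical choice.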

Original language: English
Title of host publication: Proceedings - 2019 IEEE 43rd Annual Computer Software and Applications Conference, COMPSAC 2019
Editors: Vladimir Getov, Jean-Luc Gaudiot, Nariyoshi Yamai, Stelvio Cimato, Morris Chang, Yuuichi Teranishi, Ji-Jiang Yang, Hong Va Leong, Hossian Shahriar, Michiharu Takemoto, Dave Towey, Hiroki Takakura, Atilla Elci, Susumu Takeuchi, Satish Puri
Place of Publication: Piscataway NJ USA
Publisher: IEEE, Institute of Electrical and Electronics Engineers
Pages: 21-26
Number of pages: 6
ISBN (Electronic): 9781728126074
DOIs: https://doi.org/10.1109/COMPSAC.2019.00013
Publication status: Published - 2019
Event: International Computer Software and Applications Conference 2019 - Milwaukee, United States of America
Duration: 15 Jul 2019 to 19 Jul 2019
Conference number: 43rd
https://ieeecompsac.computer.org/2019/

Conference

Conference: International Computer Software and Applications Conference 2019
Abbreviated title: COMPSAC 2019
Country: United States of America
City: Milwaukee
Period: 15/07/19 to 19/07/19
Internet address: https://ieeecompsac.computer.org/2019/

Keywords

  • Feature Engineering
  • GitHub Project
  • Popularity
  • Prediction Model

Cite this

Han, J., Deng, S., Xia, X., Wang, D., & Yin, J. (2019). Characterization and prediction of popular projects on GitHub. In V. Getov, J-L. Gaudiot, N. Yamai, S. Cimato, M. Chang, Y. Teranishi, J-J. Yang, H. V. Leong, H. Shahriar, M. Takemoto, D. Towey, H. Takakura, A. Elci, S. Takeuchi, & S. Puri (Eds.), Proceedings - 2019 IEEE 43rd Annual Computer Software and Applications Conference, COMPSAC 2019 (pp. 21-26). [8754436] IEEE, Institute of Electrical and Electronics Engineers. https://doi.org/10.1109/COMPSAC.2019.00013