Abstract
GitHub is a large and popular platform that hosts a wide variety of open source projects. Despite the prevalence of the GitHub platform, not every project gains high popularity. Identifying popular projects on GitHub can help developers choose suitable projects to follow or contribute to, and can provide guidance on building a popular project. In this paper, we propose an approach to predict the popularity of GitHub projects. We first conduct online surveys with GitHub users to determine the threshold (the number of stars of a project) that separates popular from unpopular projects. Next, we extract 35 features from both GitHub and Stack Overflow, which are divided into three dimensions: project, evolutionary, and project owner. A random forest classifier is built on these features to identify popular GitHub projects. To evaluate the performance of our approach, we collect a large-scale dataset from GitHub that contains a total of 409,784 GitHub projects and 174,784 GitHub users. Our model achieves an average AUC of 0.76, a statistically significant and substantial improvement over the state of the art. We also study which features are most important in distinguishing popular projects from unpopular ones. Experimental results show that the number of branches, the number of open issues, and the number of contributors play the most important roles in identifying popular projects, and all of them have large effect sizes.
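As an illustration of the evaluation described in the abstract, the following minimal sketch labels projects by a star threshold and computes AUC as the probability that a randomly chosen popular project receives a higher score than a randomly chosen unpopular one. The threshold of 100 stars, the star counts, the classifier scores, and the function names are all hypothetical; the abstract does not state the survey-derived threshold, and this is not the authors' implementation.

```python
from typing import List, Sequence


def label_by_stars(stars: Sequence[int], threshold: int) -> List[int]:
    """Label a project 1 (popular) if its star count reaches the threshold, else 0."""
    return [1 if s >= threshold else 0 for s in stars]


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (popular, unpopular) pairs ranked correctly.

    Ties between a popular and an unpopular score count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one popular and one unpopular project")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical data: star counts and classifier scores for six projects.
stars = [5, 250, 40, 1200, 90, 10]
scores = [0.10, 0.80, 0.55, 0.45, 0.30, 0.20]

HYPOTHETICAL_THRESHOLD = 100  # assumption; the real value comes from the surveys
labels = label_by_stars(stars, HYPOTHETICAL_THRESHOLD)
print(labels)               # [0, 1, 0, 1, 0, 0]
print(auc(labels, scores))  # 0.875
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which gives context for the reported average of 0.76.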
Original language | English |
---|---|
Title of host publication | Proceedings - 2019 IEEE 43rd Annual Computer Software and Applications Conference, COMPSAC 2019 |
Editors | Vladimir Getov, Jean-Luc Gaudiot, Nariyoshi Yamai, Stelvio Cimato, Morris Chang, Yuuichi Teranishi, Ji-Jiang Yang, Hong Va Leong, Hossian Shahriar, Michiharu Takemoto, Dave Towey, Hiroki Takakura, Atilla Elci, Susumu Takeuchi, Satish Puri |
Place of Publication | Piscataway NJ USA |
Publisher | IEEE, Institute of Electrical and Electronics Engineers |
Pages | 21-26 |
Number of pages | 6 |
ISBN (Electronic) | 9781728126074 |
Publication status | Published - 2019 |
Event | International Computer Software and Applications Conference 2019, Milwaukee, United States of America; Duration: 15 Jul 2019 → 19 Jul 2019; Conference number: 43rd; https://ieeecompsac.computer.org/2019/; https://ieeexplore.ieee.org/xpl/conhome/8746989/proceeding (Proceedings) |
Conference
Conference | International Computer Software and Applications Conference 2019 |
---|---|
Abbreviated title | COMPSAC 2019 |
Country/Territory | United States of America |
City | Milwaukee |
Period | 15/07/19 → 19/07/19 |
Keywords
- Feature Engineering
- GitHub Project
- Popularity
- Prediction Model