Topic Recommendation for GitHub Repositories: How Far Can Extreme Multi-Label Learning Go?
Published in: 2023 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER), pp. 167-178
Format: Conference Proceeding
Language: English
Published: IEEE, 01.03.2023
Summary: GitHub is one of the most popular platforms for version control and collaboration. On GitHub, developers can assign related topics to their repositories, which helps in finding similar repositories. The topics assigned to repositories are varied and provide salient descriptions of a repository: some topics describe the technology employed in a project, while others describe the project's functionality, goals, or features. Topics are part of a repository's metadata and are useful for its organization and discoverability. However, the number of possible topics is large, which makes it challenging to assign a relevant set of topics to a repository. While prior studies filter out infrequently occurring topics before their experiments, we find that these topics form the majority of the data. In this study, we address the problem of identifying topics for a GitHub repository by treating it as an extreme multi-label learning (XML) problem. We collect a dataset of 21K GitHub repositories with 37K topic labels. The main challenges for XML are the large number of possible labels and severe data sparsity, which match the characteristics of topic identification for GitHub repositories. We evaluate multiple XML techniques, such as Parabel, Bonsai, LightXML, and ZestXML, and then analyze the different models proposed for XML classification. Among the XML models, ZestXML, which combines zero-shot learning with XML, achieves the best results on all metrics. We also compare the performance of ZestXML with a baseline from a recent study. The results show that ZestXML improves on the baseline's average F1-score by 17.35%. We further find that for repositories whose topics rarely appear in the training data, ZestXML yields large improvements: the average F1-score is three times that of the baseline for topics with 20 or fewer occurrences in the training data.
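As a point of reference for the metric discussed above, the following is a minimal sketch (not code from the paper) of how a sample-averaged F1-score can be computed for multi-label topic prediction; the repository names and topics in the example are hypothetical.

```python
# Minimal sketch: sample-averaged F1 for multi-label topic prediction.
# Each repository has a set of predicted topics and a set of true topics.

def f1(predicted, actual):
    """F1-score for one repository's predicted vs. actual topic sets."""
    predicted, actual = set(predicted), set(actual)
    tp = len(predicted & actual)  # correctly predicted topics
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(actual)
    return 2 * precision * recall / (precision + recall)

def average_f1(predictions, ground_truth):
    """Average the per-repository F1-scores over the evaluation set."""
    scores = [f1(p, g) for p, g in zip(predictions, ground_truth)]
    return sum(scores) / len(scores)

# Hypothetical evaluation set of two repositories.
preds = [["python", "machine-learning"], ["cli", "rust"]]
truth = [["python", "machine-learning", "nlp"], ["cli", "go"]]
print(average_f1(preds, truth))  # per-repo F1s: 0.8 and 0.5 -> 0.65
```

Averaging per-sample F1 in this way rewards models that recover rare topics for individual repositories, which is why improvements concentrate on low-frequency topics.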
ISSN: 2640-7574
DOI: 10.1109/SANER56733.2023.00025