Abstract
This work proposes predicting the tags assigned to posts on the Stack Overflow platform. The raw data was obtained from stackexchange.com and includes more than 50K posts and the associated tags given by users. The questions and titles of the posts are pre-processed, and the sentences in the posts are then transformed into features via Latent Dirichlet Allocation. The problem is a multi-class, multi-label classification task; hence, we propose 1) one-against-all models for the 15 most popular tags, and 2) a combined multi-tag classifier that finds the top K tags for a single post. Three algorithms are used to train the one-against-all classifiers, which decide to what extent a post belongs to a tag. The per-tag probabilities produced by the best-performing algorithm are then combined to give the results of the multi-tag classifier. The performance is compared with a baseline approach (kNN). Our multi-tag classifier achieves 55% recall and 39% F1-score.
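
The combination step described in the abstract can be illustrated with a minimal sketch. It assumes each one-against-all classifier has already produced a probability for one post, and that the multi-tag result is simply the K highest-scoring tags; the abstract does not specify the exact combination rule or how recall and F1 are averaged, so the function names (`top_k_tags`, `recall_and_f1`), the example probabilities, and the per-post metric definition are all hypothetical.

```python
from typing import Dict, List, Set, Tuple


def top_k_tags(tag_probs: Dict[str, float], k: int) -> List[str]:
    """Return the k tags with the highest one-against-all probability."""
    # Sort by descending probability; break ties alphabetically for determinism.
    ranked = sorted(tag_probs.items(), key=lambda item: (-item[1], item[0]))
    return [tag for tag, _ in ranked[:k]]


def recall_and_f1(predicted: List[str], actual: Set[str]) -> Tuple[float, float]:
    """Per-post recall and F1 of a predicted tag list against the true tags."""
    hits = len(set(predicted) & actual)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(actual) if actual else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return recall, f1


# Hypothetical outputs of the per-tag binary classifiers for one post
# (illustrative values only, not taken from the paper).
example_probs = {"python": 0.81, "pandas": 0.64, "java": 0.05, "c#": 0.02}
example_truth = {"python", "pandas", "numpy"}

predicted = top_k_tags(example_probs, k=3)
recall, f1 = recall_and_f1(predicted, example_truth)
print(predicted, round(recall, 2), round(f1, 2))
```

Because only the 15 most popular tags have a classifier in the described setup, a post whose true tags fall outside that set can never be fully recalled under this scheme, which is one plausible reason for reporting recall alongside F1.
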
Original language | English |
---|---|
Title of host publication | Proceedings - 2020 IEEE/ACM 42nd International Conference on Software Engineering Workshops, ICSEW 2020 |
Publisher | Association for Computing Machinery, Inc |
Pages | 489-493 |
Number of pages | 5 |
ISBN (Electronic) | 9781450379632 |
Publication status | Published - 27 Jun 2020 |
Event | 42nd IEEE/ACM International Conference on Software Engineering Workshops, ICSEW 2020 - Seoul, Korea, Republic of; Duration: 27 Jun 2020 → 19 Jul 2020 |
Publication series
Name | Proceedings - 2020 IEEE/ACM 42nd International Conference on Software Engineering Workshops, ICSEW 2020 |
---|
Conference
Conference | 42nd IEEE/ACM International Conference on Software Engineering Workshops, ICSEW 2020 |
---|---|
Country/Territory | Korea, Republic of |
City | Seoul |
Period | 27/06/20 → 19/07/20 |
Bibliographical note
Publisher Copyright: © 2020 ACM.
Keywords
- Latent Dirichlet Allocation
- Stack Overflow
- tag prediction