Improve topic modeling algorithms based on Twitter hashtags

Today with increase using social media, a lot of researchers have interested in topic extraction from Twitter. Twitter is an unstructured short text and messy that it is critical to find topics from tweets. While topic modeling algorithms such as Latent semantic analysis (LSA) and Latent Dirichlet A...

Full description

Saved in:
Bibliographic Details
Published inJournal of physics. Conference series Vol. 1660; no. 1; pp. 12100 - 12108
Main Authors Alash, Hayder M, Al-Sultany, Ghaidaa A
Format Journal Article
LanguageEnglish
Published Bristol IOP Publishing 01.11.2020
Subjects
Online AccessGet full text
ISSN1742-6588
1742-6596
DOI10.1088/1742-6596/1660/1/012100

Cover

More Information
Summary:Today with increase using social media, a lot of researchers have interested in topic extraction from Twitter. Twitter is an unstructured short text and messy that it is critical to find topics from tweets. While topic modeling algorithms such as Latent semantic analysis (LSA) and Latent Dirichlet Allocation (LDA) are originally designed to derive topics from large documents such as articles, and books. They are often less efficient when applied to short text content like Twitter. Luckily, Twitter has many features that represent the interaction between users. Tweets have rich user-generated hashtags as keywords. In this paper, we exploit the hashtags feature to improve topics learned from Twitter content without modifying the basic topic model of LSA and LDA. Users who share the same hashtag at most discuss the same topic. We compare the performance of the two methods (LSA and LDA) using the topic coherence ( with and without hashtags). The experiment result on the Twitter dataset showed that LSA has better coherence score with hashtags than that do not incorporate hashtags. In contrast, our experiments show that the LDA has a better coherence score without incorporating hashtags. Finally, LDA has a better coherence score than LSA and the best coherence result obtained from the LDA method was (0.6047) and the LSA method was (0.4744) but the number of topics in LDA was higher than LSA. Thus, LDA may cause the same tweets to discuss the same subject set into different clustering.
Bibliography:ObjectType-Article-1
SourceType-Scholarly Journals-1
ObjectType-Feature-2
content type line 14
ISSN:1742-6588
1742-6596
DOI:10.1088/1742-6596/1660/1/012100