LDA-PSTR: A Topic Modeling Method for Short Text
Topic detection in short text has become an important task for applications of content analysis. Topic modeling is an effective way for discovering topics by finding document-level word co-occurrence patterns. Generally, most of conventional topic models are based on bag-of-words representation in w...
Saved in:
Published in | Advanced Data Mining and Applications Vol. 11323; pp. 339 - 352 |
---|---|
Main Authors | , |
Format | Book Chapter |
Language | English |
Published |
Switzerland
Springer International Publishing AG
2018
Springer International Publishing |
Series | Lecture Notes in Computer Science |
Subjects | |
Online Access | Get full text |
Cover
Loading…
Summary: | Topic detection in short text has become an important task for applications of content analysis. Topic modeling is an effective way for discovering topics by finding document-level word co-occurrence patterns. Generally, most of conventional topic models are based on bag-of-words representation in which context information of words are ignored. Moreover, when directly applied to short text, it will arise the lack of co-occurrence patterns problem due to the sparseness of unigrams representations. Existing work either performs data expansion by utilizing external knowledge resource, or simply aggregates these semantically related short texts. These methods generally produce low-quality topic representation or suffer from poor semantically correlation between different data resource. In this paper, we propose a different method that is computationally efficient and effective. Our method applies frequent pattern mining to uncover statistically significant patterns which can explicitly capture semantic association and co-occurrences among corpus-level words. We use these frequent patterns as feature units to represent texts, referred as pattern set-based text representation (PSTR). Besides that, in order to represent text more precisely, we propose a new probabilistic topic model called LDA-PSTR. And an improved Gibbs algorithm has been developed for LDA-PSTR. Experiments on different corpus show that such an approach can discover more prominent and coherent topics, and achieve significant performance improvement on several evaluation metrics. |
---|---|
ISBN: | 9783030050894 3030050890 |
ISSN: | 0302-9743 1611-3349 |
DOI: | 10.1007/978-3-030-05090-0_29 |