LDA-PSTR: A Topic Modeling Method for Short Text

Bibliographic Details
Published in	Advanced Data Mining and Applications, Vol. 11323, pp. 339-352
Main Authors	Zhou, Kai; Yang, Qun
Format	Book Chapter
Language	English
Published	Switzerland: Springer International Publishing AG, 2018
Series	Lecture Notes in Computer Science
Subjects	Frequent pattern; LDA; Short text; Text representation; Topic modeling
Summary:	Topic detection in short text has become an important task in content analysis applications. Topic modeling is an effective way to discover topics by finding document-level word co-occurrence patterns. Most conventional topic models are based on a bag-of-words representation, in which the context information of words is ignored. Moreover, when such models are applied directly to short text, co-occurrence patterns are scarce because unigram representations are sparse. Existing work either expands the data using external knowledge resources or simply aggregates semantically related short texts. These methods generally produce low-quality topic representations or suffer from poor semantic correlation between different data resources. In this paper, we propose a different method that is computationally efficient and effective. Our method applies frequent pattern mining to uncover statistically significant patterns that explicitly capture semantic associations and co-occurrences among corpus-level words. We use these frequent patterns as feature units to represent texts, an approach referred to as pattern set-based text representation (PSTR). In addition, to represent text more precisely, we propose a new probabilistic topic model called LDA-PSTR, together with an improved Gibbs sampling algorithm for it. Experiments on different corpora show that this approach discovers more prominent and coherent topics and achieves significant performance improvements on several evaluation metrics.
ISBN:	9783030050894, 3030050890
ISSN:	0302-9743, 1611-3349
DOI:	10.1007/978-3-030-05090-0_29
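
The pattern set-based representation described in the summary can be illustrated with a minimal sketch. This is not the authors' algorithm: the summary does not specify the mining procedure, the support threshold, or the pattern size, so the function names `mine_frequent_patterns` and `to_pattern_sets`, the `min_support` and `max_size` parameters, and the toy corpus are all assumptions made for illustration. The sketch counts word sets that co-occur in at least `min_support` documents and then represents each short text by the frequent patterns it contains.

```python
from collections import Counter
from itertools import combinations


def mine_frequent_patterns(docs, min_support=2, max_size=2):
    """Return word sets (as sorted tuples) found in at least min_support docs.

    Brute-force counting for illustration only; a real system would likely
    use a dedicated frequent pattern mining algorithm.
    """
    counts = Counter()
    for tokens in docs:
        vocab = sorted(set(tokens))  # each pattern counted once per document
        for size in range(1, max_size + 1):
            counts.update(combinations(vocab, size))
    return {pattern for pattern, n in counts.items() if n >= min_support}


def to_pattern_sets(docs, patterns):
    """Represent each document by the frequent patterns it fully contains."""
    representation = []
    for tokens in docs:
        words = set(tokens)
        representation.append(sorted(p for p in patterns if words.issuperset(p)))
    return representation


# Hypothetical toy corpus of tokenized short texts.
docs = [
    ["topic", "model", "short", "text"],
    ["short", "text", "sparse"],
    ["topic", "model", "gibbs"],
]

patterns = mine_frequent_patterns(docs, min_support=2)
pstr = to_pattern_sets(docs, patterns)

for doc, units in zip(docs, pstr):
    print(doc, "->", units)
```

In this sketch, rare words such as `sparse` and `gibbs` are dropped, while co-occurring pairs such as `("short", "text")` become feature units alongside the frequent single words. How LDA-PSTR consumes such units during Gibbs sampling is described in the chapter itself and is not reproduced here.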