LDA-PSTR: A Topic Modeling Method for Short Text

Bibliographic Details
Published in Advanced Data Mining and Applications Vol. 11323; pp. 339-352
Main Authors Zhou, Kai; Yang, Qun
Format Book Chapter
Language English
Published Switzerland Springer International Publishing AG 2018
Springer International Publishing
Series Lecture Notes in Computer Science
Summary: Topic detection in short text has become an important task for content analysis applications. Topic modeling is an effective way to discover topics by finding document-level word co-occurrence patterns. Most conventional topic models are based on a bag-of-words representation, in which the context information of words is ignored. Moreover, when such models are applied directly to short text, co-occurrence patterns are scarce because unigram representations are sparse. Existing work either expands the data using external knowledge resources or simply aggregates semantically related short texts. These methods generally produce low-quality topic representations or suffer from poor semantic correlation between different data resources. In this paper, we propose a different method that is computationally efficient and effective. Our method applies frequent pattern mining to uncover statistically significant patterns that explicitly capture semantic associations and co-occurrences among corpus-level words. We use these frequent patterns as feature units to represent texts, referred to as pattern set-based text representation (PSTR). In addition, to represent text more precisely, we propose a new probabilistic topic model called LDA-PSTR, together with an improved Gibbs sampling algorithm for it. Experiments on different corpora show that this approach can discover more prominent and coherent topics and achieves significant performance improvements on several evaluation metrics.
ISBN: 9783030050894, 3030050890
ISSN: 0302-9743, 1611-3349
DOI: 10.1007/978-3-030-05090-0_29
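
Illustration: the summary describes representing each short text by the frequent word patterns it contains. The sketch below is one minimal way to picture that idea, not the authors' algorithm. The summary does not name a mining algorithm or any thresholds, so the brute-force counting, the function names `mine_frequent_patterns` and `to_pattern_representation`, the parameters `min_support` and `max_size`, and the toy corpus are all assumptions made for illustration. It uses a plain document-frequency cutoff and does not test statistical significance.

```python
from collections import Counter
from itertools import combinations


def mine_frequent_patterns(docs, min_support=2, max_size=2):
    """Count word sets of up to max_size words and keep the frequent ones.

    Hypothetical helper: a brute-force stand-in for a real frequent
    pattern miner. Support is the number of documents containing the set.
    """
    counts = Counter()
    for tokens in docs:
        vocab = sorted(set(tokens))  # count each pattern once per document
        for size in range(1, max_size + 1):
            counts.update(combinations(vocab, size))
    return {pattern: n for pattern, n in counts.items() if n >= min_support}


def to_pattern_representation(tokens, patterns):
    """Represent one text by the frequent patterns it fully contains."""
    words = set(tokens)
    return sorted(p for p in patterns if words.issuperset(p))


# Toy corpus (assumed data, not taken from the paper).
docs = [
    "topic model short text".split(),
    "short text topic detection".split(),
    "frequent pattern mining".split(),
    "pattern mining short text".split(),
]

patterns = mine_frequent_patterns(docs, min_support=2, max_size=2)
representations = [to_pattern_representation(d, patterns) for d in docs]

for doc, rep in zip(docs, representations):
    print(" ".join(doc), "->", rep)
```

In this sketch a rare word such as "model" drops out of the representation, while a co-occurring pair such as ("short", "text") becomes a feature unit of its own. A topic model would then be fit over these pattern units rather than over single words; that step is not shown here because the summary gives no details of LDA-PSTR's generative process.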