Probabilistic Word Selection via Topic Modeling

Bibliographic Details
Published in: IEEE Transactions on Knowledge and Data Engineering, Vol. 27, No. 6, pp. 1643-1655
Main Authors: Zhuang, Yueting; Gao, Haidong; Wu, Fei; Tang, Siliang; Zhang, Yin; Zhang, Zhongfei
Format: Journal Article
Language: English
Published: New York: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 01.06.2015

Summary: We propose selective supervised Latent Dirichlet Allocation (ssLDA) to boost the prediction performance of the widely studied supervised probabilistic topic models. We introduce a Bernoulli distribution for each word in a given document to select this word as a strongly or weakly discriminative one with respect to its assigned topic. The Bernoulli distribution is parameterized by the discrimination power of the word for its assigned topic. As a result, the document is represented as a "bag-of-selective-words" instead of the probabilistic "bag-of-topics" of the topic modeling domain or the flat "bag-of-words" of the traditional natural language processing domain, forming a new perspective. Inheriting the general framework of supervised LDA (sLDA), ssLDA can also predict many types of responses specified by a Gaussian Linear Model (GLM). Focusing on the use of this word selection mechanism for single-label document classification in this paper, we conduct variational inference to approximate the intractable posterior and derive a maximum-likelihood estimation of the parameters in ssLDA. Experiments on textual documents show that ssLDA not only performs competitively against state-of-the-art classification approaches based on both the flat "bag-of-words" and the probabilistic "bag-of-topics" representations in terms of classification performance, but also is able to discover the discrimination power of the words specified in the topics, consistent with our rational knowledge.
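
Illustration: the word selection mechanism described in the summary can be sketched as follows. Each word token carries a Bernoulli indicator whose success probability is the word's discrimination power under its assigned topic, and only the tokens selected as strongly discriminative enter the resulting "bag-of-selective-words". The sketch below is a minimal, hypothetical Python example for exposition only; the names (DISCRIMINATION_POWER, select_words) and all sizes and parameters are assumptions, not the authors' implementation or the paper's inference procedure.

import numpy as np

rng = np.random.default_rng(0)

V, K = 1000, 20  # assumed vocabulary size and number of topics

# Per-word, per-topic "discrimination power" in [0, 1]; in ssLDA this is a
# learned quantity, here it is drawn at random purely for illustration.
DISCRIMINATION_POWER = rng.uniform(size=(V, K))

def select_words(word_ids, topic_assignments):
    # For each token, draw a Bernoulli indicator whose success probability is
    # the word's discrimination power under its assigned topic, and keep only
    # the selected (strongly discriminative) tokens.
    p = DISCRIMINATION_POWER[word_ids, topic_assignments]
    selected = rng.binomial(1, p).astype(bool)
    return word_ids[selected]

# Toy document: 50 word tokens with topic assignments already sampled.
doc_words = rng.integers(0, V, size=50)
doc_topics = rng.integers(0, K, size=50)

selective_bag = select_words(doc_words, doc_topics)
print(f"kept {selective_bag.size} of {doc_words.size} tokens as the bag-of-selective-words")

Per the summary, prediction in ssLDA then proceeds from this selective representation rather than from the flat bag-of-words.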
ISSN: 1041-4347, 1558-2191
DOI: 10.1109/TKDE.2014.2377727