Probabilistic Word Selection via Topic Modeling

Bibliographic Details
Published in: IEEE Transactions on Knowledge and Data Engineering, Vol. 27, No. 6, pp. 1643-1655
Main Authors: Zhuang, Yueting; Gao, Haidong; Wu, Fei; Tang, Siliang; Zhang, Yin; Zhang, Zhongfei
Format: Journal Article
Language: English
Published: New York: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 01.06.2015

Summary: We propose selective supervised Latent Dirichlet Allocation (ssLDA) to boost the prediction performance of the widely studied supervised probabilistic topic models. We introduce a Bernoulli distribution for each word in a given document to select this word as a strongly or weakly discriminative one with respect to its assigned topic. The Bernoulli distribution is parameterized by the discrimination power of the word for its assigned topic. As a result, the document is represented as a "bag-of-selective-words" instead of the probabilistic "bag-of-topics" of the topic modeling domain or the flat "bag-of-words" of the traditional natural language processing domain, forming a new perspective. Inheriting the general framework of supervised LDA (sLDA), ssLDA can also predict many types of responses specified by a Gaussian Linear Model (GLM). Focusing on the use of this word selection mechanism for single-label document classification in this paper, we conduct variational inference to approximate the intractable posterior and derive a maximum-likelihood estimation of the parameters in ssLDA. Experiments on textual documents show that ssLDA not only performs competitively against state-of-the-art classification approaches based on both the flat "bag-of-words" and the probabilistic "bag-of-topics" representations in terms of classification performance, but also is able to discover the discrimination power of the words specified in the topics, consistent with our rational knowledge.
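
Illustration: the word selection mechanism described in the summary can be sketched as follows. Each word token carries a Bernoulli indicator whose success probability is the word's discrimination power under its assigned topic, and only the tokens selected as strongly discriminative enter the resulting "bag-of-selective-words". The sketch below is a minimal, hypothetical Python example for exposition only; the names (DISCRIMINATION_POWER, select_words) and all sizes and parameters are assumptions, not the authors' implementation or the paper's inference procedure.

import numpy as np

rng = np.random.default_rng(0)

V, K = 1000, 20  # assumed vocabulary size and number of topics

# Per-word, per-topic "discrimination power" in [0, 1]; in ssLDA this is a
# learned quantity, here it is drawn at random purely for illustration.
DISCRIMINATION_POWER = rng.uniform(size=(V, K))

def select_words(word_ids, topic_assignments):
    # For each token, draw a Bernoulli indicator whose success probability is
    # the word's discrimination power under its assigned topic, and keep only
    # the selected (strongly discriminative) tokens.
    p = DISCRIMINATION_POWER[word_ids, topic_assignments]
    selected = rng.binomial(1, p).astype(bool)
    return word_ids[selected]

# Toy document: 50 word tokens with topic assignments already sampled.
doc_words = rng.integers(0, V, size=50)
doc_topics = rng.integers(0, K, size=50)

selective_bag = select_words(doc_words, doc_topics)
print(f"kept {selective_bag.size} of {doc_words.size} tokens as the bag-of-selective-words")

Per the summary, prediction in ssLDA then proceeds from this selective representation rather than from the flat bag-of-words.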
ISSN: 1041-4347, 1558-2191
DOI: 10.1109/TKDE.2014.2377727