Stochastic Variational Inference-Based Parallel and Online Supervised Topic Model for Large-Scale Text Processing

Bibliographic Details
Published in	Journal of computer science and technology Vol. 33; no. 5; pp. 1007 - 1022
Main Authors	Li, Yang; Song, Wen-Zhuo; Yang, Bo
Format	Journal Article
Language	English
Published	New York: Springer US (Springer; Springer Nature B.V.), 01.09.2018. Author affiliations: College of Computer Science and Technology, Jilin University, Changchun 130012, China; Key Laboratory of Symbolic Computation and Knowledge Engineering, Ministry of Education, Jilin University, Changchun 130012, China; Aviation University of Air Force, Changchun 130062, China
Subjects	Artificial Intelligence; Big Data; Cloud computing; Computational linguistics; Computer Science; Computer vision; Data management; Data mining; Data processing; Data Structures and Information Theory; Datasets; Dirichlet problem; Inference; Information Systems Applications (incl. Internet); Language processing; Machine vision; Natural language interfaces; Natural language processing; Online instruction; Regular Paper; Scale (ratio); Software Engineering; Theory of Computation; Training; topic modeling; cloud computing; stochastic variational inference; large-scale text classification; online learning
ISSN	1000-9000 1860-4749
DOI	10.1007/s11390-018-1871-y

Summary:	Topic modeling is a mainstream and effective technique for handling text data, with wide applications in text analysis, natural language processing, personalized recommendation, computer vision, etc. Among known topic models, supervised Latent Dirichlet Allocation (sLDA) is acknowledged as a popular and competitive supervised topic model. However, as datasets grow in scale, sLDA becomes increasingly inefficient and time-consuming, which restricts its range of applications. To address this problem, this study proposes a parallel online sLDA, named PO-sLDA (Parallel and Online sLDA). It uses stochastic variational inference as the learning method to make training faster and more efficient, and it adopts a parallel computing mechanism implemented via the MapReduce framework to support cloud computing and big data processing. The online training capability of PO-sLDA broadens the scope of the approach, making it suitable for real-life applications with high real-time demands. Validation on two datasets of different sizes shows that the proposed approach achieves accuracy comparable to that of sLDA and efficiently accelerates the training procedure. Moreover, its good convergence and online training capability make it attractive for large-scale text data analysis and processing.