Chinese New Word Identification: A Latent Discriminative Model with Global Features

Bibliographic Details
Published in	Journal of Computer Science and Technology, Vol. 26, no. 1, pp. 14-24
Authors	Sun Xiao (孙晓); Huang Degen (黄德根); Song Haiyu (宋海玉); Ren Fuji (任福继)
Format	Journal Article
Language	English
Published	Boston: Springer US, 2011; Springer Nature B.V.
Subjects	Analysis; Artificial Intelligence; China; Computer Science; Conditional random fields; Data Structures and Information Theory; Discriminant analysis; Discriminative model; Global features; Global fragment features; Hidden semi-CRF; Information explosion; Information processing; Information Systems Applications (incl. Internet); Internet; Language; Mathematical analysis; Mathematical models; Model testing; Morphology; Natural language (computers); Natural language processing; New word identification; New words POS tagging; Regular Paper; Software Engineering; Studies; Texts; Theory of Computation; Training; Tuning; Words (language)
ISSN	1000-9000 1860-4749
DOI	10.1007/s11390-011-9411-z

Summary:	New words pose a particular problem in Chinese natural language processing. With the rapid growth of the Internet and the resulting information explosion, no system lexicon for Chinese natural language processing applications can be complete, because words absent from dictionaries are constantly being created. New word identification and POS tagging are usually performed as separate steps, so lexical features cannot be fully exploited. A latent discriminative model, which combines the strengths of the Latent Dynamic Conditional Random Field (LDCRF) and the semi-CRF, is proposed to detect new words of any type together with their POS tags, simultaneously and directly from Chinese text that has not been pre-segmented. Unlike a plain semi-CRF, the proposed model uses the LDCRF to generate candidate entities, which speeds up training and reduces computational cost. The complexity of the proposed hidden semi-CRF can be further adjusted by tuning the number of hidden variables and the number of candidate entities taken from the N-best outputs of the LDCRF model. A new-word-generating framework is proposed for model training and testing, under which the definitions and distributions of new words conform to those in real text. A global feature, called "Global Fragment Features", is adopted for new word identification. The model was tested on the corpus from SIGHAN-6. Experimental results show that the proposed method can detect even low-frequency new words together with their POS tags with satisfactory results, and that the proposed model performs competitively with state-of-the-art models.
Bibliography:	new word identification; new words POS tagging; conditional random fields; hidden semi-CRF; global fragment features; 11-2296/TP; TP391; P544.4