A Uyghur Person Name Recognition Method Combining Statistics and Rules

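Published in: Acta Automatica Sinica (自动化学报), Vol. 43, no. 4, pp. 653-664
Authors: TASHPOLAT Nizamidin (塔什甫拉提·尼扎木丁), WANG Kun (汪昆), ASKAR Hamdulla (艾斯卡尔·艾木都拉), PALIDAN Tuerxun (帕力旦·吐尔逊)
Format: Journal Article
Language: Chinese
Published: 2017. Affiliations: Institute of Information Science and Engineering, Xinjiang University, Urumqi 830046; National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing 100190; School of Software, Xinjiang University, Urumqi 830046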
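Summary: Named entity recognition (NER) is an important task in natural language processing (NLP), and person names are among the main entities to be recognized. Starting from the agglutinative nature of Uyghur, this paper splits Uyghur words from three perspectives (stem, syllable, and character string) to obtain smaller linguistic units, and adds these units as features to a conditional random field (CRF). This markedly alleviates data sparsity and outperforms person name recognition that uses whole words as the basic unit. In addition, based on the characteristics of Han Chinese names as they appear in Uyghur text, the paper proposes a rule-based method for recognizing Han names in Uyghur, and combining the statistical and rule-based methods further improves recognition precision. Experimental results show that the method achieves precision, recall, and F1 of 87.47%, 89.12%, and 88.29%, respectively.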
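The sub-word idea in the summary can be illustrated with a minimal Python sketch. It is not the authors' implementation: the record does not describe their stem or syllable segmenters, so leading and trailing character strings stand in for the smaller units, and the function names, feature names, and `max_n` parameter are all hypothetical. The example token is an arbitrary Latin-script string chosen only to show the output shape.

```python
from typing import Dict, List


def subword_features(word: str, max_n: int = 3) -> Dict[str, str]:
    """Describe one token by its whole form plus leading/trailing strings.

    Assumption: character prefixes and suffixes are a crude stand-in for
    the stem, syllable, and string units mentioned in the summary.
    """
    feats = {"word": word.lower()}
    for n in range(1, max_n + 1):
        if len(word) >= n:
            feats[f"prefix{n}"] = word[:n].lower()
            feats[f"suffix{n}"] = word[-n:].lower()
    return feats


def sentence_features(tokens: List[str]) -> List[Dict[str, str]]:
    """Add neighbouring-word context so a sequence model sees local order."""
    rows = []
    for i, tok in enumerate(tokens):
        row = subword_features(tok)
        row["prev_word"] = tokens[i - 1].lower() if i > 0 else "<BOS>"
        row["next_word"] = tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>"
        rows.append(row)
    return rows


example = subword_features("Kitablar")
print(example["prefix3"], example["suffix3"])
```

Per-token dictionaries of this shape are a common input format for CRF toolkits. Because rare inflected word forms share prefixes and suffixes with frequent ones, such features give the model evidence even for words it never saw in training, which is the data-sparsity effect the summary refers to.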
English abstract: Named entity recognition (NER) is an important subtask of natural language processing, in which person names are one of the major targets. Based on the agglutinative characteristics of the Uyghur language, we split each Uyghur word into units at different levels, such as syllable, suffix, and stem, so as to significantly reduce the data sparsity problem. Since Han names account for most of the remaining errors of the CRF (conditional random field)-based approach, we also propose a rule-based post-processing approach for Han name recognition in Uyghur. Experimental results show that this cascade approach achieves satisfactory performance, and that the recognition accuracy, recall rate, and F1 score are 87.47%, 89.12%, and 88.29%, respectively.
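As a quick consistency check on the reported figures, the short Python sketch below recomputes F1 as the harmonic mean of the two other scores. It assumes that the first reported figure (87.47%) is precision in the usual NER sense; the function and variable names are illustrative only.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both on the same scale)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Figures as reported in the abstract, in percent.
reported_precision = 87.47
reported_recall = 89.12

f1 = f1_score(reported_precision, reported_recall)
print(round(f1, 2))
```

Under that assumption the recomputed value rounds to the reported 88.29%, so the three figures are mutually consistent.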
TASHPOLAT Nizamidin1, WANG Kun2, ASKAR Hamdulla1, PALIDAN Tuerxun3 (1. Institute of Information Science and Engineering, Xinjiang University, Urumqi 830046; 2. National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing 100190; 3. Sch
ISSN: 0254-4156, 1874-1029
DOI:10.16383/j.aas.2017.c150769