A Bootstrapping Method for Open Geo-entity Relation Extraction


Bibliographic Details
Published in Acta Geodaetica et Cartographica Sinica (测绘学报) Vol. 45; no. 5; pp. 616-622
Main Authors Yu Li (余丽), Lu Feng (陆锋), Liu Xiliang (刘希亮)
Format Journal Article
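Language Chinese
Published 2016
Author affiliations State Key Laboratory of Resources and Environmental Information System, Institute of Geographic Sciences and Natural Resources Research, Chinese Academy of Sciences, Beijing 100101; University of Chinese Academy of Sciences, Beijing 100101; Jiangsu Center for Collaborative Innovation in Geographical Information Resource Development and Application, Nanjing, Jiangsu 210023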
ISSN 1001-1595

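Summary: Extracting spatial and semantic relations between geo-entities from Web text demands high timeliness and strong robustness. This paper proposes an automatic method for extracting open geo-entity relations: a bootstrapping technique gathers statistics on the part-of-speech, position, and distance features of terms to compute the weight of each term in its context; the keywords that describe geo-entity relations are selected on that basis and finally organized into structured instances. An experiment was conducted with Baidu Baike and Stanford CoreNLP. The results show that the method automatically mines some of the lexical features of natural language, requires neither domain expert knowledge nor large-scale annotated corpora, and is suited to information extraction tasks with unknown relation types; compared with the classical frequency-based methods Frequency, TF-IDF, and PPMI, precision and recall improve by about 5% and 23%, respectively.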
Bibliography: 11-2089/P
Extracting spatial relations and semantic relations between two geo-entities from Web texts requires robust and effective solutions. This paper puts forward a novel approach. First, the characteristics of terms (part of speech, position, and distance) are analyzed by means of bootstrapping. Second, the weight of each term is calculated and the keyword is picked out as the clue to the geo-entity relation. Third, the geo-entity pairs and their keywords are organized into structured information. Finally, an experiment is conducted with Baidu Baike and Stanford CoreNLP. The study shows that the presented method can automatically explore part of the lexical features and find additional relational terms, requiring neither domain expert knowledge nor large-scale annotated corpora. Moreover, compared with three classical frequency statistics methods, namely Frequency, TF-IDF, and PPMI, precision and recall are improved by about 5% and 23%, respectively.
Keywords: text mining; geo-entities; relation extraction; quantitative evaluation; bootstrapping
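
The abstract names PPMI as one of the classical frequency-statistics baselines against which the method is compared. As a rough illustration of what such a baseline might look like, the sketch below scores context words for each geo-entity pair with positive pointwise mutual information and picks the highest-scoring word as the relation keyword. It is not the authors' implementation: the toy `contexts` data and the function names `ppmi_scores` and `best_keyword` are hypothetical, and the record does not say how the paper tokenizes text or defines co-occurrence.

```python
import math
from collections import Counter

# Hypothetical toy data (not from the paper): each item pairs a geo-entity
# pair with the tokenized words found in its sentence context.
contexts = [
    (("Beijing", "China"), ["capital", "of"]),
    (("Nanjing", "Jiangsu"), ["capital", "of"]),
    (("Yangtze", "Nanjing"), ["flows", "through"]),
    (("Beijing", "China"), ["located", "in"]),
]


def ppmi_scores(contexts):
    """Return {(pair, word): PPMI} computed from pair-word co-occurrence counts."""
    joint = Counter()       # count of (pair, word) co-occurrences
    pair_count = Counter()  # marginal count per entity pair
    word_count = Counter()  # marginal count per word
    for pair, words in contexts:
        for w in words:
            joint[(pair, w)] += 1
            pair_count[pair] += 1
            word_count[w] += 1
    total = sum(joint.values())
    scores = {}
    for (pair, w), n in joint.items():
        # PMI = log2( P(pair, w) / (P(pair) * P(w)) ), clipped at zero for PPMI.
        pmi = math.log2(n * total / (pair_count[pair] * word_count[w]))
        scores[(pair, w)] = max(0.0, pmi)
    return scores


def best_keyword(scores, pair):
    """Pick the context word with the highest PPMI for one entity pair."""
    candidates = {w: s for (p, w), s in scores.items() if p == pair}
    return max(candidates, key=candidates.get)


scores = ppmi_scores(contexts)
keyword = best_keyword(scores, ("Yangtze", "Nanjing"))
```

A purely frequency-based score like this treats every context word alike regardless of its part of speech, position, or distance from the entities, which are the features the abstract says the bootstrapping method uses to weight terms.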