DGeoSegmenter: A dictionary-based Chinese word segmenter for the geoscience domain

Bibliographic Details
Published in	Computers & geosciences Vol. 121; pp. 1-11
Main Authors	Qiu, Qinjun; Xie, Zhong; Wu, Liang; Li, Wenjia
Format	Journal Article
Language	English
Published	Elsevier Ltd 01.12.2018
Subjects	algorithms; Chinese word segmentation; computers; data collection; Geoscience reports; information retrieval; Natural language processing; Unigram language model

More Information
Summary:	Larger numbers of geoscience reports create challenges and opportunities for data analysis and knowledge discovery. Because written Chinese has no spaces between words, text must first be segmented into semantically and syntactically meaningful words, a task known as Chinese word segmentation (CWS). CWS is a crucial first step toward natural language processing (NLP). Although the available generic segmenters can process geoscience reports, their performance degrades dramatically because they lack sufficient domain knowledge. Developing effective segmenters therefore remains a challenge, which inspired us to build a segmenter for the geoscience subject domain. By integrating a unigram language model with deep learning, we propose a weakly supervised model, DGeoSegmenter, which is trained only on words and their corresponding frequencies. Training sentences are formed by randomly extracting words and combining them, and a bi-directional long short-term memory (Bi-LSTM) model is trained on these sentences. Our evaluation results on geoscience reports and benchmark datasets demonstrate the effectiveness of the method: DGeoSegmenter can segment both geoscience terms and general terms. Because the proposed algorithm requires neither manually labeled datasets nor hand-crafted rules, it can easily be applied in various areas, including information retrieval and text mining.
• First segmentation of geoscience texts into words based on deep learning and a unigram language model.
• Training sentences are combined from words and their frequencies.
• The methodology can easily be scaled or transferred to other subject domains.
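
The training-data idea described in the summary (sampling dictionary words according to their frequencies and combining them into sentences) can be sketched roughly as below. This is only an illustration under stated assumptions, not the authors' implementation: the toy dictionary `WORD_FREQS`, the helper names `tag_word` and `make_pseudo_sentence`, and the B/M/E/S character tagging scheme are all assumptions that the record does not specify. The Bi-LSTM tagger that would be trained on such pairs is not shown.

```python
import random

# Hypothetical toy dictionary (word -> frequency); values are made up
# for illustration and do not come from the paper.
WORD_FREQS = {
    "花岗岩": 50,   # granite
    "断层": 30,     # fault
    "的": 200,      # (particle)
    "发育": 40,     # developed
    "金": 10,       # gold
}


def tag_word(word):
    """Return one tag per character: S for a single-character word,
    otherwise B (begin), M (middle, zero or more), E (end).
    The B/M/E/S scheme is an assumption, not stated in the summary."""
    if len(word) == 1:
        return ["S"]
    return ["B"] + ["M"] * (len(word) - 2) + ["E"]


def make_pseudo_sentence(word_freqs, n_words, rng):
    """Sample n_words words with probability proportional to frequency
    (a unigram model) and concatenate them into one pseudo-sentence.
    Returns the sampled words, the character sequence, and the tags."""
    words = list(word_freqs)
    weights = [word_freqs[w] for w in words]
    sampled = rng.choices(words, weights=weights, k=n_words)
    chars = [c for w in sampled for c in w]
    tags = [t for w in sampled for t in tag_word(w)]
    return sampled, chars, tags


# Usage: build one (characters, tags) training pair.
rng = random.Random(0)
sampled, chars, tags = make_pseudo_sentence(WORD_FREQS, 5, rng)
```

Because the word boundaries are known at sampling time, each pseudo-sentence comes with its labels for free, which is consistent with the summary's claim that no manually labeled data is needed.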
ISSN:	0098-3004
DOI:	10.1016/j.cageo.2018.08.006