DGeoSegmenter: A dictionary-based Chinese word segmenter for the geoscience domain

Bibliographic Details
Published in Computers & geosciences Vol. 121; pp. 1-11
Main Authors Qiu, Qinjun; Xie, Zhong; Wu, Liang; Li, Wenjia
Format Journal Article
Language English
Published Elsevier Ltd 01.12.2018
Summary: Growing numbers of geoscience reports create both challenges and opportunities for data analysis and knowledge discovery. Because written Chinese has no spaces between words, splitting text into semantically and syntactically meaningful words is a task in its own right, known as Chinese word segmentation (CWS). CWS is a crucial first step in natural language processing (NLP). Although the available generic segmenters can process geoscience reports, their performance degrades dramatically without sufficient domain knowledge. Hence, developing effective segmenters remains a challenge and requires more work. This inspired us to build a segmenter for the geoscience subject domain. By integrating the unigram language model and deep learning, we propose a weakly supervised model: DGeoSegmenter. DGeoSegmenter is trained with words and their corresponding frequencies. We built DGeoSegmenter using the bi-directional long short-term memory (Bi-LSTM) model; the method randomly extracts words and combines them into sentences. Our evaluation on geoscience reports and benchmark datasets demonstrates the effectiveness of our method: DGeoSegmenter can segment both geoscience terms and general terms. Since the proposed algorithm requires neither manually labeled datasets nor hand-crafted rules, it can easily be applied to various domains, including information retrieval and text mining.
• Segmenting geoscience texts into words based on deep learning and a unigram language model first.
• Combining words and frequencies into sentences.
• The methodology can easily be scaled/transferred to other subject domains.
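The summary describes training on sentences assembled from dictionary words and their frequencies. The following is a minimal sketch of that idea only, not the authors' implementation: it draws words in proportion to their frequency (a unigram model), concatenates them without spaces, and records the word boundaries as per-character labels that a sequence tagger such as a Bi-LSTM could be trained on. The BMES label scheme, the function names, and the toy dictionary with its counts are all assumptions made for illustration; none of them comes from the record above.

```python
import random


def bmes_tags(word):
    """Per-character tags for one word: B(egin), M(iddle), E(nd), S(ingle).

    The BMES scheme is an assumption; the summary does not name a label set.
    """
    if len(word) == 1:
        return ["S"]
    return ["B"] + ["M"] * (len(word) - 2) + ["E"]


def sample_sentence(freqs, n_words, rng):
    """Build one weakly labeled pseudo-sentence.

    Words are drawn independently, with probability proportional to their
    dictionary frequency, then concatenated into an unspaced character
    sequence with one boundary tag per character.
    """
    vocab = list(freqs)
    weights = [freqs[w] for w in vocab]
    words = rng.choices(vocab, weights=weights, k=n_words)
    chars = [c for w in words for c in w]
    tags = [t for w in words for t in bmes_tags(w)]
    return words, chars, tags


# Hypothetical toy dictionary: the words and counts are invented.
toy_freqs = {"花岗岩": 50, "断层": 30, "矿床": 20, "的": 100}

rng = random.Random(0)  # fixed seed so repeated runs give the same sample
words, chars, tags = sample_sentence(toy_freqs, 5, rng)
print("".join(chars))
print(" ".join(tags))
```

Each `(chars, tags)` pair is a training example that needs no manual annotation, since the boundaries are known by construction. Training the tagger itself is outside the scope of this sketch.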
ISSN: 0098-3004
DOI: 10.1016/j.cageo.2018.08.006