A hybrid method of combination probability and machine learning for Chinese geological text segmentation

Bibliographic Details
Published in Computers & geosciences Vol. 183; p. 105512
Main Authors Guo, Zhiyong, Deng, Jiqiu, Zou, Yu, Tang, Yu
Format Journal Article
Language English
Published 01.01.2024
Abstract To address incomplete coverage of core dictionaries, limited training corpora, and low precision in Chinese geological text segmentation, this paper proposes a knowledge- and data-driven word segmentation method that combines combination probability with machine learning. We extracted mathematical feature information from terms in Chinese geological text and, by integrating the combination features of geological terms with Chinese writing styles, constructed a Term Combination Probability Model (TCPM) for Chinese word combinations under zero-sample conditions. The TCPM was used to extract geological terms with high combination characteristics as a user-defined dictionary, and a geological corpus was then constructed by applying a general-domain word segmentation method with this dictionary. After a small amount of manual review and optimization, the corpus was used to train a BiLSTM-CRF model to segment Chinese geological text. The method was tested on a set of regional geological survey reports from Henan Province and achieved a precision, recall, and F1-score of 92.65%, 92.53%, and 92.59%, respectively. The experimental results demonstrate that this method, which combines the inherent knowledge features of geological text with machine learning, can help expand the core dictionary for Chinese geological text segmentation under zero-sample conditions, and can improve segmentation precision compared with using general word segmentation methods or machine learning methods alone.
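The reported scores can be read against the usual definitions of word-level segmentation metrics. The sketch below is illustrative only and is not the authors' evaluation code: it assumes a predicted word counts as correct when its character span exactly matches a span in the reference segmentation, and the function names and the example sentence are hypothetical. It also checks that the reported F1-score equals the harmonic mean of the reported precision and recall.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (works for fractions or percentages)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def to_spans(words):
    """Map a list of words to the set of (start, end) character spans they cover."""
    spans, start = set(), 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def segmentation_prf(gold_words, pred_words):
    """Word-level precision, recall, and F1 using exact span matching (an assumption)."""
    gold, pred = to_spans(gold_words), to_spans(pred_words)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall, f1_score(precision, recall)


# Consistency check on the figures quoted in the abstract (values in percent).
reported_f1 = round(f1_score(92.65, 92.53), 2)

# Hypothetical example: the reference splits the phrase into three words,
# while the prediction merges the last two into one.
p, r, f1 = segmentation_prf(["花岗岩", "侵入", "体"], ["花岗岩", "侵入体"])
print(reported_f1, p, r, f1)
```

Exact span matching is the common convention for segmentation scoring; if the paper uses a different matching rule, the per-sentence numbers would differ, but the harmonic-mean check on the reported figures would not.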
ArticleNumber 105512
Author Zou, Yu
Guo, Zhiyong
Tang, Yu
Deng, Jiqiu
Author_xml – sequence: 1
  givenname: Zhiyong
  orcidid: 0000-0002-9213-7842
  surname: Guo
  fullname: Guo, Zhiyong
– sequence: 2
  givenname: Jiqiu
  surname: Deng
  fullname: Deng, Jiqiu
– sequence: 3
  givenname: Yu
  surname: Zou
  fullname: Zou, Yu
– sequence: 4
  givenname: Yu
  surname: Tang
  fullname: Tang, Yu
ContentType Journal Article
DBID AAYXX
CITATION
7S9
L.6
DOI 10.1016/j.cageo.2023.105512
DatabaseName CrossRef
AGRICOLA
AGRICOLA - Academic
DatabaseTitle CrossRef
AGRICOLA
AGRICOLA - Academic
DatabaseTitleList AGRICOLA
DeliveryMethod fulltext_linktorsrc
Discipline Geology
ExternalDocumentID 10_1016_j_cageo_2023_105512
GeographicLocations China
GeographicLocations_xml – name: China
ISSN 0098-3004
IsPeerReviewed true
IsScholarly true
Language English
ORCID 0000-0002-9213-7842
PublicationCentury 2000
PublicationDate 2024-01-00
20240101
PublicationDateYYYYMMDD 2024-01-01
PublicationDate_xml – month: 01
  year: 2024
  text: 2024-01-00
PublicationDecade 2020
PublicationTitle Computers & geosciences
PublicationYear 2024
StartPage 105512
SubjectTerms China
probability
surveys
Title A hybrid method of combination probability and machine learning for Chinese geological text segmentation
URI https://www.proquest.com/docview/3040377398
Volume 183