A hybrid method of combination probability and machine learning for Chinese geological text segmentation
Published in | Computers & geosciences Vol. 183; p. 105512 |
---|---|
Main Authors | Guo, Zhiyong; Deng, Jiqiu; Zou, Yu; Tang, Yu |
Format | Journal Article |
Language | English |
Published | 01.01.2024 |
Subjects | China; probability; surveys |
Abstract | Chinese geological text segmentation suffers from incomplete core dictionaries, limited training corpora, and low precision. To address these issues, this paper proposes a knowledge- and data-driven word segmentation method that combines combination probability with machine learning. Mathematical features of terms in Chinese geological text are extracted to construct a Term Combination Probability Model (TCPM) for Chinese word combinations, integrating the combination features of geological terms with Chinese writing styles under zero-sample conditions. The TCPM extracts geological terms with strong combination characteristics to form a user-defined dictionary, and a general-domain word segmentation method using this dictionary then produces a geological corpus. After a small amount of manual review and optimization, the corpus is used to train a BiLSTM-CRF model that segments Chinese geological text. Tested on a set of regional geological survey reports from Henan Province, the method achieves a precision of 92.65%, a recall of 92.53%, and an F1-score of 92.59%. The results show that combining the inherent knowledge features of geological text with machine learning helps expand the core dictionary for Chinese geological text segmentation under zero-sample conditions, and improves segmentation precision over a general word segmentation method or a machine learning method used alone. |
---|---|
ArticleNumber | 105512 |
Author | Guo, Zhiyong (ORCID: 0000-0002-9213-7842); Deng, Jiqiu; Zou, Yu; Tang, Yu |
ContentType | Journal Article |
DOI | 10.1016/j.cageo.2023.105512 |
Discipline | Geology |
GeographicLocations | China |
ISSN | 0098-3004 |
IsPeerReviewed | true |
IsScholarly | true |
Language | English |
ORCID | 0000-0002-9213-7842 |
PublicationDate | 2024-01-01 |
PublicationTitle | Computers & geosciences |
PublicationYear | 2024 |
StartPage | 105512 |
SubjectTerms | China; probability; surveys |
Title | A hybrid method of combination probability and machine learning for Chinese geological text segmentation |
URI | https://www.proquest.com/docview/3040377398 |
Volume | 183 |
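
The abstract reports precision, recall, and F1-score for the segmenter but does not say how they were computed. The following is a minimal sketch, assuming the common word-level convention in which a predicted word counts as correct only when its character span matches a gold-standard word exactly. The function names and the example sentence are illustrative assumptions and do not come from the paper.

```python
def word_spans(words):
    """Map a segmented sentence (list of words) to a set of (start, end) character spans."""
    spans, start = set(), 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def prf(gold_words, pred_words):
    """Word-level precision, recall, and F1 from exact span matches (assumed convention)."""
    gold, pred = word_spans(gold_words), word_spans(pred_words)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: the gold segmentation keeps the term "花岗岩" (granite) whole,
# while the predicted segmentation splits it into two pieces.
gold = ["花岗岩", "侵入", "体"]
pred = ["花岗", "岩", "侵入", "体"]
p, r, f = prf(gold, pred)

# Consistency check on the figures reported in the abstract.
reported_f1 = round(f1_from(92.65, 92.53), 2)
```

Under this convention, the reported F1-score of 92.59% is consistent with the harmonic mean of the reported precision (92.65%) and recall (92.53%).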