Biological gene extraction path based on knowledge graph and natural language processing

Bibliographic Details
Published in	Frontiers in genetics Vol. 13; p. 1086379
Main Authors	Zhang, Canlin, Cao, Xiaopei
Format	Journal Article
Language	English
Published	Switzerland: Frontiers Media S.A, 13.01.2023
Subjects	biological gene; biological gene extraction; Genetics; knowledge graph; natural language processing; path research
Abstract	Advances in science and technology have improved the prospects for maintaining health and for preventing and controlling disease, and successive generations of bioinformatics technology have transformed biological gene research. A long-standing problem in genetic research nevertheless remains: how to find the most relevant genes among a large number of sample genes, so that unnecessary experiments and research costs are reduced. Studying the extraction path of biological genes can help researchers select the genes most worth investigating and avoid wasted time and effort. To address this problem, this paper used the Bhattacharyya distance index and the Gini index to screen sample genes when extracting the characteristic genes of breast cancer. From the 49 selected public genes, 6 principal components were extracted by principal component analysis (PCA), and the results were then tested experimentally. When the number of characteristic genes was set to 5, the gene recognition rate reached its maximum of 90.31%, which met the experimental requirements. The experiments also showed that the proposed characteristic gene extraction method removed 99.75% of redundant genes, which can greatly reduce the time and cost of research.
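The abstract names the screening and reduction steps without giving formulas. The sketch below is one plausible reading, not the authors' implementation: it ranks genes by a per-gene Bhattacharyya distance between two classes (assuming each gene is univariate Gaussian within a class), keeps the 49 top-ranked genes, and projects them onto 6 principal components. The synthetic data, the function names, and the two-class setup are all assumptions made for illustration; the Gini index step is omitted.

```python
import numpy as np

def bhattacharyya_scores(X, y):
    """Per-gene Bhattacharyya distance between class 0 and class 1.

    Assumes (for this sketch only) that each gene is univariate
    Gaussian within each class.
    """
    a, b = X[y == 0], X[y == 1]
    mu_a, mu_b = a.mean(axis=0), b.mean(axis=0)
    # Small constant guards against zero variance.
    var_a = a.var(axis=0, ddof=1) + 1e-12
    var_b = b.var(axis=0, ddof=1) + 1e-12
    mean_term = 0.25 * (mu_a - mu_b) ** 2 / (var_a + var_b)
    var_term = 0.5 * np.log((var_a + var_b) / (2.0 * np.sqrt(var_a * var_b)))
    return mean_term + var_term

def pca_project(X, n_components):
    """Project centered data onto its leading principal components via SVD."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

# Hypothetical expression matrix: 60 samples x 200 genes, two classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 200))
y = np.array([0] * 30 + [1] * 30)
X[y == 1, :5] += 3.0  # make the first five genes informative

scores = bhattacharyya_scores(X, y)
top = np.argsort(scores)[::-1][:49]   # keep the 49 best-separated genes
Z = pca_project(X[:, top], 6)         # reduce them to 6 components
```

In this toy setup the five shifted genes rank first, and `Z` holds one 6-dimensional feature vector per sample that a downstream classifier could use.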
AuthorAffiliation	1 Sorenson Communications, Salt Lake City, UT, United States; 2 College of Creative Culture and Communication, Zhejiang Normal University, Jinhua, Zhejiang, China
BackLink	https://www.ncbi.nlm.nih.gov/pubmed/36712855 (View this record in MEDLINE/PubMed)
Copyright	Copyright © 2023 Zhang and Cao.
DOI	10.3389/fgene.2022.1086379
Discipline	Biology
DocumentTitleAlternate	Zhang and Cao
EISSN	1664-8021
ISSN	1664-8021
IsDoiOpenAccess	true
IsOpenAccess	true
IsPeerReviewed	true
IsScholarly	true
License	Copyright © 2023 Zhang and Cao. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
Notes	Edited by: Deepak Kumar Jain, Chongqing University of Posts and Telecommunications, China. This article was submitted to Computational Genomics, a section of the journal Frontiers in Genetics. Reviewed by: Lei Shi, Luliang University, China; Fenghui Dong, Nanjing Forestry University, China; Tiefeng Wu, Qingdao University of Technology, China.
OpenAccessLink	http://journals.scholarsportal.info/openUrl.xqy?doi=10.3389/fgene.2022.1086379
PMID	36712855
PublicationDate	2023-01-13
PublicationTitle	Frontiers in genetics
PublicationTitleAlternate	Front Genet
URI	https://www.ncbi.nlm.nih.gov/pubmed/36712855 https://www.proquest.com/docview/2771084272 https://pubmed.ncbi.nlm.nih.gov/PMC9880067 https://doaj.org/article/0c056c1594424f76ae4cf0aca1a65cf4