CarD-T: Interpreting Carcinomic Lexicon via Transformers
Published in | medRxiv : the preprint server for health sciences |
---|---|
Main Authors | O'Neill, Jamey; Reddy, Gudur Ashrith; Dhillon, Nermeeta; Tripathi, Osika; Alexandrov, Ludmil; Katira, Parag |
Format | Journal Article |
Language | English |
Published | United States, 31.08.2024 |
Subjects | |
DOI | 10.1101/2024.08.13.24311948 |
Abstract | The identification and classification of carcinogens is critical in cancer epidemiology, and the rapidly growing biomedical literature calls for updated methodologies. Current systems, such as those run by the International Agency for Research on Cancer (IARC) and the National Toxicology Program (NTP), are challenged by manual vetting and by disparities in carcinogen classification driven by the volume of emerging data. To address these issues, we introduced the Carcinogen Detection via Transformers (CarD-T) framework, a text analytics approach that combines transformer-based machine learning with probabilistic statistical analysis to efficiently nominate carcinogens from scientific texts. CarD-T uses Named Entity Recognition (NER) trained on PubMed abstracts featuring known carcinogens from IARC groups and includes a context classifier to enhance accuracy and manage computational demands. Using this method, we analyzed journal publications from the last 25 years indexed with the carcinogenicity and carcinogenesis Medical Subject Headings (MeSH) terms and identified potential carcinogens. When trained on 60% of established carcinogens (IARC Group 1 and 2A), CarD-T correctly identifies all of the remaining Group 1 and 2A carcinogens in the analyzed text. In addition, CarD-T nominates roughly 1500 more entities as potential carcinogens, each with at least two publications citing evidence of carcinogenicity. A comparative assessment against the GPT-4 model shows higher recall (0.857 vs 0.705) and F1 score (0.875 vs 0.792) for CarD-T, with comparable precision (0.894 vs 0.903). Additionally, CarD-T highlights 554 entities with conflicting evidence of carcinogenicity. These are further analyzed using Bayesian temporal Probabilistic Carcinogenic Denomination (PCarD) to provide probabilistic evaluations of their carcinogenic status based on evolving evidence.
Our findings underscore that the CarD-T framework is not only robust and effective in identifying and nominating potential carcinogens within the vast biomedical literature but also efficient on consumer GPUs. This integration of advanced NLP capabilities with epidemiological analysis enhances the agility of public health responses to carcinogen identification, setting a new benchmark for automated, scalable toxicological investigations. |
---|---|
Author | O'Neill, Jamey; Katira, Parag; Tripathi, Osika; Alexandrov, Ludmil; Dhillon, Nermeeta; Reddy, Gudur Ashrith |
Author_xml | – sequence: 1 givenname: Jamey surname: O'Neill fullname: O'Neill, Jamey organization: Department of Bioengineering, University of California San Diego, La Jolla, CA, USA – sequence: 2 givenname: Gudur Ashrith surname: Reddy fullname: Reddy, Gudur Ashrith organization: Department of Bioengineering, University of California San Diego, La Jolla, CA, USA – sequence: 3 givenname: Nermeeta surname: Dhillon fullname: Dhillon, Nermeeta organization: Mechanical Engineering Department, San Diego State University, San Diego, CA, USA – sequence: 4 givenname: Osika surname: Tripathi fullname: Tripathi, Osika organization: Herbert Wertheim School of Public Health and Human Longevity Science, University of California San Diego, La Jolla, CA, USA – sequence: 5 givenname: Ludmil orcidid: 0000-0003-3596-4515 surname: Alexandrov fullname: Alexandrov, Ludmil organization: Sanford Stem Cell Institute, University of California San Diego, La Jolla, CA, USA – sequence: 6 givenname: Parag orcidid: 0000-0001-9873-5117 surname: Katira fullname: Katira, Parag organization: Computational Science Research Center, San Diego State University, San Diego, CA, USA |
BackLink | https://www.ncbi.nlm.nih.gov/pubmed/39185518$$D View this record in MEDLINE/PubMed |
ContentType | Journal Article |
DBID | NPM |
DOI | 10.1101/2024.08.13.24311948 |
DatabaseName | PubMed |
DatabaseTitle | PubMed |
DatabaseTitleList | PubMed |
Database_xml | – sequence: 1 dbid: NPM name: PubMed url: https://proxy.k.utb.cz/login?url=http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=PubMed sourceTypes: Index Database |
DeliveryMethod | no_fulltext_linktorsrc |
ExternalDocumentID | 39185518 |
Genre | Journal Article; Preprint |
GrantInformation_xml | – fundername: NIMHD NIH HHS grantid: U54 MD012397 – fundername: NIMHD NIH HHS grantid: S21 MD010690 – fundername: NCI NIH HHS grantid: U54 CA285117 |
GroupedDBID | NPM |
ID | FETCH-LOGICAL-m1368-8a220131d44d9aa69dfec31c1f48f68d1030790be164994ab2d580a5e383106b2 |
IngestDate | Sun Aug 17 02:24:25 EDT 2025 |
IsDoiOpenAccess | false |
IsOpenAccess | true |
IsPeerReviewed | false |
IsScholarly | false |
Keywords | Named Entity Recognition; Biomedical language models; Cancer epidemiology; Bayesian analysis; Carcinogen identification |
Language | English |
LinkModel | OpenURL |
MergedId | FETCHMERGED-LOGICAL-m1368-8a220131d44d9aa69dfec31c1f48f68d1030790be164994ab2d580a5e383106b2 |
ORCID | 0000-0003-3596-4515 0000-0001-9873-5117 |
OpenAccessLink | https://www.medrxiv.org/content/10.1101/2024.08.13.24311948 |
PMID | 39185518 |
ParticipantIDs | pubmed_primary_39185518 |
PublicationCentury | 2000 |
PublicationDate | 2024-Aug-31 |
PublicationDateYYYYMMDD | 2024-08-31 |
PublicationDate_xml | – month: 08 year: 2024 text: 2024-Aug-31 day: 31 |
PublicationDecade | 2020 |
PublicationPlace | United States |
PublicationPlace_xml | – name: United States |
PublicationTitle | medRxiv : the preprint server for health sciences |
PublicationTitleAlternate | medRxiv |
PublicationYear | 2024 |
Score | 1.883633 |
SecondaryResourceType | preprint |
Snippet | The identification and classification of carcinogens is critical in cancer epidemiology, necessitating updated methodologies to manage the burgeoning... |
SourceID | pubmed |
SourceType | Index Database |
Title | CarD-T: Interpreting Carcinomic Lexicon via Transformers |
URI | https://www.ncbi.nlm.nih.gov/pubmed/39185518 |
hasFullText | |
inHoldings | 1 |
isFullTextHit | |
isPrint | |
linkProvider | National Library of Medicine |
openUrl | ctx_ver=Z39.88-2004&ctx_enc=info%3Aofi%2Fenc%3AUTF-8&rfr_id=info%3Asid%2Fsummon.serialssolutions.com&rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Ajournal&rft.genre=article&rft.atitle=CarD-T%3A+Interpreting+Carcinomic+Lexicon+via+Transformers&rft.jtitle=medRxiv+%3A+the+preprint+server+for+health+sciences&rft.au=O%27Neill%2C+Jamey&rft.au=Reddy%2C+Gudur+Ashrith&rft.au=Dhillon%2C+Nermeeta&rft.au=Tripathi%2C+Osika&rft.date=2024-08-31&rft_id=info:doi/10.1101%2F2024.08.13.24311948&rft_id=info%3Apmid%2F39185518&rft_id=info%3Apmid%2F39185518&rft.externalDocID=39185518 |
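
The abstract reports precision, recall, and F1 for both CarD-T and GPT-4. As a quick consistency check, the sketch below recomputes F1 as the harmonic mean of precision and recall from the reported figures. The function and variable names (`f1_score`, `reported`, `check_reported`) and the rounding tolerance are illustrative assumptions, not part of the CarD-T code base, and the check assumes the reported F1 is the standard harmonic mean of the reported precision and recall.

```python
def f1_score(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision, recall, and F1 exactly as stated in the abstract.
reported = {
    "CarD-T": {"precision": 0.894, "recall": 0.857, "f1": 0.875},
    "GPT-4": {"precision": 0.903, "recall": 0.705, "f1": 0.792},
}


def check_reported(metrics: dict, tol: float = 1e-3) -> dict:
    """Recompute F1 per model and flag whether it matches the reported value.

    `tol` is an assumed tolerance that allows for three-decimal rounding
    of the published numbers.
    """
    results = {}
    for name, m in metrics.items():
        computed = f1_score(m["precision"], m["recall"])
        results[name] = (computed, abs(computed - m["f1"]) <= tol)
    return results


if __name__ == "__main__":
    for model, (f1, consistent) in check_reported(reported).items():
        print(f"{model}: recomputed F1 = {f1:.3f}, consistent = {consistent}")
```

Under these assumptions, both recomputed F1 values agree with the abstract to three decimal places, which suggests the three metrics for each model were computed on the same evaluation set.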