CarD-T: Interpreting Carcinomic Lexicon via Transformers

Bibliographic Details
Published in medRxiv : the preprint server for health sciences
Main Authors O'Neill, Jamey; Reddy, Gudur Ashrith; Dhillon, Nermeeta; Tripathi, Osika; Alexandrov, Ludmil; Katira, Parag
Format Journal Article
Language English
Published United States 31.08.2024
DOI 10.1101/2024.08.13.24311948

Abstract The identification and classification of carcinogens is critical in cancer epidemiology, necessitating updated methodologies to manage the burgeoning biomedical literature. Current systems, like those run by the International Agency for Research on Cancer (IARC) and the National Toxicology Program (NTP), face challenges due to manual vetting and disparities in carcinogen classification spurred by the volume of emerging data. To address these issues, we introduced the Carcinogen Detection via Transformers (CarD-T) framework, a text analytics approach that combines transformer-based machine learning with probabilistic statistical analysis to efficiently nominate carcinogens from scientific texts. CarD-T uses Named Entity Recognition (NER) trained on PubMed abstracts featuring known carcinogens from IARC groups and includes a context classifier to enhance accuracy and manage computational demands. Using this method, journal publications from the last 25 years indexed with the carcinogenicity and carcinogenesis Medical Subject Headings (MeSH) terms were analyzed to identify potential carcinogens. When trained on 60% of established carcinogens (IARC Group 1 and 2A), CarD-T correctly identifies all of the remaining Group 1 and 2A carcinogens in the analyzed text. In addition, CarD-T nominates roughly 1500 more entities as potential carcinogens that have at least two publications citing evidence of carcinogenicity. A comparative assessment of CarD-T against the GPT-4 model reveals higher recall (0.857 vs 0.705) and F1 score (0.875 vs 0.792), with comparable precision (0.894 vs 0.903). Additionally, CarD-T highlights 554 entities that show conflicting evidence for carcinogenicity. These are further analyzed using Bayesian temporal Probabilistic Carcinogenic Denomination (PCarD) to provide probabilistic evaluations of their carcinogenic status based on evolving evidence.
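The abstract does not spell out how PCarD is formulated, so the following is only an illustrative sketch of one common way to combine conflicting, time-stamped evidence in a Bayesian manner: a Beta-Bernoulli update in which older publications are down-weighted by an exponential decay. The `Evidence` class, the `posterior_carcinogenic` function, the `half_life` parameter, and the uniform Beta(1, 1) prior are all assumptions introduced for illustration and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Evidence:
    """One publication's verdict on an entity (hypothetical structure)."""
    year: int
    supports: bool  # True if the publication reports evidence of carcinogenicity


def posterior_carcinogenic(
    evidence: Iterable[Evidence],
    current_year: int,
    half_life: float = 10.0,  # assumed: evidence weight halves every 10 years
    prior_a: float = 1.0,     # assumed uniform Beta(1, 1) prior
    prior_b: float = 1.0,
) -> float:
    """Posterior mean of P(carcinogenic) under a time-decayed Beta-Bernoulli model."""
    a, b = prior_a, prior_b
    for item in evidence:
        age = max(0, current_year - item.year)
        weight = 0.5 ** (age / half_life)  # recent publications count more
        if item.supports:
            a += weight
        else:
            b += weight
    return a / (a + b)


if __name__ == "__main__":
    # Made-up example: an older supporting study and a recent refuting one.
    history = [Evidence(2004, True), Evidence(2024, False)]
    print(posterior_carcinogenic(history, current_year=2024))
```

Under these assumptions a recent refuting publication outweighs an older supporting one, which is the kind of behavior an "evolving evidence" evaluation would need; the actual PCarD model may differ substantially.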
Our findings underscore that the CarD-T framework is not only robust and effective in identifying and nominating potential carcinogens within vast biomedical literature but also efficient on consumer GPUs. This integration of advanced NLP capabilities with vital epidemiological analysis significantly enhances the agility of public health responses to carcinogen identification, thereby setting a new benchmark for automated, scalable toxicological investigations.
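As a quick consistency check on the metrics quoted in the abstract, the F1 score can be recomputed from the reported precision and recall using the standard definition, F1 = 2PR / (P + R). The sketch below uses only the figures given in the abstract; the helper name `f1_score` is introduced here for illustration.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (standard F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision, recall, and F1 exactly as quoted in the abstract.
reported = {
    "CarD-T": {"precision": 0.894, "recall": 0.857, "f1": 0.875},
    "GPT-4": {"precision": 0.903, "recall": 0.705, "f1": 0.792},
}

if __name__ == "__main__":
    for model, m in reported.items():
        computed = f1_score(m["precision"], m["recall"])
        print(f"{model}: computed F1 = {computed:.3f}, reported F1 = {m['f1']:.3f}")
```

Both recomputed values agree with the reported F1 scores to three decimal places, so the quoted precision, recall, and F1 figures are internally consistent.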
Author_xml – sequence: 1
  givenname: Jamey
  surname: O'Neill
  fullname: O'Neill, Jamey
  organization: Department of Bioengineering, University of California San Diego, La Jolla, CA, USA
– sequence: 2
  givenname: Gudur Ashrith
  surname: Reddy
  fullname: Reddy, Gudur Ashrith
  organization: Department of Bioengineering, University of California San Diego, La Jolla, CA, USA
– sequence: 3
  givenname: Nermeeta
  surname: Dhillon
  fullname: Dhillon, Nermeeta
  organization: Mechanical Engineering Department, San Diego State University, San Diego, CA, USA
– sequence: 4
  givenname: Osika
  surname: Tripathi
  fullname: Tripathi, Osika
  organization: Herbert Wertheim School of Public Health and Human Longevity Science, University of California San Diego, La Jolla, CA, USA
– sequence: 5
  givenname: Ludmil
  orcidid: 0000-0003-3596-4515
  surname: Alexandrov
  fullname: Alexandrov, Ludmil
  organization: Sanford Stem Cell Institute, University of California San Diego, La Jolla, CA, USA
– sequence: 6
  givenname: Parag
  orcidid: 0000-0001-9873-5117
  surname: Katira
  fullname: Katira, Parag
  organization: Computational Science Research Center, San Diego State University, San Diego, CA, USA
BackLink https://www.ncbi.nlm.nih.gov/pubmed/39185518 (View this record in MEDLINE/PubMed)
Genre Journal Article
Preprint
GrantInformation_xml – fundername: NIMHD NIH HHS
  grantid: U54 MD012397
– fundername: NIMHD NIH HHS
  grantid: S21 MD010690
– fundername: NCI NIH HHS
  grantid: U54 CA285117
IsDoiOpenAccess false
IsOpenAccess true
IsPeerReviewed false
IsScholarly false
Keywords Named Entity Recognition
Biomedical language models
Cancer epidemiology
Bayesian analysis
Carcinogen identification
ORCID 0000-0003-3596-4515
0000-0001-9873-5117
OpenAccessLink https://www.medrxiv.org/content/10.1101/2024.08.13.24311948
PMID 39185518