Detecting Biosignatures in Complex Molecular Mixtures From Pyrolysis‐Gas Chromatography‐Mass Spectrometry Data Using Machine Learning


Bibliographic Details
Published in Journal of geophysical research. Machine learning and computation Vol. 2; no. 3
Main Authors Hystad, Grethe; Cleaves, H. James; Garmon, Collin A.; Wong, Michael L.; Prabhu, Anirudh; Cody, George D.; Hazen, Robert M.
Format Journal Article
Language English
Published 01.09.2025
Abstract Understanding how measured molecular signals can distinguish the chemistry of life from the chemistry of the nonliving world is a central focus of astrobiology and paleobiology. We train and compare several machine learning (ML) classification models on data from pyrolysis‐gas chromatography‐mass spectrometry (py‐GC‐MS), a widely available analytical method that has been employed in space missions. We analyzed various organic carbon‐bearing geomaterials to consider relationships among suites of molecules that can help identify their biogenicity and potentially be used to analyze data from various solar system exploration missions. These supervised classification models can discriminate between abiotic and biotic samples with ∼86–89% accuracy. We use and compare four different ML models, coupled with a range of statistical and visualization methods, to investigate the patterns and distribution of diagnostic features: specific combinations of chromatographic retention time and mass‐to‐charge ratio that contribute to the classification of the samples into biologically derived versus abiologically derived materials. These diagnostic discriminators are common in biotic samples and rare in most abiotic samples and hence point to a potential agnostic molecular biosignature. They also tend to have higher normalized intensity values in biologically derived materials and display different distributions in contemporary biotic samples compared to taphonomically altered biotic samples. We utilize the full resolution of the 3D structure of the py‐GC‐MS data and describe in detail the preprocessing steps and the ML pipeline for analyzing such data, which could be automated for future data collection.

Plain Language Summary Astrobiology and paleobiology are concerned with determining what distinguishes the chemistry of life from the chemistry of the nonliving world. We hypothesize that the diversity and distribution of molecules in biologically derived materials (e.g., plants, animal tissue, bacteria, and coal) are different from those in abiotic materials (e.g., carbon‐rich meteorites and laboratory‐made synthetic reactions). To test this hypothesis, we analyzed a diverse collection of natural and synthetic organic molecular mixtures using pyrolysis‐gas chromatography‐mass spectrometry (py‐GC‐MS), a widely available analytical method that has been used in solar system exploration missions. In py‐GC‐MS, samples are heated, decomposed into smaller components, and separated into fragment ions for molecular identification. We train and compare several machine learning classification models to predict the biogenicity of the samples and to determine the patterns and distribution of features (specific combinations of chromatographic retention time and mass‐to‐charge ratio) that are important for distinguishing biologically derived samples from abiotic ones. These diagnostic features are both more commonly present and occur in greater abundance in biotic samples than in abiotic samples, and hence serve as potential molecular biosignatures.

Key Points
Machine learning is applied to pyrolysis‐gas chromatography‐mass spectrometry to predict the biogenicity of various carbonaceous materials.
Diagnostic features for discriminating biologically derived samples from abiotic samples have been identified.
Potential molecular features identified as diagnostic biochemical discriminators are common in biotic samples and rare in most abiotic ones.
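The abstract describes features as combinations of chromatographic retention time and mass‐to‐charge ratio, with normalized intensities that are more common in biotic samples. The following is a minimal illustrative sketch of that idea in Python, not the authors' pipeline. The function names (`bin_features`, `prevalence`), the bin widths, the max‐based normalization, and the toy peak values are all assumptions made for illustration.

```python
from collections import defaultdict

def bin_features(peaks, rt_width=1.0, mz_width=1.0):
    """Sum intensities into (retention-time bin, m/z bin) cells, then scale
    so the largest cell equals 1. Bin widths are hypothetical choices."""
    binned = defaultdict(float)
    for rt, mz, intensity in peaks:
        key = (int(rt // rt_width), int(mz // mz_width))
        binned[key] += intensity
    top = max(binned.values(), default=0.0)
    if top <= 0:
        return {}
    return {key: value / top for key, value in binned.items()}

def prevalence(samples, labels, target):
    """Fraction of samples carrying the target label in which each feature occurs."""
    group = [s for s, lab in zip(samples, labels) if lab == target]
    if not group:
        return {}
    counts = defaultdict(int)
    for sample in group:
        for key in sample:
            counts[key] += 1
    return {key: n / len(group) for key, n in counts.items()}

# Toy data (invented): each peak is (retention time, m/z, intensity).
sample_a = bin_features([(10.2, 91.0, 50.0), (10.7, 91.3, 50.0), (12.1, 44.0, 25.0)])
sample_b = bin_features([(10.4, 91.1, 80.0)])
sample_c = bin_features([(3.5, 18.2, 10.0)])
samples = [sample_a, sample_b, sample_c]
labels = ["biotic", "biotic", "abiotic"]

biotic_prev = prevalence(samples, labels, "biotic")
abiotic_prev = prevalence(samples, labels, "abiotic")
```

In this toy setup, a feature with high prevalence among biotic samples and low prevalence among abiotic samples would be a candidate discriminator; in practice, the resulting per‐sample feature vectors would be passed to a supervised classifier.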
Author_xml – sequence: 1
  givenname: Grethe
  orcidid: 0000-0001-9572-1019
  surname: Hystad
  fullname: Hystad, Grethe
  organization: Department of Mathematics and Statistics, Purdue University Northwest, Hammond, IN, USA
– sequence: 2
  givenname: H. James
  orcidid: 0000-0003-4101-0654
  surname: Cleaves
  fullname: Cleaves, H. James
  organization: Department of Chemistry, Howard University, Washington, DC, USA; Earth Life Science Institute, Tokyo Institute of Technology, Tokyo, Japan; Blue Marble Space Institute for Science, Seattle, WA, USA
– sequence: 3
  givenname: Collin A.
  orcidid: 0009-0003-0657-9682
  surname: Garmon
  fullname: Garmon, Collin A.
  organization: Department of Mathematics and Statistics, Purdue University Northwest, Hammond, IN, USA; now at Department of Mathematical Sciences, Purdue University Fort Wayne, Fort Wayne, IN, USA
– sequence: 4
  givenname: Michael L.
  surname: Wong
  fullname: Wong, Michael L.
  organization: Earth and Planets Laboratory, Carnegie Science, Washington, DC, USA; NHFP Sagan Fellow, NASA Hubble Fellowship Program, Space Telescope Science Institute, Baltimore, MD, USA
– sequence: 5
  givenname: Anirudh
  orcidid: 0000-0002-9921-6084
  surname: Prabhu
  fullname: Prabhu, Anirudh
  organization: Earth and Planets Laboratory, Carnegie Science, Washington, DC, USA
– sequence: 6
  givenname: George D.
  surname: Cody
  fullname: Cody, George D.
  organization: Earth and Planets Laboratory, Carnegie Science, Washington, DC, USA
– sequence: 7
  givenname: Robert M.
  orcidid: 0000-0003-4163-8644
  surname: Hazen
  fullname: Hazen, Robert M.
  organization: Earth and Planets Laboratory, Carnegie Science, Washington, DC, USA
DOI 10.1029/2024JH000441
EISSN 2993-5210
ISSN 2993-5210
IsOpenAccess true
IsPeerReviewed true
OpenAccessLink https://agupubs.onlinelibrary.wiley.com/doi/pdf/10.1029/2024JH000441