Detecting Biosignatures in Complex Molecular Mixtures From Pyrolysis‐Gas Chromatography‐Mass Spectrometry Data Using Machine Learning
Published in | Journal of geophysical research. Machine learning and computation Vol. 2; no. 3 |
---|---|
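Main Authors | Hystad, Grethe; Cleaves, H. James; Garmon, Collin A.; Wong, Michael L.; Prabhu, Anirudh; Cody, George D.; Hazen, Robert M. |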
Format | Journal Article |
Language | English |
Published | 01.09.2025 |
Abstract | Understanding how measured molecular signals can distinguish the chemistry of life from the chemistry of the nonliving world is a central focus of astrobiology and paleobiology. We train and compare several machine learning (ML) classification models on data from pyrolysis‐gas chromatography‐mass spectrometry (py‐GC‐MS), a widely available analytical method that has been employed in space missions. We analyzed various organic carbon‐bearing geomaterials to consider relationships among suites of molecules that can help identify their biogenicity and potentially be used to analyze data from various solar system exploration missions. These supervised classification models can discriminate between abiotic and biotic samples with ∼86–89% accuracy. We use and compare four different ML models, coupled with a range of statistical and visualization methods, to investigate the patterns and distribution of diagnostic features: specific combinations of chromatographic retention time and mass‐to‐charge ratio that contribute to the classification of the samples into biologically derived versus abiologically derived materials. These diagnostic discriminators are common in biotic samples and rare in most abiotic samples and hence point to a potential agnostic molecular biosignature. They also tend to have higher normalized intensity values in biologically derived materials and display different distributions in contemporary biotic samples compared to taphonomically altered biotic samples. We utilize the full resolution of the 3D structure of the py‐GC‐MS data and describe in detail the preprocessing steps and the ML pipeline for analyzing such data, which could be automated for future data collection.
Astrobiology and paleobiology are concerned with determining what distinguishes the chemistry of life from the chemistry of the nonliving world. We hypothesize that the diversity and distribution of molecules in biologically derived materials (e.g., plants, animal tissue, bacteria, and coal) are different from those in abiotic materials (e.g., carbon‐rich meteorites and laboratory‐made synthetic reactions). To test this hypothesis, we analyzed a diverse collection of natural and synthetic organic molecular mixtures using pyrolysis‐gas chromatography‐mass spectrometry (py‐GC‐MS), a widely available analytical method that has been used in solar system exploration missions. In py‐GC‐MS, samples are heated, decomposed into smaller components, and separated into fragment ions for molecular identification. We train and compare several machine learning classification models to predict the biogenicity of the samples and to determine the patterns and distribution of features (specific combinations of chromatographic retention time and mass‐to‐charge ratio) that are important for distinguishing biologically derived samples from abiotic ones. These diagnostic features are both more commonly present and occur in greater abundance in biotic samples than in abiotic samples, and hence serve as potential molecular biosignatures.
Machine learning is applied to pyrolysis‐gas chromatography‐mass spectrometry to predict the biogenicity of various carbonaceous materials. Diagnostic features for discriminating biologically derived samples from abiotic samples have been identified. Potential molecular features identified as diagnostic biochemical discriminators are common in biotic samples and rare in most abiotic ones. |
---|---|
Author | Cleaves, H. James; Wong, Michael L.; Hazen, Robert M.; Hystad, Grethe; Cody, George D.; Prabhu, Anirudh; Garmon, Collin A. |
Author_xml | 1. Hystad, Grethe (ORCID: 0000-0001-9572-1019); Department of Mathematics and Statistics, Purdue University Northwest, Hammond, IN, USA. 2. Cleaves, H. James (ORCID: 0000-0003-4101-0654); Department of Chemistry, Howard University, Washington, DC, USA; Earth Life Science Institute, Tokyo Institute of Technology, Tokyo, Japan; Blue Marble Space Institute for Science, Seattle, WA, USA. 3. Garmon, Collin A. (ORCID: 0009-0003-0657-9682); Department of Mathematics and Statistics, Purdue University Northwest, Hammond, IN, USA; now at Department of Mathematical Sciences, Purdue University Fort Wayne, Fort Wayne, IN, USA. 4. Wong, Michael L.; Earth and Planets Laboratory, Carnegie Science, Washington, DC, USA; NHFP Sagan Fellow, NASA Hubble Fellowship Program, Space Telescope Science Institute, Baltimore, MD, USA. 5. Prabhu, Anirudh (ORCID: 0000-0002-9921-6084); Earth and Planets Laboratory, Carnegie Science, Washington, DC, USA. 6. Cody, George D.; Earth and Planets Laboratory, Carnegie Science, Washington, DC, USA. 7. Hazen, Robert M. (ORCID: 0000-0003-4163-8644); Earth and Planets Laboratory, Carnegie Science, Washington, DC, USA. |
ContentType | Journal Article |
DOI | 10.1029/2024JH000441 |
EISSN | 2993-5210 |
ISSN | 2993-5210 |
IsDoiOpenAccess | false |
IsOpenAccess | true |
IsPeerReviewed | true |
IsScholarly | true |
Issue | 3 |
Language | English |
ORCID | 0000-0003-4101-0654 0009-0003-0657-9682 0000-0001-9572-1019 0000-0002-9921-6084 0000-0003-4163-8644 |
OpenAccessLink | https://agupubs.onlinelibrary.wiley.com/doi/pdf/10.1029/2024JH000441 |
PublicationCentury | 2000 |
PublicationDate | 2025-09-00 |
PublicationDateYYYYMMDD | 2025-09-01 |
PublicationDate_xml | – month: 09 year: 2025 text: 2025-09-00 |
PublicationDecade | 2020 |
PublicationTitle | Journal of geophysical research. Machine learning and computation |
PublicationYear | 2025 |
Title | Detecting Biosignatures in Complex Molecular Mixtures From Pyrolysis‐Gas Chromatography‐Mass Spectrometry Data Using Machine Learning |
Volume | 2 |
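
The abstract describes diagnostic features as specific combinations of chromatographic retention time and mass‐to‐charge ratio that are common in biotic samples and rare in most abiotic ones. The Python sketch below illustrates that idea only; it is not the authors' pipeline and it is not a classifier. It assumes each sample is a mapping from a (retention-time bin, m/z) pair to an intensity, normalizes each sample by its largest peak, and ranks features by the difference in prevalence between the two classes. The function names (`normalize`, `prevalence`, `rank_discriminators`), the 0.05 presence threshold, the binning, and the toy data are all hypothetical choices made for this illustration.

```python
from typing import Dict, List, Tuple

# A feature is a (retention-time bin, m/z) pair. The binning is hypothetical.
Feature = Tuple[int, int]
Sample = Dict[Feature, float]


def normalize(sample: Sample) -> Sample:
    """Scale intensities so the largest peak in the sample equals 1.0.

    Max-peak scaling is an assumption of this sketch, not necessarily
    the normalization used in the paper.
    """
    peak = max(sample.values(), default=0.0)
    if peak <= 0:
        return {f: 0.0 for f in sample}
    return {f: v / peak for f, v in sample.items()}


def prevalence(samples: List[Sample], feature: Feature,
               threshold: float = 0.05) -> float:
    """Fraction of samples whose normalized intensity exceeds the threshold."""
    if not samples:
        return 0.0
    hits = sum(1 for s in samples
               if normalize(s).get(feature, 0.0) > threshold)
    return hits / len(samples)


def rank_discriminators(biotic: List[Sample], abiotic: List[Sample],
                        threshold: float = 0.05) -> List[Tuple[Feature, float]]:
    """Rank features by (biotic prevalence - abiotic prevalence), highest first."""
    features = set().union(*biotic, *abiotic)
    scored = [
        (f, prevalence(biotic, f, threshold) - prevalence(abiotic, f, threshold))
        for f in features
    ]
    # Sort by descending score; break ties by feature for a stable order.
    return sorted(scored, key=lambda item: (-item[1], item[0]))


# Invented toy data, for illustration only.
biotic = [
    {(10, 57): 100.0, (12, 91): 40.0},
    {(10, 57): 80.0, (12, 91): 2.0},
    {(10, 57): 50.0, (15, 44): 50.0},
]
abiotic = [
    {(15, 44): 200.0, (10, 57): 5.0},
    {(15, 44): 90.0},
]

ranking = rank_discriminators(biotic, abiotic)
for feature, score in ranking:
    print(feature, round(score, 3))
```

In this toy data the feature (10, 57) exceeds the threshold in every biotic sample and in no abiotic sample, so it ranks first. A real analysis would feed such features to supervised classifiers and evaluate them on held-out samples, as the abstract describes.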