Detecting Biosignatures in Complex Molecular Mixtures From Pyrolysis‐Gas Chromatography‐Mass Spectrometry Data Using Machine Learning


Bibliographic Details
Published in Journal of geophysical research. Machine learning and computation Vol. 2; no. 3
Main Authors Hystad, Grethe; Cleaves, H. James; Garmon, Collin A.; Wong, Michael L.; Prabhu, Anirudh; Cody, George D.; Hazen, Robert M.
Format Journal Article
Language English
Published 01.09.2025

Summary: Understanding how measured molecular signals can distinguish the chemistry of life from the chemistry of the nonliving world is a central focus of astrobiology and paleobiology. We train and compare several machine learning (ML) classification models on data from pyrolysis‐gas chromatography‐mass spectrometry (py‐GC‐MS), a widely available analytical method that has been employed in space missions. We analyzed various organic carbon‐bearing geomaterials to consider relationships among suites of molecules that can help identify their biogenicity and potentially be used to analyze data from various solar system exploration missions. These supervised classification models can discriminate between abiotic and biotic samples with ∼86–89% accuracy. We use and compare four different ML models, coupled with a range of statistical and visualization methods, to investigate the patterns and distribution of diagnostic features (specific combinations of chromatographic retention time and mass‐to‐charge ratio) that contribute to the classification of the samples into biologically derived versus abiologically derived materials. These diagnostic discriminators are common in biotic samples and rare in most abiotic samples and hence point to a potential agnostic molecular biosignature. They also tend to have higher normalized intensity values in biologically derived materials and display different distributions in contemporary biotic samples compared to taphonomically altered biotic samples. We utilize the full resolution of the 3D structure of the py‐GC‐MS data and describe in detail the preprocessing steps and the ML pipeline for analyzing such data, which could be automated for future data collection.

Plain Language Summary: Astrobiology and paleobiology are concerned with determining what distinguishes the chemistry of life from the chemistry of the nonliving world. We hypothesize that the diversity and distribution of molecules in biologically derived materials (e.g., plants, animal tissue, bacteria, and coal) are different from those in abiotic materials (e.g., carbon‐rich meteorites and laboratory‐made synthetic reactions). To test this hypothesis, we analyzed a diverse collection of natural and synthetic organic molecular mixtures using pyrolysis‐gas chromatography‐mass spectrometry (py‐GC‐MS), a widely available analytical method that has been used in solar system exploration missions. In py‐GC‐MS, samples are heated, decomposed into smaller components, and separated into fragment ions for molecular identification. We train and compare several machine learning classification models to predict the biogenicity of the samples and to determine the patterns and distribution of features (specific combinations of chromatographic retention time and mass‐to‐charge ratio) that are important for distinguishing biologically derived samples from abiotic ones. These diagnostic features are both more commonly present and more abundant in biotic samples than in abiotic samples, and hence serve as potential molecular biosignatures.

Key Points:
Machine learning is applied to pyrolysis‐gas chromatography‐mass spectrometry to predict the biogenicity of various carbonaceous materials.
Diagnostic features for discriminating biologically derived samples from abiotic samples have been identified.
Potential molecular features identified as diagnostic biochemical discriminators are common in biotic samples and rare in most abiotic ones.
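
The idea that a diagnostic feature is "common in biotic samples and rare in most abiotic samples" can be illustrated with a small sketch. This is not the authors' pipeline: the function names (`normalize`, `prevalence_gap`), the sum-to-one normalization, the presence threshold, and the toy intensity values are all assumptions made only for illustration. Each sample is represented as a mapping from a (retention-time bin, mass-to-charge ratio) pair to a measured intensity.

```python
from collections import defaultdict


def normalize(sample):
    """Scale one sample's intensities so they sum to 1 (an assumed normalization)."""
    total = sum(sample.values())
    if total == 0:
        return {}
    return {feature: value / total for feature, value in sample.items()}


def prevalence_gap(samples, labels, threshold=0.0):
    """For each (retention_time_bin, mz) feature, return the fraction of biotic
    samples in which it is present minus the fraction of abiotic samples.

    A feature counts as present when its normalized intensity exceeds `threshold`.
    """
    counts = {"biotic": defaultdict(int), "abiotic": defaultdict(int)}
    totals = {"biotic": 0, "abiotic": 0}
    for sample, label in zip(samples, labels):
        totals[label] += 1
        for feature, value in normalize(sample).items():
            if value > threshold:
                counts[label][feature] += 1

    features = set(counts["biotic"]) | set(counts["abiotic"])
    n_biotic = max(totals["biotic"], 1)    # guard against an empty class
    n_abiotic = max(totals["abiotic"], 1)
    return {
        feature: counts["biotic"][feature] / n_biotic
        - counts["abiotic"][feature] / n_abiotic
        for feature in features
    }


# Toy data (made up): keys are (retention-time bin, m/z), values are raw intensities.
samples = [
    {(12, 91): 50.0, (30, 57): 50.0},   # biotic
    {(12, 91): 80.0, (45, 44): 20.0},   # biotic
    {(30, 57): 10.0, (45, 44): 90.0},   # abiotic
    {(45, 44): 100.0},                  # abiotic
]
labels = ["biotic", "biotic", "abiotic", "abiotic"]

gaps = prevalence_gap(samples, labels)
# Rank features from most biotic-associated to most abiotic-associated.
ranked = sorted(gaps.items(), key=lambda item: item[1], reverse=True)
```

In this toy example the feature `(12, 91)` appears in every biotic sample and in no abiotic sample, so it ranks first. A prevalence comparison like this is only a descriptive check; the study itself relies on trained supervised classifiers to identify which features matter.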
ISSN: 2993-5210
DOI: 10.1029/2024JH000441