Automated Determination of the Molecular Substructure from Nuclear Magnetic Resonance Spectra Using Neural Networks
Published in | Journal of chemical information and modeling Vol. 65; no. 16; pp. 8435 - 8447 |
---|---|
Main Authors | , |
Format | Journal Article |
Language | English |
Published | United States: American Chemical Society, 25.08.2025 |
Subjects | |
Summary: | Nuclear magnetic resonance (NMR) spectroscopy is an indispensable tool for determining the structural characteristics of a molecule by analyzing its chemical shifts. A wealth of NMR spectra therefore exists and continues to amass on a daily basis, at an ever-increasing rate owing to the progressive automation of chemical analysis. This growth and automation have made data analysis the main bottleneck in the structural characterization of a new chemical compound. In particular, the data interpretation step is slow and prone to error because it requires manual examination by a suitably trained scientist. Machine learning (ML) methods could overcome this bottleneck, provided that they can automatically correlate the collection of peaks in an NMR spectrum with the substructure of its subject molecule. This study explores the art of the possible using three types of ML methods that are based on neural-network architectures: a multilayer perceptron (MLP) + long short-term memory (LSTM) neural network, a convolutional neural network (CNN), and an MLP + recurrent neural network (RNN). NMR spectrum–structure correlations were encoded into each type of neural network using two forms of molecular representation, one employing functional groups and the other using a novel neighbor-based method. These models were trained on 34,503 and 17,311 experimental 13C and 1H NMR spectra, respectively. The influence of incorporating metadata about experimental conditions (NMR field strength, temperature, and solvent) into the neural-network model was also investigated. The models incorporated coupling constants as a proxy for spectral intensities in the case of 13C NMR spectra. We found that the MLP + LSTM model achieved the highest accuracy (88%) when trained on 13C NMR spectra with experimental metadata incorporated (compared to 77% without it). While the CNN model's accuracy was slightly lower (86%), it determined molecular substructures in one-third of the computational run time of the MLP + LSTM model. Thus, the CNN model emerged as the most practical model when considering performance, time, and cost. |
---|---|
ISSN: | 1549-9596 1549-960X |
DOI: | 10.1021/acs.jcim.5c00499 |