Automated Determination of the Molecular Substructure from Nuclear Magnetic Resonance Spectra Using Neural Networks

Bibliographic Details
Published in: Journal of Chemical Information and Modeling, Vol. 65, no. 16, pp. 8435-8447
Main Authors: Liu, Shiyun; Cole, Jacqueline M.
Format: Journal Article
Language: English
Published: United States, American Chemical Society, 25.08.2025

Summary: Nuclear magnetic resonance (NMR) spectroscopy is an indispensable tool for determining the structural characteristics of a molecule by analyzing its chemical shifts. A wealth of NMR spectra therefore exists and continues to amass on a daily basis, at an ever-increasing rate owing to the progressive automation of chemical analysis. This growth and automation have made the data analysis step in NMR spectroscopy the main bottleneck in the structural characterization of a new chemical compound. In particular, the data interpretation step is slow and prone to error, as it requires manual examination by a suitably trained scientist. Machine learning (ML) methods could overcome this bottleneck, provided that they can automatically correlate the collection of peaks in an NMR spectrum with the substructure of its subject molecule. This study explores the art of the possible using three types of ML methods that are based on neural-network architectures: a multilayer perceptron (MLP) + long short-term memory (LSTM) neural network, a convolutional neural network (CNN), and an MLP + recurrent neural network (RNN). NMR spectrum–structure correlations were encoded into each type of neural network using two forms of molecular representation, one employing functional groups and the other using a novel neighbor-based method. These models were trained on 34,503 and 17,311 experimental 13C and 1H NMR spectra, respectively. The influence of incorporating metadata about experimental conditions (NMR field strength, temperature, and solvent) into the neural-network model was also investigated. The models incorporated coupling constants as a proxy for spectral intensities in the case of 13C NMR spectra. We found that the MLP + LSTM model achieved the highest accuracy (88%) when it was trained on 13C NMR spectra and incorporated experimental metadata (compared to 77% without the metadata). While the CNN model performance was slightly lower (86% accuracy), it determined molecular substructures in one-third of the computational run time required by the MLP + LSTM model. Thus, the CNN model emerged as the best model in practice when considering performance, time, and cost.
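To make the idea of feeding peaks plus experimental metadata into a neural network concrete, the following is a minimal sketch of one possible input encoding. It is not the authors' method: the binning scheme, the 0 to 220 ppm range, the 1 ppm bin width, the solvent vocabulary, the scaling constants, and all function names are assumptions chosen purely for illustration.

```python
from typing import List, Sequence

# Hypothetical settings for illustration only; none are taken from the paper.
SHIFT_MIN_PPM = 0.0      # assumed lower bound of a 13C shift axis
SHIFT_MAX_PPM = 220.0    # assumed upper bound of a 13C shift axis
BIN_WIDTH_PPM = 1.0      # assumed bin width
SOLVENTS = ["CDCl3", "DMSO-d6", "D2O"]  # assumed solvent vocabulary


def bin_shifts(shifts_ppm: Sequence[float]) -> List[float]:
    """Mark each bin that contains at least one peak (binary occupancy)."""
    n_bins = int((SHIFT_MAX_PPM - SHIFT_MIN_PPM) / BIN_WIDTH_PPM)
    vec = [0.0] * n_bins
    for shift in shifts_ppm:
        # Peaks outside the assumed range are silently ignored.
        if SHIFT_MIN_PPM <= shift < SHIFT_MAX_PPM:
            vec[int((shift - SHIFT_MIN_PPM) / BIN_WIDTH_PPM)] = 1.0
    return vec


def encode_metadata(field_mhz: float, temperature_k: float, solvent: str) -> List[float]:
    """Scale the numeric conditions and one-hot encode the solvent."""
    one_hot = [1.0 if solvent == name else 0.0 for name in SOLVENTS]
    # Division by 1000 is an arbitrary scaling to keep values near unit size.
    return [field_mhz / 1000.0, temperature_k / 1000.0] + one_hot


def build_input(shifts_ppm: Sequence[float], field_mhz: float,
                temperature_k: float, solvent: str) -> List[float]:
    """Concatenate the binned spectrum with the encoded metadata."""
    return bin_shifts(shifts_ppm) + encode_metadata(field_mhz, temperature_k, solvent)


# Example: three made-up 13C peaks recorded at 400 MHz and 298 K in CDCl3.
features = build_input([14.1, 60.4, 171.0], 400.0, 298.0, "CDCl3")
```

A vector like this could serve as the input to a multi-label classifier whose outputs flag the presence of individual substructures; the record itself does not specify how the spectra or metadata were actually encoded.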
ISSN: 1549-9596, 1549-960X
DOI: 10.1021/acs.jcim.5c00499