MolPipeline: A Python Package for Processing Molecules with RDKit in Scikit-learn

Bibliographic Details
Published in	Journal of chemical information and modeling
Main Authors	Sieg, Jochen; Feldmann, Christian W; Hemmerich, Jennifer; Stork, Conrad; Sandfort, Frederik; Eiden, Philipp; Mathea, Miriam
Format	Journal Article
Language	English
Published	United States 17.09.2024

More Information
Summary:	The open-source package scikit-learn provides various machine learning algorithms and data processing tools, including the Pipeline class, which allows users to prepend custom data transformation steps to the machine learning model. We introduce the MolPipeline package, which extends this concept to cheminformatics by wrapping standard RDKit functionality, such as reading and writing SMILES strings or calculating molecular descriptors from a molecule object. We aimed to build an easy-to-use Python package to create completely automated end-to-end pipelines that scale to large data sets. Particular emphasis was put on handling erroneous instances, where resolution would require manual intervention in default pipelines. MolPipeline provides the building blocks to enable seamless integration of common cheminformatics tasks within scikit-learn's pipeline framework, such as scaffold splits and molecular standardization, making pipeline building easily adaptable to diverse project requirements.
ISSN:	1549-960X
DOI:	10.1021/acs.jcim.4c00863
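
The summary stresses handling erroneous instances without manual intervention. The following is a minimal sketch of that general idea only, not MolPipeline's actual API: inputs that fail to parse are filtered out before the model runs, and a placeholder is reinserted at their original positions so the output stays aligned with the input. All names here (toy_parse, toy_model, predict_with_reinsertion) are hypothetical, and the parser is a toy stand-in for a real SMILES parser such as RDKit's.

```python
from typing import Callable, List, Optional, Sequence


def toy_parse(smiles: str) -> str:
    """Hypothetical stand-in for a real SMILES parser (assumption).

    It only rejects empty strings and unbalanced parentheses.
    """
    depth = 0
    for char in smiles:
        if char == "(":
            depth += 1
        elif char == ")":
            depth -= 1
        if depth < 0:
            raise ValueError(f"unbalanced parentheses in {smiles!r}")
    if not smiles or depth != 0:
        raise ValueError(f"cannot parse {smiles!r}")
    return smiles


def toy_model(molecules: Sequence[str]) -> List[int]:
    """Hypothetical batch 'model': predicts the length of each string."""
    return [len(molecule) for molecule in molecules]


def predict_with_reinsertion(
    inputs: Sequence[str],
    parse: Callable[[str], str],
    predict_batch: Callable[[Sequence[str]], List[int]],
    fill_value: Optional[int] = None,
) -> List[Optional[int]]:
    """Predict on parseable inputs; put fill_value where parsing failed."""
    valid_positions: List[int] = []
    valid_items: List[str] = []
    for position, raw in enumerate(inputs):
        try:
            parsed = parse(raw)
        except ValueError:
            continue  # filter the erroneous instance instead of aborting
        valid_positions.append(position)
        valid_items.append(parsed)

    # Start with placeholders everywhere, then overwrite the valid positions.
    results: List[Optional[int]] = [fill_value] * len(inputs)
    for position, prediction in zip(valid_positions, predict_batch(valid_items)):
        results[position] = prediction
    return results


predictions = predict_with_reinsertion(
    ["CCO", "C(C", "", "CC(=O)O"], toy_parse, toy_model
)
print(predictions)
```

In this sketch the two malformed entries come back as None while the valid ones receive predictions, so downstream code can match results to inputs by position.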