Evaluation of Cross-Validation Strategies in Sequence-Based Binding Prediction Using Deep Learning

Bibliographic Details
Published in	Journal of chemical information and modeling Vol. 59; no. 4; pp. 1645 - 1657
Main Authors	Lopez-del Rio, Angela; Nonell-Canals, Alfons; Vidal, David; Perera-Lluna, Alexandre
Format	Journal Article Publication
Language	English
Published	United States: American Chemical Society, 22.04.2019
Subjects	Computer applications; Computer applications in physics and engineering; Machine learning; Artificial neural networks; Binding; Classification; Cluster analysis; Clustering; Deep learning; Deep neural networks; Fingerprints; Computer science; Model testing; Regression models; Splitting; Vector quantization; UPC subject areas
Summary:	Binding prediction between targets and drug-like compounds through deep neural networks has generated promising results in recent years, outperforming traditional machine learning-based methods. However, the generalization capability of these classification models is still an issue to be addressed. In this work, we explored how different cross-validation strategies, applied to data from different molecular databases, affect the performance of proteochemometrics models for binding prediction. These strategies are (1) random splitting, (2) splitting based on K-means clustering (of both actives and inactives), (3) splitting based on the source database, and (4) splitting based on both the clustering and the source database. These schemas are applied to a deep learning proteochemometrics model and to a simple logistic regression model used as a baseline. Additionally, two different ways of describing molecules in the model are tested: (1) by their SMILES and (2) by three fingerprints. The classification performance of our deep learning-based proteochemometrics model is comparable to the state of the art. Our results show that the lack of generalization of these models is due to a bias in public molecular databases and that a restrictive cross-validation schema based on compound clustering leads to worse but more robust and credible results. Our results also show better performance when molecules are represented by their fingerprints.
ISSN:	1549-9596; 1549-960X
DOI:	10.1021/acs.jcim.8b00663
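
The summary contrasts random splitting with splits that keep whole compound clusters or whole source databases on one side of the train/test boundary. The sketch below illustrates that general idea only and is not the authors' implementation: it assumes one group label per sample already exists (for example a cluster id from K-means, or a database name), and the names `group_kfold` and `train_test_indices` and the toy labels are hypothetical.

```python
from collections import defaultdict


def group_kfold(groups, n_folds):
    """Assign each sample to a fold so that samples sharing a group stay together.

    `groups` holds one hashable label per sample (assumed to be precomputed,
    e.g. a cluster id or a source-database name).
    Returns a list with one fold index per sample.
    """
    members = defaultdict(list)
    for idx, label in enumerate(groups):
        members[label].append(idx)
    if len(members) < n_folds:
        raise ValueError("need at least as many groups as folds")

    fold_sizes = [0] * n_folds
    fold_of = [None] * len(groups)
    # Greedy balancing: place the largest groups first,
    # each into the fold that is currently smallest.
    ordered = sorted(members, key=lambda g: (-len(members[g]), str(g)))
    for label in ordered:
        target = fold_sizes.index(min(fold_sizes))
        for idx in members[label]:
            fold_of[idx] = target
        fold_sizes[target] += len(members[label])
    return fold_of


def train_test_indices(fold_of, test_fold):
    """Return (train, test) sample indices for one held-out fold."""
    train = [i for i, f in enumerate(fold_of) if f != test_fold]
    test = [i for i, f in enumerate(fold_of) if f == test_fold]
    return train, test


# Hypothetical toy labels: one cluster id per compound.
cluster_ids = [0, 0, 1, 1, 1, 2, 3, 3, 2, 4]
folds = group_kfold(cluster_ids, n_folds=3)

for k in range(3):
    train, test = train_test_indices(folds, k)
    print(f"fold {k}: train={train} test={test}")
```

Passing database names instead of cluster ids gives a source-based split with the same function. A random split, by contrast, can place near-identical compounds on both sides of the boundary, which is the kind of optimistic evaluation the summary cautions against.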