Exploiting ensemble learning to improve prediction of phospholipidosis inducing potential

Bibliographic Details
Published in: Journal of Theoretical Biology, Vol. 479, pp. 37-47
Main Authors: Nath, Abhigyan; Sahu, Gopal Krishna
Format: Journal Article
Language: English
Published: England: Elsevier Ltd, 21.10.2019
Summary:
Highlights:
•Intensive data analysis using machine learning methods.
•Molecular descriptors along with structural alerts as features.
•Deep learning neural networks as the best base learners.
•A stacked ensemble with random forest as the meta learner performed best.

Phospholipidosis is characterized by excessive accumulation of phospholipids in different tissue types (lungs, liver, eyes, kidneys, etc.) caused by cationic amphiphilic drugs. Electron microscopy has revealed lamellar inclusion bodies as the hallmark of phospholipidosis. Some phospholipidosis-inducing compounds can cause tissue-specific inflammatory or retrogressive changes. Reliable and accurate in silico methods could facilitate early screening of phospholipidosis-inducing compounds and thereby speed up pharmaceutical drug discovery pipelines. In the present work, stacking ensembles are implemented to combine a number of different base learners into predictive models (a total of 256 trained machine learning models were tested) for phospholipidosis-inducing compounds, using a wide range of molecular descriptors (ChemMine, JOELib, Open Babel and RDK descriptors) and structural alerts as input features. The best model, a stacked ensemble of machine learning algorithms with random forest as the second-level learner, outperformed the other base and ensemble learners. JOELib descriptors combined with structural alerts performed better than the other descriptor sets. The best ensemble model achieved an overall accuracy of 88.23%, sensitivity of 86.27%, specificity of 90.20%, MCC of 0.765, AUC of 0.896 and a G-mean of 88.21%. To assess the robustness and stability of the best ensemble model, it was further evaluated using stratified 10×10-fold cross-validation and holdout testing sets (repeated 10 times), achieving 84.83% mean accuracy with 0.708 mean MCC and 88.46% mean accuracy with 0.771 mean MCC, respectively. A comparison of different meta classifiers (generalized linear regression, gradient boosting machines, random forest and deep learning neural networks) in the stacking ensemble revealed that random forest is the better choice for combining multiple classification models.
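As an illustration of the stacked-ensemble approach summarized above, the following is a minimal sketch (not the authors' code) using scikit-learn: several base classifiers are combined through a StackingClassifier whose second-level learner is a random forest, and the fitted model is scored with the metrics quoted in the abstract (accuracy, sensitivity, specificity, MCC, AUC, G-mean). The synthetic data, the particular base learners and all hyperparameters are placeholder assumptions; in the study the inputs are molecular descriptors together with structural-alert features.

```python
# Minimal sketch of a stacking ensemble with a random forest meta-learner.
# NOT the authors' implementation: data, base learners and hyperparameters
# below are illustrative placeholders.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier, RandomForestClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, matthews_corrcoef, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a descriptor + structural-alert feature matrix.
X, y = make_classification(n_samples=600, n_features=60, n_informative=20,
                           random_state=0)

# First-level (base) learners; their predicted probabilities become the
# meta-features fed to the second-level learner.
base_learners = [
    ("dnn", make_pipeline(StandardScaler(),
                          MLPClassifier(hidden_layer_sizes=(64, 32),
                                        max_iter=1000, random_state=0))),
    ("gbm", GradientBoostingClassifier(random_state=0)),
    ("glm", make_pipeline(StandardScaler(),
                          LogisticRegression(max_iter=1000, random_state=0))),
]

# Random forest as the second-level (meta) learner, trained on the base
# learners' out-of-fold predicted probabilities.
stack = StackingClassifier(
    estimators=base_learners,
    final_estimator=RandomForestClassifier(n_estimators=500, random_state=0),
    stack_method="predict_proba",
    cv=5,
)

# Stratified holdout evaluation with the metrics reported in the abstract.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          stratify=y, random_state=0)
stack.fit(X_tr, y_tr)
pred = stack.predict(X_te)
tn, fp, fn, tp = confusion_matrix(y_te, pred).ravel()
sens = tp / (tp + fn)          # sensitivity (recall on inducers)
spec = tn / (tn + fp)          # specificity
print("accuracy   :", (tp + tn) / (tp + tn + fp + fn))
print("sensitivity:", sens)
print("specificity:", spec)
print("G-mean     :", np.sqrt(sens * spec))
print("MCC        :", matthews_corrcoef(y_te, pred))
print("AUC        :", roc_auc_score(y_te, stack.predict_proba(X_te)[:, 1]))
```

A repeated stratified scheme comparable to the 10×10-fold cross-validation mentioned in the abstract could be obtained by passing sklearn.model_selection.RepeatedStratifiedKFold(n_splits=10, n_repeats=10) as the cv argument to cross_val_score.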
ISSN: 0022-5193, 1095-8541
DOI: 10.1016/j.jtbi.2019.07.009