Do we need different machine learning algorithms for QSAR modeling? A comprehensive assessment of 16 machine learning algorithms on 14 QSAR data sets

Bibliographic Details
Published in	Briefings in bioinformatics Vol. 22; no. 4
Main Authors	Wu, Zhenxing; Zhu, Minfeng; Kang, Yu; Leung, Elaine Lai-Han; Lei, Tailong; Shen, Chao; Jiang, Dejun; Wang, Zhe; Cao, Dongsheng; Hou, Tingjun
Format	Journal Article
Language	English
Published	England: Oxford University Press, 01.07.2021; Oxford Publishing Limited (England)
Subjects	Algorithms; Artificial neural networks; Computer applications; Datasets; Gaussian process; Learning algorithms; Learning theory; Least squares method; Linear functions; Machine learning; Neural networks; Physicochemical properties; Predictions; Principal components analysis; Radial basis function; Regression analysis; Spline functions; Structure-activity relationships; Support vector machines; Toxicity; ensemble learning; support vector machine; machine learning; QSAR; XGBoost
Abstract	Although a wide variety of machine learning (ML) algorithms have been used to learn quantitative structure–activity relationships (QSARs), there is no agreed single best algorithm for QSAR learning. A comprehensive understanding of the performance characteristics of popular ML algorithms used in QSAR learning is therefore highly desirable. In this study, five linear algorithms [linear function Gaussian process regression (linear-GPR), linear function support vector machine (linear-SVM), partial least squares regression (PLSR), multiple linear regression (MLR) and principal component regression (PCR)], three analogizers [radial basis function support vector machine (rbf-SVM), K-nearest neighbor (KNN) and radial basis function Gaussian process regression (rbf-GPR)], six symbolists [extreme gradient boosting (XGBoost), Cubist, random forest (RF), multivariate adaptive regression splines (MARS), gradient boosting machine (GBM), and classification and regression tree (CART)] and two connectionists [principal component analysis artificial neural network (pca-ANN) and deep neural network (DNN)] were employed to learn regression-based QSAR models for 14 public data sets comprising nine physicochemical properties and five toxicity endpoints. The results show that rbf-SVM, rbf-GPR, XGBoost and DNN generally perform better than the other algorithms. The overall performance of the algorithms can be ranked from best to worst as follows: rbf-SVM > XGBoost > rbf-GPR > Cubist > GBM > DNN > RF > pca-ANN > MARS > linear-GPR ≈ KNN > linear-SVM ≈ PLSR > CART ≈ PCR ≈ MLR. In terms of prediction accuracy and computational efficiency, SVM and XGBoost are recommended for regression learning on small data sets, and XGBoost is an excellent choice for large data sets. We then investigated the performance of ensemble models that integrate the predictions of multiple ML algorithms. The results show that ensembles of two or three algorithms from different categories can indeed improve on the predictions of the best individual ML algorithms.
Author_xml	1. Wu, Zhenxing (3140101624@zju.edu.cn); 2. Zhu, Minfeng (330561273@qq.com); 3. Kang, Yu (yukang@zju.edu.cn); 4. Leung, Elaine Lai-Han (lhleung@must.edu.mo); 5. Lei, Tailong (ltl_1988@126.com; ORCID 0000-0003-2067-1787); 6. Shen, Chao (3130101022@zju.edu.cn); 7. Jiang, Dejun (jiang_dj@zju.edu.cn); 8. Wang, Zhe (wangzhehyd@163.com); 9. Cao, Dongsheng (oriental-cds@163.com); 10. Hou, Tingjun (tingjunhou@zju.edu.cn)
BackLink	https://www.ncbi.nlm.nih.gov/pubmed/33313673 (View this record in MEDLINE/PubMed)
ContentType	Journal Article
Copyright	The Author(s) 2020. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com
DOI	10.1093/bib/bbaa321
Discipline	Biology
EISSN	1477-4054
ExternalDocumentID	33313673 10_1093_bib_bbaa321 10.1093/bib/bbaa321
Genre	Journal Article
ISSN	1467-5463 1477-4054
IsPeerReviewed	true
IsScholarly	true
Issue	4
Keywords	ensemble learning; support vector machine; machine learning; QSAR; XGBoost
Language	English
License	This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model). The Author(s) 2020. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com.
ORCID	0000-0003-2067-1787
PMID	33313673
PublicationDate	2021-07-01
PublicationPlace	England
PublicationTitle	Briefings in bioinformatics
PublicationTitleAlternate	Brief Bioinform
PublicationYear	2021
Publisher	Oxford University Press; Oxford Publishing Limited (England)
ident: 2021072117041900100_ref17 article-title: Deep learning driven QSAR model for environmental toxicology: effects of endocrine disrupting chemicals on human health publication-title: Environ Pollut doi: 10.1016/j.envpol.2019.06.081 – start-page: 1265 volume-title: Chemmedchem year: 2007 ident: 2021072117041900100_ref28 article-title: Predicting lipophilicity of drug-discovery molecules using Gaussian process models – volume-title: The Master Algorithm: How the Quest for the Ultimate Learning Machine will Remake our World year: 2015 ident: 2021072117041900100_ref53 – volume: 28 start-page: 1 issue: 5 year: 2008 ident: 2021072117041900100_ref48 article-title: Building predictive models in R using the caret package publication-title: J Stat Softw doi: 10.18637/jss.v028.i05 – volume: 15 start-page: 1006 issue: 10 year: 1972 ident: 2021072117041900100_ref12 article-title: Utilization of operational schemes for analog synthesis in drug design publication-title: J Med Chem doi: 10.1021/jm00280a002 – volume: 146 start-page: 494 year: 2015 ident: 2021072117041900100_ref10 article-title: In silico toxicity prediction of chemicals from EPA toxicity database by kernel fusion-based support vector machines publication-title: Chemom Intel Lab Syst doi: 10.1016/j.chemolab.2015.07.009 – volume: 47 start-page: 407 issue: 2 year: 2007 ident: 2021072117041900100_ref27 article-title: Accurate solubility prediction with error bars for electrolytes: a machine learning approach publication-title: J Chem Inf Model doi: 10.1021/ci600205g – start-page: 1919 year: 2019 ident: 2021072117041900100_ref38 article-title: Meta-weight-net: learning an explicit mapping for sample weighting publication-title: Adv Neural Inf Process Syst – volume: 58 start-page: 1725 issue: 9 year: 2018 ident: 2021072117041900100_ref45 article-title: PADFrag: a database built for the exploration of bioactive fragment space for drug discovery publication-title: J Chem Inf Model doi: 10.1021/acs.jcim.8b00285 – 
volume: 20 start-page: 229 issue: 3 year: 2019 ident: 2021072117041900100_ref14 article-title: Survey of machine learning techniques for prediction of the isoform specificity of cytochrome P450 substrates publication-title: Curr Drug Metab doi: 10.2174/1389200219666181019094526 – volume: 40 start-page: 195 issue: 2 year: 2000 ident: 2021072117041900100_ref18 article-title: The characterization of chemical structures using molecular properties. A survey publication-title: J Chem Inf Comput Sci doi: 10.1021/ci990162i – volume: 49 start-page: 113 year: 2010 ident: 2021072117041900100_ref19 article-title: Computational analysis of structure–activity relationships. Progress in medicinal chemistry publication-title: Elsevier – volume: 15 start-page: 4336 issue: 10 year: 2018 ident: 2021072117041900100_ref37 article-title: Prediction of human cytochrome P450 inhibition using a multitask deep autoencoder neural network publication-title: Mol Pharm doi: 10.1021/acs.molpharmaceut.8b00110
Snippet	Abstract Although a wide variety of machine learning (ML) algorithms have been utilized to learn quantitative structure–activity relationships (QSARs), there...
SourceID	proquest pubmed crossref oup
SourceType	Aggregation Database; Index Database; Enrichment Source; Publisher
SubjectTerms	Algorithms; Artificial neural networks; Computer applications; Datasets; Gaussian process; Learning algorithms; Learning theory; Least squares method; Linear functions; Machine learning; Neural networks; Physicochemical properties; Predictions; Principal components analysis; Radial basis function; Regression analysis; Spline functions; Structure-activity relationships; Support vector machines; Toxicity
Title	Do we need different machine learning algorithms for QSAR modeling? A comprehensive assessment of 16 machine learning algorithms on 14 QSAR data sets
URI	https://www.ncbi.nlm.nih.gov/pubmed/33313673 https://www.proquest.com/docview/2590043980 https://www.proquest.com/docview/2470023601
Volume	22
linkProvider	Oxford University Press
DOI	10.1093/bib/bbaa321
ISSN	1467-5463
EISSN	1477-4054