Do we need different machine learning algorithms for QSAR modeling? A comprehensive assessment of 16 machine learning algorithms on 14 QSAR data sets

Bibliographic Details
Published in Briefings in bioinformatics Vol. 22; no. 4
Main Authors Wu, Zhenxing; Zhu, Minfeng; Kang, Yu; Leung, Elaine Lai-Han; Lei, Tailong; Shen, Chao; Jiang, Dejun; Wang, Zhe; Cao, Dongsheng; Hou, Tingjun
Format Journal Article
Language English
Published England Oxford University Press 01.07.2021
Oxford Publishing Limited (England)

Abstract Although a wide variety of machine learning (ML) algorithms have been used to learn quantitative structure–activity relationships (QSARs), there is no agreed single best algorithm for QSAR learning. A comprehensive understanding of the performance characteristics of the popular ML algorithms used in QSAR learning is therefore highly desirable. In this study, five linear algorithms [linear function Gaussian process regression (linear-GPR), linear function support vector machine (linear-SVM), partial least squares regression (PLSR), multiple linear regression (MLR) and principal component regression (PCR)], three analogizers [radial basis function support vector machine (rbf-SVM), K-nearest neighbor (KNN) and radial basis function Gaussian process regression (rbf-GPR)], six symbolists [extreme gradient boosting (XGBoost), Cubist, random forest (RF), multivariate adaptive regression splines (MARS), gradient boosting machine (GBM), and classification and regression tree (CART)] and two connectionists [principal component analysis artificial neural network (pca-ANN) and deep neural network (DNN)] were employed to build regression-based QSAR models for 14 public data sets comprising nine physicochemical properties and five toxicity endpoints. The results show that rbf-SVM, rbf-GPR, XGBoost and DNN generally perform better than the other algorithms. The overall performance of the algorithms can be ranked from best to worst as follows: rbf-SVM > XGBoost > rbf-GPR > Cubist > GBM > DNN > RF > pca-ANN > MARS > linear-GPR ≈ KNN > linear-SVM ≈ PLSR > CART ≈ PCR ≈ MLR. In terms of prediction accuracy and computational efficiency, SVM and XGBoost are recommended for regression learning on small data sets, and XGBoost is an excellent choice for large data sets. We then investigated the performance of ensemble models that integrate the predictions of multiple ML algorithms. The results show that ensembles of two or three algorithms from different categories can indeed improve on the predictions of the best individual ML algorithms.
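
The ensemble idea described in the abstract can be illustrated with a minimal sketch. The abstract only says that the predictions of multiple algorithms were integrated; the simple unweighted averaging used here, the RMSE metric, the function names and the toy numbers are all assumptions made for illustration and are not taken from the paper.

```python
import math
from itertools import combinations


def rmse(y_true, y_pred):
    """Root-mean-square error between two equally long sequences."""
    return math.sqrt(
        sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    )


def average_predictions(prediction_lists):
    """Unweighted mean of several models' predictions, sample by sample."""
    return [sum(values) / len(values) for values in zip(*prediction_lists)]


def best_ensemble(y_true, model_preds, max_size=3):
    """Try every averaged ensemble of 2..max_size models.

    Returns a (rmse, model_names) tuple for the lowest-error combination.
    """
    best = None
    for size in range(2, max_size + 1):
        for names in combinations(sorted(model_preds), size):
            averaged = average_predictions([model_preds[n] for n in names])
            score = rmse(y_true, averaged)
            if best is None or score < best[0]:
                best = (score, names)
    return best


# Hypothetical toy values, NOT results from the study: two models err in
# opposite directions, so their average cancels most of the error.
y_true = [1.0, 2.0, 3.0, 4.0]
model_preds = {
    "rbf-SVM": [1.2, 2.2, 3.2, 4.2],
    "XGBoost": [0.8, 1.8, 2.8, 3.8],
    "DNN": [1.3, 1.9, 3.4, 3.7],
}

individual = {name: rmse(y_true, preds) for name, preds in model_preds.items()}
ensemble_rmse, ensemble_members = best_ensemble(y_true, model_preds)

print("individual RMSE:", individual)
print("best ensemble:", ensemble_members, "RMSE:", ensemble_rmse)
```

In practice the ensemble members should be chosen on validation data and the final ensemble scored on a held-out test set; selecting and scoring on the same data, as this toy example does, would overstate the benefit.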
Author_xml – sequence: 1
  givenname: Zhenxing
  surname: Wu
  fullname: Wu, Zhenxing
  email: 3140101624@zju.edu.cn
– sequence: 2
  givenname: Minfeng
  surname: Zhu
  fullname: Zhu, Minfeng
  email: 330561273@qq.com
– sequence: 3
  givenname: Yu
  surname: Kang
  fullname: Kang, Yu
  email: yukang@zju.edu.cn
– sequence: 4
  givenname: Elaine Lai-Han
  surname: Leung
  fullname: Leung, Elaine Lai-Han
  email: lhleung@must.edu.mo
– sequence: 5
  givenname: Tailong
  orcidid: 0000-0003-2067-1787
  surname: Lei
  fullname: Lei, Tailong
  email: ltl_1988@126.com
– sequence: 6
  givenname: Chao
  surname: Shen
  fullname: Shen, Chao
  email: 3130101022@zju.edu.cn
– sequence: 7
  givenname: Dejun
  surname: Jiang
  fullname: Jiang, Dejun
  email: jiang_dj@zju.edu.cn
– sequence: 8
  givenname: Zhe
  surname: Wang
  fullname: Wang, Zhe
  email: wangzhehyd@163.com
– sequence: 9
  givenname: Dongsheng
  surname: Cao
  fullname: Cao, Dongsheng
  email: oriental-cds@163.com
– sequence: 10
  givenname: Tingjun
  surname: Hou
  fullname: Hou, Tingjun
  email: tingjunhou@zju.edu.cn
BackLink https://www.ncbi.nlm.nih.gov/pubmed/33313673 (View this record in MEDLINE/PubMed)
ContentType Journal Article
Copyright The Author(s) 2020. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com 2020
DOI 10.1093/bib/bbaa321
DatabaseName CrossRef
PubMed
Biotechnology Research Abstracts
Computer and Information Systems Abstracts
Technology Research Database
Engineering Research Database
ProQuest Computer Science Collection
ProQuest Health & Medical Complete (Alumni)
Advanced Technologies Database with Aerospace
Computer and Information Systems Abstracts – Academic
Computer and Information Systems Abstracts Professional
Biotechnology and BioEngineering Abstracts
Genetics Abstracts
MEDLINE - Academic
Discipline Biology
EISSN 1477-4054
ExternalDocumentID 33313673
10_1093_bib_bbaa321
10.1093/bib/bbaa321
Genre Journal Article
ISSN 1467-5463
1477-4054
IsPeerReviewed true
IsScholarly true
Issue 4
Keywords ensemble learning
support vector machine
machine learning
QSAR
XGBoost
Language English
License This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)
ORCID 0000-0003-2067-1787
PMID 33313673
PublicationDate 2021-07-01
PublicationPlace England; Oxford
PublicationTitle Briefings in bioinformatics
PublicationTitleAlternate Brief Bioinform
PublicationYear 2021
Publisher Oxford University Press
Oxford Publishing Limited (England)
References Bergstra (2021072117041900100_ref49) 2015; 8
Ghasemi (2021072117041900100_ref24) 2018; 23
Sheridan (2021072117041900100_ref30) 2016; 56
Xu (2021072117041900100_ref42) 2002; 42
Cherkasov (2021072117041900100_ref1) 2014; 57
Topliss (2021072117041900100_ref12) 1972; 15
Schroeter (2021072117041900100_ref28) 2007
Yang (2021072117041900100_ref45) 2018; 58
Hewitt (2021072117041900100_ref55) 2007; 47
Wu (2021072117041900100_ref29) 2019; 59
Xiong (2021072117041900100_ref14) 2019; 20
Schwaighofer (2021072117041900100_ref27) 2007; 47
O'Brien (2021072117041900100_ref54) 2005; 48
Hansch (2021072117041900100_ref13) 1973; 16
Gedeck (2021072117041900100_ref19) 2010; 49
Shu (2021072117041900100_ref38) 2019
Heo (2021072117041900100_ref17) 2019; 253
Xie (2021072117041900100_ref39) 2020
Gramatica (2021072117041900100_ref3) 2016; 56
Kuhn (2021072117041900100_ref48) 2008; 28
Mahé (2021072117041900100_ref25) 2006; 46
Chen (2021072117041900100_ref40) 2019; 31
Svetnik (2021072117041900100_ref31) 2005; 45
Yang (2021072117041900100_ref43) 2019; 59
Wang (2021072117041900100_ref44) 2019; 67
Bemis (2021072117041900100_ref41) 1996; 39
Bruce (2021072117041900100_ref26) 2007; 47
Muratov (2021072117041900100_ref5) 2020; 49
Martin (2021072117041900100_ref16) 2012
Tropsha (2021072117041900100_ref51) 2010; 29
Hansch (2021072117041900100_ref4) 1964; 86
Li (2021072117041900100_ref37) 2018; 15
Domingos (2021072117041900100_ref53) 2015
Hansch (2021072117041900100_ref7) 2002; 102
Livingstone (2021072117041900100_ref18) 2000; 40
Xiao (2021072117041900100_ref32) 2002; 45
Vilar (2021072117041900100_ref47) 2008; 8
Jain (2021072117041900100_ref20) 1996; 29
Piir (2021072117041900100_ref2) 2018; 126
Marchese Robinson (2021072117041900100_ref36) 2017; 57
Papa (2021072117041900100_ref34) 2008; 19
Seddon (2021072117041900100_ref11) 2012; 26
Hansch (2021072117041900100_ref6) 1996; 96
Gramatica (2021072117041900100_ref50) 2007; 26
Wolpert (2021072117041900100_ref35) 1997; 1
Zheng (2021072117041900100_ref33) 2000; 40
Ghasemi (2021072117041900100_ref23) 2018; 62
(2021072117041900100_ref46) 2010
Dearden (2021072117041900100_ref8) 2017
Fernández-Delgado (2021072117041900100_ref52) 2014; 15
Byvatov (2021072117041900100_ref22) 2003; 2
Cao (2021072117041900100_ref10) 2015; 146
Svetnik (2021072117041900100_ref21) 2003; 43
D'Souza (2021072117041900100_ref15) 2020; 25
Dearden (2021072117041900100_ref9) 2017; 2
References_xml – volume-title: TEST (Toxicity Estimation Software Tool) Ver 4.1
  year: 2012
  ident: 2021072117041900100_ref16
– volume: 40
  start-page: 185
  issue: 1
  year: 2000
  ident: 2021072117041900100_ref33
  article-title: Novel variable selection quantitative structure− property relationship approach based on the k-nearest-neighbor principle
  publication-title: J Chem Inf Comput Sci
  doi: 10.1021/ci980033m
– volume: 1
  start-page: 67
  issue: 1
  year: 1997
  ident: 2021072117041900100_ref35
  article-title: No free lunch theorems for optimization
  publication-title: IEEE Trans Evol Comput
  doi: 10.1109/4235.585893
– volume: 59
  start-page: 4587
  issue: 11
  year: 2019
  ident: 2021072117041900100_ref29
  article-title: ADMET evaluation in drug discovery. 19. Reliable prediction of human cytochrome P450 inhibition using artificial intelligence approaches
  publication-title: J Chem Inf Model
  doi: 10.1021/acs.jcim.9b00801
– volume: 8
  start-page: 1555
  issue: 18
  year: 2008
  ident: 2021072117041900100_ref47
  article-title: Medicinal chemistry and the molecular operating environment (MOE): application of QSAR and molecular docking to drug discovery
  publication-title: Curr Top Med Chem
  doi: 10.2174/156802608786786624
– volume: 86
  start-page: 1616
  issue: 8
  year: 1964
  ident: 2021072117041900100_ref4
  article-title: p-σ-π analysis. A method for the correlation of biological activity and chemical structure
  publication-title: J Am Chem Soc
  doi: 10.1021/ja01062a035
– volume: 102
  start-page: 783
  issue: 3
  year: 2002
  ident: 2021072117041900100_ref7
  article-title: Chem-bioinformatics: comparative QSAR at the interface between chemistry and biology
  publication-title: Chem Rev
  doi: 10.1021/cr0102009
– volume: 45
  start-page: 2294
  issue: 11
  year: 2002
  ident: 2021072117041900100_ref32
  article-title: Antitumor agents. 213. Modeling of epipodophyllotoxin derivatives using variable selection k nearest neighbor QSAR method
  publication-title: J Med Chem
  doi: 10.1021/jm0105427
– volume: 26
  start-page: 137
  issue: 1
  year: 2012
  ident: 2021072117041900100_ref11
  article-title: Drug design for ever, from hype to hope
  publication-title: J Comput Aid Mol Des
  doi: 10.1007/s10822-011-9519-9
– start-page: 67
  volume-title: Information Resources Management A. (ed) Oncology: breakthroughs in research and practice
  year: 2017
  ident: 2021072117041900100_ref8
  doi: 10.4018/978-1-5225-0549-5.ch003
– volume: 47
  start-page: 1460
  issue: 4
  year: 2007
  ident: 2021072117041900100_ref55
  article-title: Consensus QSAR models: do the benefits outweigh the complexity?
  publication-title: J Chem Inf Model
  doi: 10.1021/ci700016d
– volume: 59
  start-page: 3714
  issue: 9
  year: 2019
  ident: 2021072117041900100_ref43
  article-title: Structural analysis and identification of colloidal aggregators in drug discovery
  publication-title: J Chem Inf Model
  doi: 10.1021/acs.jcim.9b00541
– volume: 126
  issue: 12
  year: 2018
  ident: 2021072117041900100_ref2
  article-title: Best practices for QSAR model reporting: physical and chemical properties, ecotoxicity, environmental fate, human health, and toxicokinetics endpoints
  publication-title: Environ Health Perspect
  doi: 10.1289/EHP3264
– volume: 23
  start-page: 1784
  issue: 10
  year: 2018
  ident: 2021072117041900100_ref24
  article-title: Neural network and deep-learning algorithms used in QSAR studies: merits and drawbacks
  publication-title: Drug Discov Today
  doi: 10.1016/j.drudis.2018.06.016
– volume: 45
  start-page: 786
  issue: 3
  year: 2005
  ident: 2021072117041900100_ref31
  article-title: Boosting: an ensemble learning tool for compound classification and QSAR modeling
  publication-title: J Chem Inf Model
  doi: 10.1021/ci0500379
– volume: 96
  start-page: 1045
  issue: 3
  year: 1996
  ident: 2021072117041900100_ref6
  article-title: Comparative QSAR: toward a deeper understanding of chemicobiological interactions
  publication-title: Chem Rev
  doi: 10.1021/cr9400976
– volume: 29
  start-page: 31
  issue: 3
  year: 1996
  ident: 2021072117041900100_ref20
  article-title: Artificial neural networks: a tutorial
  publication-title: Computertomographie
– volume: 46
  start-page: 2003
  issue: 5
  year: 2006
  ident: 2021072117041900100_ref25
  article-title: The pharmacophore kernel for virtual screening with support vector machines
  publication-title: J Chem Inf Model
  doi: 10.1021/ci060138m
– volume: 31
  start-page: 3564
  issue: 9
  year: 2019
  ident: 2021072117041900100_ref40
  article-title: Graph networks as a universal machine learning framework for molecules and crystals
  publication-title: Chem Mater
  doi: 10.1021/acs.chemmater.9b01294
– volume: 39
  start-page: 2887
  issue: 15
  year: 1996
  ident: 2021072117041900100_ref41
  article-title: The properties of known drugs. 1. Molecular frameworks
  publication-title: J Med Chem
  doi: 10.1021/jm9602928
– volume: 15
  start-page: 3133
  issue: 1
  year: 2014
  ident: 2021072117041900100_ref52
  article-title: Do we need hundreds of classifiers to solve real world classification problems?
  publication-title: J Mach Learn Res
– volume: 16
  start-page: 1217
  issue: 11
  year: 1973
  ident: 2021072117041900100_ref13
  article-title: Strategy in drug design. Cluster analysis as an aid in the selection of substituents
  publication-title: J Med Chem
  doi: 10.1021/jm00269a004
– volume: 2
  start-page: 67
  issue: 2
  year: 2003
  ident: 2021072117041900100_ref22
  article-title: Support vector machine applications in bioinformatics
  publication-title: Appl Bioinformatics
– volume: 8
  issue: 1
  year: 2015
  ident: 2021072117041900100_ref49
  article-title: Hyperopt: a Python library for model selection and hyperparameter optimization
  publication-title: Comput Sci Discov
  doi: 10.1088/1749-4699/8/1/014008
– volume: 2
  start-page: 36
  issue: 2
  year: 2017
  ident: 2021072117041900100_ref9
  article-title: The history and development of quantitative structure-activity relationships (QSARs): addendum
  publication-title: Int J Quant Struct Prop Relatsh
– volume: 42
  start-page: 912
  issue: 4
  year: 2002
  ident: 2021072117041900100_ref42
  article-title: Using molecular equivalence numbers to visually explore structural features that distinguish chemical libraries
  publication-title: J Chem Inf Comput Sci
  doi: 10.1021/ci025535l
– volume: 26
  start-page: 694
  issue: 5
  year: 2007
  ident: 2021072117041900100_ref50
  article-title: Principles of QSAR models validation: internal and external
  publication-title: QSAR Comb Sci
  doi: 10.1002/qsar.200610151
– volume: 57
  start-page: 4977
  issue: 12
  year: 2014
  ident: 2021072117041900100_ref1
  article-title: QSAR modeling: where have you been? Where are you going to?
  publication-title: J Med Chem
  doi: 10.1021/jm4004285
– volume: 57
  start-page: 1773
  issue: 8
  year: 2017
  ident: 2021072117041900100_ref36
  article-title: Comparison of the predictive performance and interpretability of random forest and linear models on benchmark data sets
  publication-title: J Chem Inf Model
  doi: 10.1021/acs.jcim.6b00753
– volume: 25
  start-page: 748
  issue: 4
  year: 2020
  ident: 2021072117041900100_ref15
  article-title: Machine learning models for drug–target interactions: current knowledge and future directions
  publication-title: Drug Discov Today
  doi: 10.1016/j.drudis.2020.03.003
– volume-title: MOE Molecular Simulation Package
  year: 2010
  ident: 2021072117041900100_ref46
– year: 2020
  ident: 2021072117041900100_ref39
  article-title: MHF-Net: an interpretable deep network for multispectral and hyperspectral image fusion
  publication-title: IEEE Trans Pattern Anal Mach Intell
  doi: 10.1109/TPAMI.2020.3015691
– volume: 29
  start-page: 476
  issue: 6–7
  year: 2010
  ident: 2021072117041900100_ref51
  article-title: Best practices for QSAR model development, validation, and exploitation
  publication-title: Mol Inf
  doi: 10.1002/minf.201000061
– volume: 43
  start-page: 1947
  issue: 6
  year: 2003
  ident: 2021072117041900100_ref21
  article-title: Random forest: a classification and regression tool for compound classification and QSAR modeling
  publication-title: J Chem Inf Comput Sci
  doi: 10.1021/ci034160g
– volume: 62
  start-page: 251
  year: 2018
  ident: 2021072117041900100_ref23
  article-title: Deep neural network in QSAR studies using deep belief network
  publication-title: Appl Soft Comput
  doi: 10.1016/j.asoc.2017.09.040
– volume: 56
  start-page: 1127
  issue: 6
  year: 2016
  ident: 2021072117041900100_ref3
  article-title: A historical excursus on the statistical validation parameters for QSAR models: a clarification concerning metrics and terminology
  publication-title: J Chem Inf Model
  doi: 10.1021/acs.jcim.6b00088
– volume: 47
  start-page: 219
  issue: 1
  year: 2007
  ident: 2021072117041900100_ref26
  article-title: Contemporary QSAR classifiers compared
  publication-title: J Chem Inf Model
  doi: 10.1021/ci600332j
– volume: 56
  start-page: 2353
  issue: 12
  year: 2016
  ident: 2021072117041900100_ref30
  article-title: Extreme gradient boosting as a method for quantitative structure–activity relationships
  publication-title: J Chem Inf Model
  doi: 10.1021/acs.jcim.6b00591
– volume: 48
  start-page: 1287
  issue: 4
  year: 2005
  ident: 2021072117041900100_ref54
  article-title: Greater than the sum of its parts: combining models for useful ADMET prediction
  publication-title: J Med Chem
  doi: 10.1021/jm049254b
– volume: 49
  start-page: 3525
  issue: 11
  year: 2020
  ident: 2021072117041900100_ref5
  article-title: QSAR without borders
  publication-title: Chem Soc Rev
  doi: 10.1039/D0CS00098A
– volume: 19
  start-page: 115
  issue: 1–2
  year: 2008
  ident: 2021072117041900100_ref34
  article-title: Prediction of PAH mutagenicity in human cells by QSAR classification
  publication-title: SAR QSAR Environ Res
  doi: 10.1080/10629360701843482
– volume: 67
  start-page: 1823
  issue: 7
  year: 2019
  ident: 2021072117041900100_ref44
  article-title: FungiPAD: a free web tool for compound property evaluation and fungicide-likeness analysis
  publication-title: J Agric Food Chem
  doi: 10.1021/acs.jafc.8b06596
– volume: 253
  start-page: 29
  year: 2019
  ident: 2021072117041900100_ref17
  article-title: Deep learning driven QSAR model for environmental toxicology: effects of endocrine disrupting chemicals on human health
  publication-title: Environ Pollut
  doi: 10.1016/j.envpol.2019.06.081
– start-page: 1265
  volume-title: ChemMedChem
  year: 2007
  ident: 2021072117041900100_ref28
  article-title: Predicting lipophilicity of drug-discovery molecules using Gaussian process models
– volume-title: The Master Algorithm: How the Quest for the Ultimate Learning Machine will Remake our World
  year: 2015
  ident: 2021072117041900100_ref53
– volume: 28
  start-page: 1
  issue: 5
  year: 2008
  ident: 2021072117041900100_ref48
  article-title: Building predictive models in R using the caret package
  publication-title: J Stat Softw
  doi: 10.18637/jss.v028.i05
– volume: 15
  start-page: 1006
  issue: 10
  year: 1972
  ident: 2021072117041900100_ref12
  article-title: Utilization of operational schemes for analog synthesis in drug design
  publication-title: J Med Chem
  doi: 10.1021/jm00280a002
– volume: 146
  start-page: 494
  year: 2015
  ident: 2021072117041900100_ref10
  article-title: In silico toxicity prediction of chemicals from EPA toxicity database by kernel fusion-based support vector machines
  publication-title: Chemom Intell Lab Syst
  doi: 10.1016/j.chemolab.2015.07.009
– volume: 47
  start-page: 407
  issue: 2
  year: 2007
  ident: 2021072117041900100_ref27
  article-title: Accurate solubility prediction with error bars for electrolytes: a machine learning approach
  publication-title: J Chem Inf Model
  doi: 10.1021/ci600205g
– start-page: 1919
  year: 2019
  ident: 2021072117041900100_ref38
  article-title: Meta-weight-net: learning an explicit mapping for sample weighting
  publication-title: Adv Neural Inf Process Syst
– volume: 58
  start-page: 1725
  issue: 9
  year: 2018
  ident: 2021072117041900100_ref45
  article-title: PADFrag: a database built for the exploration of bioactive fragment space for drug discovery
  publication-title: J Chem Inf Model
  doi: 10.1021/acs.jcim.8b00285
– volume: 20
  start-page: 229
  issue: 3
  year: 2019
  ident: 2021072117041900100_ref14
  article-title: Survey of machine learning techniques for prediction of the isoform specificity of cytochrome P450 substrates
  publication-title: Curr Drug Metab
  doi: 10.2174/1389200219666181019094526
– volume: 40
  start-page: 195
  issue: 2
  year: 2000
  ident: 2021072117041900100_ref18
  article-title: The characterization of chemical structures using molecular properties. A survey
  publication-title: J Chem Inf Comput Sci
  doi: 10.1021/ci990162i
– volume: 49
  start-page: 113
  year: 2010
  ident: 2021072117041900100_ref19
  article-title: Computational analysis of structure–activity relationships
  publication-title: Prog Med Chem
– volume: 15
  start-page: 4336
  issue: 10
  year: 2018
  ident: 2021072117041900100_ref37
  article-title: Prediction of human cytochrome P450 inhibition using a multitask deep autoencoder neural network
  publication-title: Mol Pharm
  doi: 10.1021/acs.molpharmaceut.8b00110
Snippet Although a wide variety of machine learning (ML) algorithms have been utilized to learn quantitative structure–activity relationships (QSARs), there...
SubjectTerms Algorithms
Artificial neural networks
Computer applications
Datasets
Gaussian process
Learning algorithms
Learning theory
Least squares method
Linear functions
Machine learning
Neural networks
Physicochemical properties
Predictions
Principal components analysis
Radial basis function
Regression analysis
Spline functions
Structure-activity relationships
Support vector machines
Toxicity
Title Do we need different machine learning algorithms for QSAR modeling? A comprehensive assessment of 16 machine learning algorithms on 14 QSAR data sets
URI https://www.ncbi.nlm.nih.gov/pubmed/33313673
https://www.proquest.com/docview/2590043980
https://www.proquest.com/docview/2470023601
Volume 22
linkProvider Oxford University Press
DOI 10.1093/bib/bbaa321
ISSN 1467-5463
EISSN 1477-4054