On Some Aspects of Variable Selection for Partial Least Squares Regression Models
Published in | QSAR & Combinatorial Science, Vol. 27, no. 3, pp. 302-313 |
---|---|
Main Authors | , |
Format | Journal Article |
Language | English |
Published | Weinheim: WILEY‐VCH Verlag, 01.03.2008 |
Subjects | |
ISSN | 1611-020X 1611-0218 |
DOI | 10.1002/qsar.200710043 |
Summary: | This paper explores the optimum variable selection strategy for Partial Least Squares (PLS) regression using a model dataset of cytoprotection data. The compounds of the dataset were classified using K‐means clustering applied to the standardized descriptor matrix, and ten combinations of training and test sets were generated from the resulting clusters. For each training set, PLS models were developed with the number of components optimized by leave‐one‐out $Q^2$, and the models were then validated externally using the test set compounds. For each set, a PLS model was first constructed using all descriptors (variables). The variables with the smallest standardized regression coefficients were deleted, and the next model was developed with the reduced set of variables. These steps were repeated until further reduction in the number of variables did not improve $Q^2$. In each case, statistical parameters such as predictive $R^2$ ($R^2_{\mathrm{pred}}$), the squared correlation coefficient between observed and predicted values with ($r^2$) and without ($r_0^2$) intercept, and the Root Mean Square Error of Prediction (RMSEP) were calculated from the test set compounds. For all ten sets, $Q^2$ values increase steadily on deletion of variables, while $R^2_{\mathrm{pred}}$ values show no specific trend. In no case do the highest $Q^2$ and the highest $R^2_{\mathrm{pred}}$ appear in the same trial, i.e., with the same combination of variables. This suggests that, from the viewpoint of external predictability, choosing variables for PLS based on $Q^2$ may not be optimal. Moreover, a clear separation of the $r^2$ and $r_0^2$ curves in some sets suggests that such models may not be truly predictive despite acceptable $R^2_{\mathrm{pred}}$ values. Another observation is that the coefficient of determination $R^2$ for the training set is less sensitive to the deletion of variables than validation parameters such as $Q^2$ and $R^2_{\mathrm{pred}}$. Finally, a new parameter, $r_m^2$, is suggested to indicate the external predictability of QSAR models. |
---|---|
Bibliography: | istex:F75AFE07DEA41E8DF28015AF8FF003DC63683202 ArticleID:QSAR200710043 ark:/67375/WNG-9M5VGJ0L-X |
ISSN: | 1611-020X 1611-0218 |
DOI: | 10.1002/qsar.200710043 |
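
The test-set statistics named in the summary can be illustrated with a small, dependency-free sketch. It is not taken from the article: the function name `external_validation` and its arguments are hypothetical, the through-origin coefficient is computed by regressing observed on predicted values (one of several conventions in use), and the $r_m^2$ line uses a form commonly quoted in the QSAR literature, which this record does not state, so it should be read as an assumption.

```python
import math


def external_validation(y_obs, y_pred, y_train_mean):
    """Test-set statistics for observed vs. predicted activities.

    Hypothetical helper for illustration only; not the article's code.
    """
    n = len(y_obs)
    if n != len(y_pred) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")

    mean_obs = sum(y_obs) / n
    mean_pred = sum(y_pred) / n

    # Sum of squared prediction errors on the test set.
    press = sum((o - p) ** 2 for o, p in zip(y_obs, y_pred))

    # Predictive R^2: errors scaled by deviations from the TRAINING-set mean.
    ss_train_mean = sum((o - y_train_mean) ** 2 for o in y_obs)
    r2_pred = 1.0 - press / ss_train_mean

    # r^2: squared Pearson correlation (regression line with intercept).
    s_op = sum((o - mean_obs) * (p - mean_pred) for o, p in zip(y_obs, y_pred))
    s_oo = sum((o - mean_obs) ** 2 for o in y_obs)
    s_pp = sum((p - mean_pred) ** 2 for p in y_pred)
    r2 = s_op ** 2 / (s_oo * s_pp)

    # r0^2: observed regressed on predicted with the line forced through the
    # origin (assumed convention).
    k = sum(o * p for o, p in zip(y_obs, y_pred)) / sum(p * p for p in y_pred)
    r0_2 = 1.0 - sum((o - k * p) ** 2 for o, p in zip(y_obs, y_pred)) / s_oo

    # Root Mean Square Error of Prediction.
    rmsep = math.sqrt(press / n)

    # rm^2 = r^2 * (1 - sqrt(|r^2 - r0^2|)): assumed form, not given in this
    # record; abs() guards against a negative difference.
    rm2 = r2 * (1.0 - math.sqrt(abs(r2 - r0_2)))

    return {"r2_pred": r2_pred, "r2": r2, "r0_2": r0_2, "rmsep": rmsep, "rm2": rm2}


# Predictions that are perfectly correlated with the observations but shifted
# by a constant: r^2 stays at 1 while r0^2 and R^2_pred drop.
stats = external_validation([1.0, 2.0, 3.0], [2.0, 3.0, 4.0], y_train_mean=2.0)
```

The shifted-prediction example shows why the summary treats a gap between $r^2$ and $r_0^2$ as a warning sign: correlation alone cannot detect a systematic offset, whereas the through-origin fit and the error-based statistics can.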