Effects of data pre-processing methods on classification of ATR-FTIR spectra of pen inks using partial least squares-discriminant analysis (PLS-DA)

Bibliographic Details
Published in: Chemometrics and Intelligent Laboratory Systems, Vol. 182, pp. 90-100
Main Authors: Lee, Loong Chuen; Liong, Choong-Yeun; Jemain, Abdul Aziz
Format: Journal Article
Language: English
Published: Elsevier B.V., 15.11.2018
Summary: In response to our review paper [L.C. Lee et al., Chemom. Intell. Lab. Systs. 163 (2017) 64–75], we present a study that explores the practical impact of data preprocessing (DP) methods on ATR-FTIR spectra. Nine common DP methods, i.e. mean centering (MC), autoscaling (AS), Pareto scaling, robust scaling, multiplicative scatter correction (MSC), normalization to sum (NS), normalization to constant vector length (NV), standard normal variate and asymmetric least squares (AsLS), were chosen for their availability in the R software and their relatively simple computation steps. An ATR-FTIR spectral dataset of blue gel pen inks originating from 10 different manufacturers (i.e. brands) was used in this work. The dataset is large (N = 1361), high-dimensional (J = 5401), multi-class (C = 10) and imbalanced. To examine the impact of substrate interference, the global spectral region was further divided, arbitrarily, into three mutually exclusive local regions that were analyzed independently. The resulting four sub-datasets (i.e. one based on the global region and three based on local regions) were then preprocessed with each DP method independently to produce 40 different sub-datasets, including the raw counterparts. Partial least squares-discriminant analysis (PLS-DA) was chosen to construct a series of 50 models by including the first 50 PLS components incrementally. The modeling was performed independently for each of the 40 sub-datasets. Each model was evaluated repeatedly using autoprediction, six variants of v-fold cross-validation (v = 2, 4, 5, 7, 10, 15) and external testing. As a result, the empirical performance of each DP method is represented by 400 different error rates (8 model validation schemes × 50 models). The performance of each DP method was then compared against that of its raw counterpart using summary statistics and hypothesis tests.
In addition, principal component analysis and hierarchical clustering analysis were employed to illustrate, respectively, the spatial distribution of and the similarity between the nine DP methods and the raw counterparts. Several important remarks are drawn from the rigorous comparative analyses. First, owing to the inherent properties of ATR-FTIR spectra, DP methods that handle slope, e.g. MSC and AsLS, appeared to be the best-performing DP methods. Second, the normalization methods, NS and NV, ranked second. Third, MC shows no impact on the raw IR spectral dataset. Fourth, outliers in the ATR-FTIR spectra of pen inks could be localized. Finally, removal of irrelevant signals arising from the sample substrate is best achieved via region truncation rather than via PLS or DP methods alone.
• ATR-FTIR spectra of pen inks overlap seriously with substrate interference.
• ATR-FTIR spectra are best preprocessed via slope-correcting algorithms.
• Normalization methods are the second best-performing data preprocessing methods.
• Mean centering shows no effect on the ATR-FTIR spectral dataset.
• Outliers in the ATR-FTIR spectra of pen inks could be localized.
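Several of the DP methods named in the summary can be illustrated with a minimal Python sketch. This is not the authors' code (the study used R); the function names, the synthetic spectra and the exact definitions are assumptions chosen for illustration. In particular, NS is assumed here to divide each spectrum by the sum of its absolute intensities, and MSC is assumed to use the mean spectrum as the reference. Robust scaling and AsLS are omitted because their definitions vary more between implementations.

```python
import numpy as np

# Rows of X are spectra (samples); columns are wavenumber variables.

def mean_center(X):
    """Column-wise: subtract each variable's mean."""
    return X - X.mean(axis=0)

def autoscale(X):
    """Column-wise: mean center, then divide by each variable's standard deviation."""
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

def pareto_scale(X):
    """Column-wise: mean center, then divide by the square root of the standard deviation."""
    return (X - X.mean(axis=0)) / np.sqrt(X.std(axis=0, ddof=1))

def normalize_sum(X):
    """Row-wise: divide each spectrum by the sum of its absolute intensities (assumed definition)."""
    return X / np.abs(X).sum(axis=1, keepdims=True)

def normalize_vector_length(X):
    """Row-wise: divide each spectrum by its Euclidean norm."""
    return X / np.linalg.norm(X, axis=1, keepdims=True)

def snv(X):
    """Row-wise standard normal variate: center and scale each spectrum by its own mean and SD."""
    return (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, ddof=1, keepdims=True)

def msc(X, reference=None):
    """Multiplicative scatter correction against a reference spectrum (mean spectrum by default)."""
    ref = X.mean(axis=0) if reference is None else reference
    out = np.empty_like(X, dtype=float)
    for i, x in enumerate(X):
        # Fit x ≈ slope * ref + intercept, then remove both effects.
        slope, intercept = np.polyfit(ref, x, 1)
        out[i] = (x - intercept) / slope
    return out

# Synthetic example (hypothetical data): one base spectrum distorted by
# a random multiplicative factor and a random additive offset per sample.
rng = np.random.default_rng(0)
base = np.sin(np.linspace(0.0, 3.0, 50)) + 2.0
gain = rng.uniform(0.5, 1.5, size=(6, 1))
offset = rng.uniform(-0.2, 0.2, size=(6, 1))
X = gain * base + offset

X_msc = msc(X)
X_snv = snv(X)
```

In this synthetic case every spectrum is an exact linear function of the mean spectrum, so MSC maps all rows onto the reference. Real ATR-FTIR spectra would only approximately satisfy that assumption.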
ISSN: 0169-7439
1873-3239
DOI: 10.1016/j.chemolab.2018.09.001