An Introspective Comparison of Random Forest-Based Classifiers for the Analysis of Cluster-Correlated Data by Way of RF++

Bibliographic Details
Published in	PloS one Vol. 4; no. 9; p. e7087
Main Authors	Karpievitch, Yuliya V., Hill, Elizabeth G., Leclerc, Anthony P., Dabney, Alan R., Almeida, Jonas S.
Format	Journal Article
Language	English
Published	United States: Public Library of Science (PLoS), 18.09.2009
Subjects	Algorithms; Alzheimer's disease; Bioinformatics; Biomarkers; Classification; Cluster Analysis; Clusters; Comparative analysis; Computer Simulation; Correlation analysis; Data mining; Data processing; Decision trees; Downloading; Error correction; Forests; Gene Expression Profiling - methods; Genetics and Genomics/Bioinformatics; Graphical user interface; Indexing; Machine learning; Mass spectrometry; Mass spectroscopy; Mathematics/Statistics; Models, Genetic; Models, Statistical; Molecular Biology/Bioinformatics; Normal distribution; Oligonucleotide Array Sequence Analysis - methods; Ovarian cancer; Pattern Recognition, Automated - methods; Post-processing; Post-production processing; Proteins; Proteomics; Resampling; Sampling methods; Scientific imaging; Software; Spectrometry, Mass, Matrix-Assisted Laser Desorption-Ionization - methods; Statistical methods; Variables; Windows (computer programs); United States > US; Texas; South Carolina

More Information
Summary:	Many mass spectrometry-based studies, as well as other biological experiments, produce cluster-correlated data. Failure to account for correlation among observations may cause a classification algorithm to overfit the training data and produce overoptimistic estimated error rates, and may make subsequent classifications unreliable. Current common practice for dealing with replicated data is to average each subject's set of replicate samples, which reduces the dataset size and loses information. In this manuscript we compare three approaches to dealing with cluster-correlated data: unmodified Breiman's Random Forest (URF), a forest grown using subject-level averages (SLA), and RF++ with subject-level bootstrapping (SLB). RF++, a novel Random Forest-based algorithm implemented in C++, handles cluster-correlated data through a modification of the original resampling algorithm and accommodates subject-level classification. Subject-level bootstrapping is an alternative sampling method that obviates the need to average or otherwise reduce each set of replicates to a single independent sample. Our experiments show nearly identical median classification and variable selection accuracy for SLB forests and URF forests when applied to both simulated and real datasets. However, the error rate estimated at run time by URF forests was severely underestimated. Predictably, SLA forests were more severely affected by the reduction in sample size, which led to poorer classification and variable selection accuracy. Perhaps most importantly, our results suggest that it is reasonable to utilize URF for the analysis of cluster-correlated data. Two caveats should be noted: first, correct classification error rates must be obtained using a separate test dataset, and second, an additional post-processing step is required to obtain subject-level classifications. RF++ is shown to be an effective alternative for classifying both clustered and non-clustered data.
Source code and stand-alone compiled command-line and easy-to-use graphical user interface (GUI) versions of RF++ for Windows and Linux, as well as a user manual (Supplementary File S2), are available for download at http://sourceforge.org/projects/rfpp/ under the GNU public license.
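
The two ideas in the summary, resampling whole subjects rather than individual replicates and reducing replicate-level predictions to one call per subject, can be illustrated with a minimal Python sketch. This is not the RF++ source code: the function names are hypothetical, and majority voting is only an assumed choice for the post-processing step, which the summary does not specify.

```python
import random
from collections import Counter, defaultdict


def subject_level_bootstrap(subject_ids, rng):
    """Draw subjects with replacement and return the rows of all their replicates.

    subject_ids[i] is the subject that replicate (row) i belongs to.
    Returns (in_bag_rows, out_of_bag_subjects).
    """
    rows_by_subject = defaultdict(list)
    for row, sid in enumerate(subject_ids):
        rows_by_subject[sid].append(row)

    subjects = sorted(rows_by_subject)
    # Resample at the subject level: as many draws as there are subjects.
    drawn = [rng.choice(subjects) for _ in subjects]
    # Every replicate of a drawn subject enters the bag together, so no
    # subject contributes rows to both the in-bag and out-of-bag sets.
    in_bag_rows = [row for sid in drawn for row in rows_by_subject[sid]]
    out_of_bag_subjects = set(subjects) - set(drawn)
    return in_bag_rows, out_of_bag_subjects


def subject_level_vote(subject_ids, replicate_predictions):
    """Collapse replicate-level predictions to one label per subject.

    Assumption: simple majority vote; ties go to the label seen first.
    """
    votes = defaultdict(Counter)
    for sid, label in zip(subject_ids, replicate_predictions):
        votes[sid][label] += 1
    return {sid: counts.most_common(1)[0][0] for sid, counts in votes.items()}


if __name__ == "__main__":
    ids = ["s1", "s1", "s1", "s2", "s2", "s3", "s3", "s3"]
    in_bag, oob = subject_level_bootstrap(ids, random.Random(0))
    print("in-bag rows:", in_bag)
    print("out-of-bag subjects:", sorted(oob))
    print(subject_level_vote(ids, [1, 1, 0, 0, 0, 1, 0, 1]))
```

Because replicates of one subject never straddle the in-bag and out-of-bag sets, an error estimate computed on the out-of-bag subjects is not flattered by correlated replicates, which is the failure the summary reports for the unmodified forest.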
Bibliography:	Conceived and designed the experiments: YVK EGH ARD. Performed the experiments: YVK APL. Analyzed the data: YVK APL JSA. Wrote the paper: YVK EGH APL ARD JSA.
ISSN:	1932-6203
DOI:	10.1371/journal.pone.0007087