An integrated approach for identifying wrongly labelled samples when performing classification in microarray data

Bibliographic Details
Published in	PloS one Vol. 7; no. 10; p. e46700
Main Authors	Leung, Yuk Yee, Chang, Chun Qi, Hung, Yeung Sam
Format	Journal Article
Language	English
Published	United States: Public Library of Science (PLoS), 17.10.2012
Subjects	Algorithms; Biological specimens; Biology; Breast cancer; Classification; Computer Science; Data analysis; Data processing; Databases, Genetic; Datasets; DNA microarrays; Engineering; Gene expression; Genes; Growth factors; Humans; Labeling; Labels; Leukemia; Medicine; Microarray Analysis - classification; Microarray Analysis - methods; Outliers (statistics); Staining and Labeling; Statistics as Topic - methods; Studies; Hong Kong; United States > US; China; Pennsylvania

More Information
Summary:	Using a hybrid approach for gene selection and classification is common, as the results obtained are generally better than those from performing the two tasks independently. Yet, for some microarray datasets, both the classification accuracy and the stability of the gene sets obtained still have room for improvement. This may be due to the presence of samples with wrong class labels (i.e. outliers). Outlier detection algorithms proposed so far are either not suitable for microarray data, or only solve the outlier detection problem on their own. We tackle the outlier detection problem based on a previously proposed Multiple-Filter-Multiple-Wrapper (MFMW) model, which was demonstrated to yield promising results when compared to other hybrid approaches (Leung and Hung, 2010). To incorporate outlier detection and overcome limitations of the existing MFMW model, three new features are introduced in our proposed MFMW-outlier approach: 1) an unbiased external Leave-One-Out Cross-Validation framework is developed to replace internal cross-validation in the previous MFMW model; 2) wrongly labeled samples are identified within the MFMW-outlier model; and 3) a stable set of genes is selected using an L1-norm SVM that removes any redundant genes present. Six binary-class microarray datasets were tested. Compared with outlier detection studies on the same datasets, MFMW-outlier could detect all the outliers found in the original paper (for which the data was provided for analysis), and the genes selected after outlier removal were proven to have biological relevance. We also compared MFMW-outlier with PRAPIV (Zhang et al., 2006) on the same synthetic datasets. MFMW-outlier gave better average precision and recall values in three different settings. Lastly, artificially flipped microarray datasets were created by removing our detected outliers and flipping some of the remaining samples' labels. Almost all the 'wrong' (artificially flipped) samples were detected, suggesting that MFMW-outlier was sufficiently powerful to detect outliers in high-dimensional microarray datasets.
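Illustration:	The general idea of flagging possibly mislabelled samples with an external Leave-One-Out Cross-Validation loop, and then scoring the flags with precision and recall against artificially flipped labels, can be sketched as follows. This is only a minimal sketch under stated assumptions and not the MFMW-outlier method: a simple nearest-centroid classifier stands in for the filter/wrapper pipeline and the L1-norm SVM, and the function names (`loocv_suspects`, `precision_recall`) and the toy expression values are hypothetical.

```python
from statistics import fmean


def centroid(rows):
    """Per-feature mean of a list of equal-length expression vectors."""
    return [fmean(col) for col in zip(*rows)]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def loocv_suspects(X, y):
    """Return indices whose recorded label disagrees with the held-out prediction.

    Assumption: a nearest-centroid classifier is used as a stand-in for the
    classifier of the real method. Each sample is held out in turn, the
    classifier is fit on the remaining samples only (external LOOCV), and the
    held-out sample is flagged if its prediction differs from its label.
    """
    suspects = []
    for i in range(len(X)):
        train = [(x, lab) for j, (x, lab) in enumerate(zip(X, y)) if j != i]
        centroids = {
            lab: centroid([x for x, l in train if l == lab])
            for lab in {l for _, l in train}
        }
        pred = min(centroids, key=lambda lab: sq_dist(X[i], centroids[lab]))
        if pred != y[i]:
            suspects.append(i)
    return suspects


def precision_recall(flagged, truly_flipped):
    """Precision and recall of flagged indices against known flipped indices."""
    flagged, truly_flipped = set(flagged), set(truly_flipped)
    tp = len(flagged & truly_flipped)
    precision = tp / len(flagged) if flagged else 0.0
    recall = tp / len(truly_flipped) if truly_flipped else 0.0
    return precision, recall


# Toy data (hypothetical): 8 samples x 3 "genes", two well-separated classes.
X = [
    [1.0, 1.2, 0.9], [0.8, 1.0, 1.1], [1.1, 0.9, 1.0], [1.2, 1.1, 0.8],
    [5.0, 5.2, 4.9], [4.8, 5.1, 5.0], [5.1, 4.9, 5.2], [5.2, 5.0, 4.8],
]
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
flipped = [5]                      # artificially flip the label of sample 5
y_noisy = [1 - lab if i in flipped else lab for i, lab in enumerate(y_true)]

flagged = loocv_suspects(X, y_noisy)
print("flagged samples:", flagged)
print("precision, recall:", precision_recall(flagged, flipped))
```

In this toy setting the held-out prediction for the flipped sample disagrees with its recorded label, so it is the only sample flagged. Real microarray data have far more genes than samples, which is why the summary pairs the cross-validation loop with gene selection rather than using all features directly.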
Bibliography:	Conceived and designed the experiments: YYL CQC YSH. Performed the experiments: YYL. Analyzed the data: YYL. Contributed reagents/materials/analysis tools: YYL. Wrote the paper: YYL CQC YSH. Competing Interests: The authors have declared that no competing interests exist.
ISSN:	1932-6203
DOI:	10.1371/journal.pone.0046700