Partially Supervised Learning Using an EM-Boosting Algorithm

Bibliographic Details
Published in Biometrics Vol. 60; no. 1; pp. 199-206
Main Authors Yasui, Yutaka; Pepe, Margaret; Hsu, Li; Adam, Bao-Ling; Feng, Ziding
Format Journal Article
Language English
Published 350 Main Street, Malden, MA 02148, U.S.A., and P.O. Box 1354, 9600 Garsington Road, Oxford OX4 2DQ, U.K. Blackwell Publishing 01.03.2004
International Biometric Society
ISSN 0006-341X
1541-0420
DOI 10.1111/j.0006-341X.2004.00156.x

Summary: Training data in a supervised learning problem consist of the class label and its potential predictors for a set of observations. Constructing effective classifiers from training data is the goal of supervised learning. In biomedical sciences and other scientific applications, class labels may be subject to errors. We consider a setting where there are two classes but observations with labels corresponding to one of the classes may in fact be mislabeled. The application concerns the use of protein mass-spectrometry data to discriminate between serum samples from cancer and noncancer patients. The patients in the training set are classified on the basis of tissue biopsy. Although biopsy is 100% specific in the sense that a tissue that shows itself to have malignant cells is certainly cancer, it is less than 100% sensitive. Reference gold standards that are subject to this special type of misclassification due to imperfect diagnosis certainty arise in many fields. We consider the development of a supervised learning algorithm under these conditions and refer to it as partially supervised learning. Boosting is a supervised learning algorithm geared toward high-dimensional predictor data, such as those generated in protein mass-spectrometry. We propose a modification of the boosting algorithm for partially supervised learning. The proposal is to view the true class membership of the samples that are labeled with the error-prone class label as missing data, and apply an algorithm related to the EM algorithm for minimization of a loss function. To assess the usefulness of the proposed method, we artificially mislabeled a subset of samples and applied the original and EM-modified boosting (EM-Boost) algorithms for comparison. Notable improvements in misclassification rates are observed with EM-Boost.
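The idea in the summary (treat the true class of the error-prone "noncancer" labels as missing data and alternate between estimating it and refitting a boosted classifier) can be illustrated with a minimal sketch. This is not the authors' published EM-Boost algorithm: the stump-based gradient boosting of the logistic loss, the Bayes-rule E-step, the assumption that the false-negative rate of the label is known (`fn_rate`), and every function and variable name below are hypothetical choices made for illustration only, and the data are simulated.

```python
import numpy as np

def sigmoid(f):
    return 1.0 / (1.0 + np.exp(-f))

def fit_stump(X, r):
    """Least-squares regression stump on residuals r: (feature, threshold, left, right)."""
    best, best_sse = None, np.inf
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j])[:-1]:          # skip the max so both sides are non-empty
            left = X[:, j] <= t
            lv, rv = r[left].mean(), r[~left].mean()
            sse = ((r[left] - lv) ** 2).sum() + ((r[~left] - rv) ** 2).sum()
            if sse < best_sse:
                best, best_sse = (j, t, lv, rv), sse
    return best

def stump_value(X, stump):
    j, t, lv, rv = stump
    return np.where(X[:, j] <= t, lv, rv)

def boost(X, q, rounds=30, lr=0.5):
    """Gradient boosting of the logistic loss toward soft targets q in [0, 1]."""
    F, stumps = np.zeros(len(X)), []
    for _ in range(rounds):
        # q - sigmoid(F) is the negative gradient of the cross-entropy loss in F
        j, t, lv, rv = fit_stump(X, q - sigmoid(F))
        stump = (j, t, lr * lv, lr * rv)           # store leaf values already shrunk by lr
        F += stump_value(X, stump)
        stumps.append(stump)
    return stumps

def predict_proba(stumps, X):
    return sigmoid(sum(stump_value(X, s) for s in stumps))

def em_boost(X, y, fn_rate=0.3, em_iters=5, rounds=30):
    """y == 1: trusted label; y == 0: error-prone label (true class Z treated as missing).

    fn_rate is an ASSUMED known value of P(Y = 0 | Z = 1).
    """
    q = y.astype(float)                            # start by taking every label at face value
    for _ in range(em_iters):
        stumps = boost(X, q, rounds)               # M-step: minimize the expected logistic loss
        pi = predict_proba(stumps, X)              # current estimate of P(Z = 1 | x)
        # E-step: Bayes' rule for P(Z = 1 | x, Y = 0), using P(Y = 0 | Z = 0) = 1
        post = fn_rate * pi / (fn_rate * pi + (1.0 - pi))
        q = np.where(y == 1, 1.0, post)
    return stumps, q

# Simulated example: one informative feature, one noise feature,
# and 30% of the true positives artificially relabeled as 0.
rng = np.random.default_rng(0)
z_true = np.repeat([0, 1], 80)
X = np.column_stack([rng.normal(np.where(z_true == 1, 3.0, -3.0), 1.0),
                     rng.normal(size=z_true.size)])
hidden = np.zeros(z_true.size, dtype=bool)
hidden[80:104] = True                              # true positives carrying the wrong label
y_obs = z_true.copy()
y_obs[hidden] = 0

stumps, q = em_boost(X, y_obs)
accuracy = np.mean((predict_proba(stumps, X) > 0.5) == (z_true == 1))
print("mean posterior, hidden positives:", q[hidden].mean())
print("mean posterior, true negatives:  ", q[z_true == 0].mean())
print("accuracy against true classes:   ", accuracy)
```

Because the label 1 is taken to be fully specific, its targets stay fixed at 1 and only the label-0 samples receive soft targets. With soft targets the M-step reduces to fitting the cross-entropy loss directly, so no duplication of samples is needed in this sketch.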
Bibliography: ArticleID:BIOM156
ark:/67375/WNG-3XDZR4B9-D
istex:4DC669225539A7ABF8B4D769F740B5350D8E36C4