Partially Supervised Learning Using an EM-Boosting Algorithm

Bibliographic Details
Published in	Biometrics Vol. 60; no. 1; pp. 199 - 206
Main Authors	Yasui, Yutaka; Pepe, Margaret; Hsu, Li; Adam, Bao-Ling; Feng, Ziding
Format	Journal Article
Language	English
Published	Blackwell Publishing, 350 Main Street, Malden, MA 02148, U.S.A., and P.O. Box 1354, 9600 Garsington Road, Oxford OX4 2DQ, U.K.; International Biometric Society; 01.03.2004
Subjects	Algorithms; Artificial Intelligence; Biomarkers, Tumor - blood; Biometrics; Biometry; Biopsies; Blood Proteins - analysis; Datasets; Epidemiology; High-dimensional data; Humans; Learning disabilities; Logistic regression; Male; Mass Spectrometry; Misclassification; Prostate cancer; Prostatic hyperplasia; Prostatic Hyperplasia - blood; Prostatic Hyperplasia - diagnosis; Prostatic Neoplasms - blood; Prostatic Neoplasms - diagnosis; Proteomics; Test data; Training
ISSN	0006-341X 1541-0420
DOI	10.1111/j.0006-341X.2004.00156.x

More Information
Summary:	Training data in a supervised learning problem consist of the class label and its potential predictors for a set of observations. Constructing effective classifiers from training data is the goal of supervised learning. In biomedical sciences and other scientific applications, class labels may be subject to errors. We consider a setting where there are two classes but observations with labels corresponding to one of the classes may in fact be mislabeled. The application concerns the use of protein mass-spectrometry data to discriminate between serum samples from cancer and noncancer patients. The patients in the training set are classified on the basis of tissue biopsy. Although biopsy is 100% specific in the sense that a tissue that shows itself to have malignant cells is certainly cancer, it is less than 100% sensitive. Reference gold standards that are subject to this special type of misclassification due to imperfect diagnosis certainty arise in many fields. We consider the development of a supervised learning algorithm under these conditions and refer to it as partially supervised learning. Boosting is a supervised learning algorithm geared toward high-dimensional predictor data, such as those generated in protein mass-spectrometry. We propose a modification of the boosting algorithm for partially supervised learning. The proposal is to view the true class membership of the samples that are labeled with the error-prone class label as missing data, and apply an algorithm related to the EM algorithm for minimization of a loss function. To assess the usefulness of the proposed method, we artificially mislabeled a subset of samples and applied the original and EM-modified boosting (EM-Boost) algorithms for comparison. Notable improvements in misclassification rates are observed with EM-Boost.
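
The idea described in the Summary, treating the true class of samples carrying the error-prone label as missing data and alternating between re-estimating it and refitting a boosted classifier, can be illustrated with a minimal sketch. This is not the authors' EM-Boost implementation. It assumes a perfectly specific label with a known sensitivity (the hypothetical parameter `sens`), uses component-wise gradient boosting of the logistic loss as a stand-in base procedure, and runs on simulated data. All function and variable names are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def add_intercept(X):
    # Prepend a column of ones so the intercept is boosted like any other term.
    return np.hstack([np.ones((X.shape[0], 1)), X])

def boost_soft_labels(X, q, n_rounds=200, step=0.5):
    """Component-wise gradient boosting of the logistic loss with soft targets q."""
    Xc = add_intercept(X)
    coef = np.zeros(Xc.shape[1])
    norms = (Xc ** 2).sum(axis=0)
    for _ in range(n_rounds):
        resid = q - sigmoid(Xc @ coef)          # negative gradient of the loss
        fits = Xc.T @ resid / norms             # least-squares coefficient per column
        j = int(np.argmax(fits ** 2 * norms))   # column that best fits the residual
        coef[j] += step * fits[j]               # shrunken update of that column only
    return coef

def em_boost_sketch(X, y_obs, sens=0.7, n_em=10):
    """Alternate soft relabeling (E-like step) and boosting (M-like step).

    y_obs == 1 is treated as certain; y_obs == 0 is the error-prone label.
    sens is an ASSUMED known sensitivity of the labeling procedure, in (0, 1).
    """
    q = y_obs.astype(float)                     # initial guess: trust observed labels
    error_prone = y_obs == 0
    for _ in range(n_em):
        coef = boost_soft_labels(X, q)          # M-like step: fit to current soft labels
        p = sigmoid(add_intercept(X) @ coef)    # estimated P(true class = 1 | x)
        # E-like step: P(true class = 1 | observed label 0, x) under the assumed model.
        post = (1 - sens) * p / (1 - sens * p)
        q[error_prone] = post[error_prone]
    return coef, q

# Simulated illustration: 30% of true positives carry the error-prone label 0.
rng = np.random.default_rng(0)
n = 200
X = np.vstack([rng.normal(-2, 1, (n, 2)), rng.normal(2, 1, (n, 2))])
z_true = np.repeat([0, 1], n)
y_obs = z_true.copy()
y_obs[n:n + 60] = 0
coef, q = em_boost_sketch(X, y_obs)
print("mean soft label, mislabeled positives:", q[n:n + 60].mean())
print("mean soft label, true negatives:", q[:n].mean())
```

In this sketch the soft label `q` stays fixed at 1 for samples with the certain label and is re-estimated only for samples with the error-prone label, which mirrors the one-sided misclassification described in the Summary. Treating the sensitivity as known is a simplification made here to keep the example short.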