Model-Based Clustering and Prediction With Mixed Measurements Involving Surrogate Classifiers

Bibliographic Details
Published in: Statistics in Biopharmaceutical Research, Vol. 14, no. 3, pp. 368-379
Main Authors: Shen, Hua; de Leon, Alexander R.
Format: Journal Article
Language: English
Published: Taylor & Francis 03.08.2022
Summary: Identification of underlying subpopulations to account for unobserved heterogeneity in the population is a challenging statistical problem, mainly because no explicit information about the latent classes is available. Although latent class analysis via finite mixture models is often used successfully to probabilistically identify subpopulations in applications, it often fails with data for which such subpopulations exhibit high latency. Borrowing strength from readily accessible auxiliary classifiers, even when subject to misclassification, may yield improved results in such settings. We develop in this article a joint modeling approach that combines data from multiple sources, including observed characteristics that are often used alone for clustering and classification, as well as results based on imperfect surrogate classifiers, to better identify the latent classes for more accurate classification and prediction. We outline maximum likelihood estimation for the joint model using the EM algorithm, and we show empirically via simulations that our methodology yields better estimates of the underlying latent class distributions than those obtained by ignoring the auxiliary information, while providing joint assessments of the surrogate classifiers. The advantages are significant when there is high latency and the surrogate classifiers are at least moderately accurate. We use real diagnostic data on dry eye disease, for which no gold standard is available, to illustrate our methodology.
ISSN:1946-6315
DOI:10.1080/19466315.2020.1863257
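
Illustrative sketch: the Summary describes fitting a joint model for observed characteristics and imperfect surrogate classifiers by maximum likelihood with the EM algorithm. The record does not give the authors' model specification, so the code below is only a minimal sketch under assumptions of our own: two latent classes, a single Gaussian measurement per subject, and binary surrogate classifiers that are conditionally independent given the latent class. All function names, parameter names, and starting values are hypothetical and are not taken from the article.

```python
import math
import random


def normal_pdf(x, mu, sd):
    """Density of N(mu, sd^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def simulate(n, pi, mu, sd, sens, spec, rng):
    """Draw latent classes z, one Gaussian measurement x, and binary surrogates s.

    Assumption (ours): surrogates are conditionally independent given z, with
    P(s_j = 1 | z = 1) = sens[j] and P(s_j = 0 | z = 0) = spec[j].
    """
    xs, ss, zs = [], [], []
    for _ in range(n):
        z = 1 if rng.random() < pi else 0
        xs.append(rng.gauss(mu[z], sd[z]))
        ss.append([
            1 if rng.random() < (se if z == 1 else 1.0 - sp) else 0
            for se, sp in zip(sens, spec)
        ])
        zs.append(z)
    return xs, ss, zs


def clamp(p, lo=1e-6, hi=1.0 - 1e-6):
    """Keep a probability strictly inside (0, 1)."""
    return min(max(p, lo), hi)


def fit_em(xs, ss, n_iter=200):
    """EM for the assumed two-class joint model; returns estimates and posteriors."""
    n, n_surr = len(xs), len(ss[0])
    ordered = sorted(xs)
    # Crude starting values: class 1 is taken to be the higher-mean class.
    mu = [ordered[n // 4], ordered[(3 * n) // 4]]
    mean_x = sum(xs) / n
    sd_all = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sd = [sd_all, sd_all]
    pi, sens, spec = 0.5, [0.7] * n_surr, [0.7] * n_surr
    w = [0.5] * n
    for _ in range(n_iter):
        # E-step: posterior probability of class 1 given x and all surrogates.
        w = []
        for x, s in zip(xs, ss):
            a = pi * normal_pdf(x, mu[1], sd[1])
            b = (1.0 - pi) * normal_pdf(x, mu[0], sd[0])
            for j in range(n_surr):
                a *= sens[j] if s[j] else 1.0 - sens[j]
                b *= 1.0 - spec[j] if s[j] else spec[j]
            w.append(a / (a + b) if a + b > 0.0 else 0.5)
        # M-step: weighted updates of all parameters.
        sw1 = sum(w)
        sw0 = n - sw1
        pi = clamp(sw1 / n)
        mu[1] = sum(wi * x for wi, x in zip(w, xs)) / sw1
        mu[0] = sum((1.0 - wi) * x for wi, x in zip(w, xs)) / sw0
        sd[1] = max(math.sqrt(sum(wi * (x - mu[1]) ** 2 for wi, x in zip(w, xs)) / sw1), 1e-3)
        sd[0] = max(math.sqrt(sum((1.0 - wi) * (x - mu[0]) ** 2 for wi, x in zip(w, xs)) / sw0), 1e-3)
        for j in range(n_surr):
            sens[j] = clamp(sum(wi * s[j] for wi, s in zip(w, ss)) / sw1)
            spec[j] = clamp(sum((1.0 - wi) * (1 - s[j]) for wi, s in zip(w, ss)) / sw0)
    return {"pi": pi, "mu": mu, "sd": sd, "sens": sens, "spec": spec}, w


if __name__ == "__main__":
    rng = random.Random(1)
    xs, ss, zs = simulate(1000, 0.4, (0.0, 1.5), (1.0, 1.0),
                          (0.8, 0.75, 0.85), (0.85, 0.8, 0.9), rng)
    estimates, posteriors = fit_em(xs, ss)
    print(estimates)
```

In this sketch the estimated sensitivities and specificities serve as a joint assessment of the surrogate classifiers, and the posterior weights give a probabilistic classification of each subject. The simulation settings are arbitrary and do not reproduce any result reported in the article.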