METCC: METric learning for Confounder Control Making distance matter in high dimensional biological analysis
Main Authors | , , , |
---|---|
Format | Journal Article |
Language | English |
Published | 07.12.2018 |
Summary: | High-dimensional data acquired from biological experiments such as
next-generation sequencing are subject to a number of confounding effects.
These include technical effects, such as batch-to-batch variation from
instrument noise or sample processing and institution-specific differences in
sample acquisition and physical handling, and biological effects arising from
true but irrelevant differences in the biology of each sample, such as age
biases in diseases. Prior work has used linear methods to adjust for such batch
effects. Here, we apply contrastive metric learning by a non-linear triplet
network to optimize the ability to distinguish biologically distinct sample
classes in the presence of irrelevant technical and biological variation. Using
whole-genome cell-free DNA data from 817 patients, we demonstrate that our
approach, METric learning for Confounder Control (METCC), is able to match or
exceed the classification performance achieved using a best-in-class linear
method (HCP) or no normalization. Critically, results from METCC appear less
confounded by irrelevant technical variables like institution and batch than
those from other methods, even without access to the high-quality metadata
that many existing techniques require, offering hope for improved
generalization. |
---|---|
Bibliography: | ML4H/2018/211 |
DOI: | 10.48550/arxiv.1812.03188 |
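
The summary names a triplet network but does not give its loss function or architecture. As a rough illustration only, the sketch below shows the standard hinge-style triplet loss commonly used in contrastive metric learning; the Euclidean distance, the margin value, the function names, and the toy embeddings are all assumptions and are not taken from the paper.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet loss: max(0, d(a, p) - d(a, n) + margin).

    The loss is zero once the anchor is closer to the positive (same class)
    than to the negative (different class) by at least `margin`.
    """
    d_ap = euclidean(anchor, positive)
    d_an = euclidean(anchor, negative)
    return max(0.0, d_ap - d_an + margin)


# Toy 2-D embeddings (made-up numbers, purely illustrative).
anchor = [0.0, 0.0]
positive = [0.0, 1.0]  # same class as the anchor, e.g. from another batch
negative = [3.0, 4.0]  # different class

# Constraint satisfied: the positive is much closer than the negative.
loss_satisfied = triplet_loss(anchor, positive, negative)

# Constraint violated: roles swapped, so the "positive" is farther away.
loss_violated = triplet_loss(anchor, negative, positive)
```

In a setting like the one the summary describes, one plausible choice is to draw the positive from the same biological class as the anchor but from a different batch or institution, so that minimizing the loss pulls class members together regardless of technical origin. That sampling scheme is likewise an assumption here, not something the summary states.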