Impact of Interval Censoring on Data Accuracy and Machine Learning Performance in Biological High-Throughput Screening

Bibliographic Details
Published in: bioRxiv
Main Authors: Doffini, Vanni; Nash, Michael A.
Format: Paper
Language: English
Published: Cold Spring Harbor Laboratory, 28.10.2024
Edition: 1.3
Summary: High-throughput screening (HTS) combined with deep mutational scanning (DMS) and next-generation DNA sequencing (NGS) has great potential to accelerate the discovery and optimization of biological therapeutics. Typical workflows involve generating a mutagenized variant library, screening or selecting variants based on phenotypic fitness, and comprehensively analyzing the binned variant populations by NGS. In such workflows, however, the HTS data are subject to interval censoring: each fitness value is calculated from the assignment of variants to bins rather than measured directly. This censoring increases uncertainty, which can reduce data accuracy and, consequently, the performance of machine learning (ML) algorithms tasked with predicting sequence-fitness pairings. Here, we investigated the impact of interval censoring on data quality and ML performance in biological HTS experiments. We analyzed the impact of data censoring theoretically and propose a dimensionless number, the Ratio of Discretization (RD), to assist in optimizing HTS parameters such as the bin width and the sampling size. This approach can be used to minimize errors in fitness prediction by ML and to improve the reliability of these methods. These findings are not limited to biological HTS techniques and can be applied to other systems where interval censoring is an advantageous measurement strategy.
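
To make the idea of interval censoring concrete, the following is a minimal Python sketch, not the authors' method. It assumes a toy model in which each variant has a true fitness between 0 and 1, each sorted cell reports that fitness plus Gaussian noise, and only the bin the cell lands in is recorded; the fitness estimate is then the mean of the bin midpoints. All function names, the noise level, and the uniform fitness distribution are illustrative assumptions. The sketch does not compute the Ratio of Discretization, whose definition is not given in this record.

```python
import math
import random


def bin_midpoint(signal, bin_width):
    """Return the midpoint of the bin [k*w, (k+1)*w) that contains `signal`."""
    index = math.floor(signal / bin_width)
    return (index + 0.5) * bin_width


def estimate_fitness(true_fitness, n_cells, bin_width, noise_sd, rng):
    """Estimate one variant's fitness from interval-censored observations.

    Each simulated cell reports a noisy signal, but only its bin is kept,
    so the estimate is the mean of the recorded bin midpoints.
    """
    midpoints = [
        bin_midpoint(rng.gauss(true_fitness, noise_sd), bin_width)
        for _ in range(n_cells)
    ]
    return sum(midpoints) / n_cells


def rmse_for_design(bin_width, n_cells, n_variants=500, noise_sd=0.05, seed=0):
    """Root-mean-square error of the estimates for one (hypothetical) design."""
    rng = random.Random(seed)
    squared_error = 0.0
    for _ in range(n_variants):
        truth = rng.uniform(0.0, 1.0)  # assumed toy fitness distribution
        estimate = estimate_fitness(truth, n_cells, bin_width, noise_sd, rng)
        squared_error += (estimate - truth) ** 2
    return math.sqrt(squared_error / n_variants)


if __name__ == "__main__":
    # Compare a few bin widths and sampling sizes (cells per variant).
    for bin_width in (0.5, 0.25, 0.1, 0.05):
        for n_cells in (5, 50):
            error = rmse_for_design(bin_width, n_cells)
            print(f"bin_width={bin_width:<5} n_cells={n_cells:<3} RMSE={error:.4f}")
```

Under these assumptions, wide bins leave an error that more sampling cannot remove, while narrow bins let the error shrink as the number of cells per variant grows. This is the kind of trade-off between bin width and sampling size that the summary describes.
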
Bibliography: Competing Interest Statement: The authors have declared no competing interest.
ISSN: 2692-8205
DOI: 10.1101/2024.09.25.615059