Matched pairs demonstrate robustness against inter-assay variability


Bibliographic Details
Published in	Journal of Cheminformatics, Vol. 17, no. 1, p. 8
Main Authors	Nelen, Jochem; Pérez-Sánchez, Horacio; De Winter, Hans; Van Rompaey, Dries
Format	Journal Article
Language	English
Published	Cham: Springer International Publishing, 20.01.2025; BioMed Central Ltd; Springer Nature B.V; BMC
Subjects	Assay noise; Assaying; Brief Report; ChEMBL; Chemistry; Chemistry and Materials Science; Computational Biology/Bioinformatics; Computer Applications in Chemistry; Data curation; Datasets; Documentation and Information in Chemistry; Error reduction; Impact analysis; Information management; Machine learning; Matched structural pairs; Metadata; Noise reduction; Quality assessment; Quality control; Theoretical and Computational Chemistry
Summary:	Machine learning models for chemistry require large datasets, often compiled by combining data from multiple assays. However, combining data without careful curation can introduce significant noise. While absolute values from different assays are rarely comparable, trends or differences between compounds are often assumed to be consistent. This study evaluates that assumption by analyzing potency differences between matched compound pairs across assays and assessing the impact of assay metadata curation on error reduction. We find that potency differences between matched pairs exhibit less variability than individual compound measurements, suggesting systematic assay differences may partially cancel out in paired data. Metadata curation further improves inter-assay agreement, albeit at the cost of dataset size. For minimally curated compound pairs, agreement within 0.3 pChEMBL units was found to be 44–46% for Ki and IC50 values respectively, which improved to 66–79% after curation. Similarly, the percentage of pairs with differences exceeding 1 pChEMBL unit dropped from 12–15% to 6–8% with extensive curation. These results establish a benchmark for expected noise in matched molecular pair data from the ChEMBL database, offering practical metrics for data quality assessment.
ISSN:	1758-2946
DOI:	10.1186/s13321-025-00956-y
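
Illustration:	The two agreement metrics quoted in the summary (share of pairs agreeing within 0.3 pChEMBL units, share differing by more than 1 pChEMBL unit) can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the summary does not say how pairs were matched or whether the thresholds are inclusive, so the names `PairMeasurement`, `pair_delta_disagreement`, and `agreement_metrics`, the `<=` / `>` comparisons, and the toy values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PairMeasurement:
    """pChEMBL values of compounds A and B, each measured in two assays."""
    a_assay1: float
    b_assay1: float
    a_assay2: float
    b_assay2: float


def pair_delta_disagreement(m: PairMeasurement) -> float:
    """Absolute difference between the A-minus-B potency gap seen in each assay."""
    delta1 = m.a_assay1 - m.b_assay1
    delta2 = m.a_assay2 - m.b_assay2
    return abs(delta1 - delta2)


def agreement_metrics(pairs, tight=0.3, loose=1.0):
    """Fraction of pairs agreeing within `tight` and fraction exceeding `loose`.

    Assumption: `tight` is treated as inclusive and `loose` as exclusive.
    """
    if not pairs:
        raise ValueError("at least one pair is required")
    errors = [pair_delta_disagreement(m) for m in pairs]
    n = len(errors)
    return {
        "frac_within_tight": sum(e <= tight for e in errors) / n,
        "frac_exceeding_loose": sum(e > loose for e in errors) / n,
    }


# Toy values invented for illustration only; they are not data from the study.
example_pairs = [
    PairMeasurement(7.0, 6.0, 7.5, 6.4),  # gaps 1.0 vs 1.1: close agreement
    PairMeasurement(8.0, 6.5, 7.0, 6.0),  # gaps 1.5 vs 1.0: moderate disagreement
    PairMeasurement(5.0, 7.0, 6.0, 6.5),  # gaps -2.0 vs -0.5: large disagreement
    PairMeasurement(6.0, 6.0, 6.2, 6.0),  # gaps 0.0 vs 0.2: close agreement
]
metrics = agreement_metrics(example_pairs)
```

In this formulation a constant offset between two assays shifts both compounds equally and drops out of the per-assay gap, which is one way to read the summary's remark that systematic assay differences may partially cancel in paired data.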