Benchmarking the robustness of the correct identification of flexible 3D objects using common machine learning models
Published in | Patterns (New York, N.Y.) Vol. 6; no. 1; p. 101147 |
---|---|
Format | Journal Article |
Language | English |
Published | United States: Elsevier Inc, 10.01.2025 |
ISSN | 2666-3899 |
DOI | 10.1016/j.patter.2024.101147 |
Summary: | True three-dimensional (3D) data are prevalent in domains such as molecular science or computer vision. In these data, machine learning models are often asked to identify objects subject to intrinsic flexibility. Our study introduces two datasets from molecular science to assess the classification robustness of common model/feature combinations. Molecules are flexible, and shape alone exhibits intra-class heterogeneity that carries a high risk of confusion. By blocking training and test sets to reduce overlap, we establish a baseline requiring the trained models to abstract from shape. As training data coverage grows, all tested architectures perform better on unseen data with reduced overfitting. Empirically, 2D embeddings of voxelized data produced the best-performing models. Evidently, both featurization and task-appropriate model design remain important, the latter point being reinforced by comparisons to recent, more specialized models. Finally, we show that the shape abstraction learned from database samples extends to samples that are evolving explicitly in time.
• Recognizing different poses of flexible objects can be challenging
• We derive benchmark sets from molecular science as prototypical examples
• Representation continues to matter in modern machine learning
• Flexibility across proteins and flexibility in time are understood and compatible
Many natural objects have intrinsic flexibility, for example, through articulated joints in living beings such as humans. In applications like autonomous vehicles, it is important that a class of object captured through imaging devices in either 2D or 3D is safely identified. Differences caused by motion and flexibility are often confounded by intrinsic differences, as seen, for example, in different plants of the same type of tree. Thus, reliably recognizing such objects is a challenging problem. Our study creates a bridge to this problem scope from molecular science by offering datasets for benchmarking methods trying to solve this recognition task. Molecules are flexible and offer many unequivocal classes that can “look” very similar. We were interested in how well modern machine learning methods perform in this task when they have to rely on spatial information alone. Taking a dataset from molecular science cures some technical issues seen with imaging data, such as differences in scale, resolution, and ambiguous labels. Our research shows that the exact way in which the spatial information is encoded continues to be important, and this holds for both accuracy and transferability. The latter can be thought of as a proxy for the appropriateness and generalizability of the strategy a given model learns. Transferability is the biggest concern in fields where there are limited and often non-extensible amounts of data, such as drug discovery, digital humanities, or financial modeling, and we touch upon the implications of our results for applications of machine learning in such a setting.
Molecular science benchmark sets (FEater) are introduced for machine learning tasks involving flexible object recognition. The impact of differences in featurization and model architectures on both the expected accuracy and the achievable transferability is discussed. In addition, numerical evaluations of the performance of the workflows are provided, and the compatibility of the flexibility observed over time (molecular dynamics) with the heterogeneity found across independent observations (structural databases) is evaluated. |