Fixed and adaptive landmark sets for finite pseudometric spaces
Main Authors | , |
---|---|
Format | Journal Article |
Language | English |
Published | 19.12.2022 |
Subjects | |
Summary: | Topological data analysis (TDA) is an expanding field that leverages
principles and tools from algebraic topology to quantify structural features of
data sets or transform them into more manageable forms. As its theoretical
foundations have been developed, TDA has shown promise in extracting useful
information from high-dimensional, noisy, and complex data such as those used
in biomedicine. To improve efficiency, these techniques may employ landmark
samplers. The heuristic maxmin procedure obtains a roughly even distribution of
sample points by implicitly constructing a cover comprising sets of uniform
radius. However, issues arise with data that vary in density or include points
with multiplicities, as are common in biomedicine. We propose an analogous
procedure, "lastfirst," based on ranked distances, which implies a cover
comprising sets of uniform cardinality. We first rigorously define the
procedure and prove that it obtains landmarks with desired properties. We then
perform benchmark tests and compare its performance to that of maxmin, on
feature detection and class prediction tasks involving simulated and real-world
biomedical data. Lastfirst is more general than maxmin in that it can be
applied to any data on which arbitrary (and not necessarily symmetric) pairwise
distances can be computed. Lastfirst is more computationally costly, but our
implementation scales at the same rate as maxmin. We find that lastfirst
achieves comparable performance on prediction tasks and outperforms maxmin on
homology detection tasks. Where the numerical values of similarity measures are
not meaningful, as in many biomedical contexts, lastfirst sampling may also
improve interpretability. |
---|---|
DOI: | 10.48550/arxiv.2212.09826 |
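
The contrast the summary draws between a cover of uniform radius and a cover of uniform cardinality can be illustrated with a small sketch. The greedy maxmin step below is the standard farthest-point heuristic. The rank-based variant is only a simplified analogue written for illustration, not the paper's definition of lastfirst, which the summary does not spell out. All function names (`greedy_maxmin`, `distance_matrix`, `rank_matrix`) are hypothetical, and Euclidean distance is an assumption made for the toy data.

```python
from math import dist


def distance_matrix(points):
    """d[l][i] = Euclidean distance from point l to point i (assumed metric)."""
    return [[dist(p, q) for q in points] for p in points]


def rank_matrix(d):
    """r[l][i] = number of points strictly closer to l than i is.

    Illustrative rank transform only; ties share a rank, and rows need not
    be symmetric even when d is. Not the paper's exact ranking scheme.
    """
    return [[sum(x < row[i] for x in row) for i in range(len(row))] for row in d]


def greedy_maxmin(cost, k, seed=0):
    """Greedy maxmin selection on a cost matrix.

    cost[l][i] is the cost of reaching point i from candidate landmark l.
    Returns (landmark indices, largest remaining cost to a nearest landmark).
    With distances, the second value is the radius of the implied cover;
    with ranks, it bounds how many points each landmark's set must reach.
    """
    n = len(cost)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of points")
    landmarks = [seed]
    nearest = list(cost[seed])  # cost from each point's cheapest landmark so far
    while len(landmarks) < k:
        nxt = max(range(n), key=nearest.__getitem__)
        if nearest[nxt] == 0:
            break  # every remaining point coincides with a chosen landmark
        landmarks.append(nxt)
        nearest = [min(a, b) for a, b in zip(nearest, cost[nxt])]
    return landmarks, max(nearest)


# Toy data with a duplicated point, as can occur with multiplicities.
points = [(0, 0), (0, 0), (1, 0), (10, 0), (10, 1)]
d = distance_matrix(points)
print(greedy_maxmin(d, 3))               # distance-based selection
print(greedy_maxmin(rank_matrix(d), 3))  # rank-based analogue
```

Passing a precomputed cost matrix keeps the greedy step independent of how costs are obtained, so the same loop accepts asymmetric or rank-valued inputs. The early exit guards against selecting the same location twice when duplicated points exhaust the distinct positions.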