A new paradigm for high‐dimensional data: Distance‐based semiparametric feature aggregation framework via between‐subject attributes

Bibliographic Details
Published in Scandinavian journal of statistics Vol. 51; no. 2; pp. 672-696
Main Authors Liu, Jinyuan; Zhang, Xinlian; Lin, Tuo; Chen, Ruohui; Zhong, Yuan; Chen, Tian; Wu, Tsungchin; Liu, Chenyu; Huang, Anna; Nguyen, Tanya T.; Lee, Ellen E.; Jeste, Dilip V.; Tu, Xin M.
Format Journal Article
Language English
Published England Blackwell Publishing Ltd 01.06.2024
Summary: This article proposes a distance‐based framework incentivized by the paradigm shift toward feature aggregation for high‐dimensional data, which does not rely on the sparse‐feature assumption or the permutation‐based inference. Focusing on distance‐based outcomes that preserve information without truncating any features, a class of semiparametric regression has been developed, which encapsulates multiple sources of high‐dimensional variables using pairwise outcomes of between‐subject attributes. Further, we propose a strategy to address the interlocking correlations among pairs via the U‐statistics‐based estimating equations (UGEE), which correspond to their unique efficient influence function (EIF). Hence, the resulting semiparametric estimators are robust to distributional misspecification while enjoying root‐n consistency and asymptotic optimality to facilitate inference. In essence, the proposed approach not only circumvents information loss due to feature selection but also improves the model's interpretability and computational feasibility. Simulation studies and applications to the human microbiome and wearables data are provided, where the feature dimensions are tens of thousands.
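Illustrative note: the idea of "pairwise outcomes of between‐subject attributes" can be sketched roughly as follows. This is not the authors' UGEE estimator. It is a minimal illustration, assuming Euclidean distance as the between‐subject outcome, a single binary covariate, and a plain least‐squares fit over all pairs. The function names `pairwise_outcomes` and `fit_pairwise_linear` and the simulated data are hypothetical.

```python
import numpy as np


def pairwise_outcomes(features, covariate):
    """For every unordered pair (i, j) with i < j, return the Euclidean
    distance between the two feature rows and the absolute difference
    in the covariate. No feature is dropped or selected."""
    n = features.shape[0]
    i_idx, j_idx = np.triu_indices(n, k=1)  # all pairs with i < j
    dist = np.linalg.norm(features[i_idx] - features[j_idx], axis=1)
    cov_diff = np.abs(covariate[i_idx] - covariate[j_idx])
    return dist, cov_diff


def fit_pairwise_linear(dist, cov_diff):
    """Least-squares point estimate for dist ~ b0 + b1 * cov_diff.
    Assumption: a linear mean model, chosen only for illustration.
    Naive standard errors would be invalid here, because pairs that
    share a subject are correlated."""
    design = np.column_stack([np.ones_like(cov_diff), cov_diff])
    beta, *_ = np.linalg.lstsq(design, dist, rcond=None)
    return beta


# Simulated data (hypothetical): 30 subjects, 2000 features.
rng = np.random.default_rng(0)
n_subjects, n_features = 30, 2000
group = rng.integers(0, 2, size=n_subjects).astype(float)
features = rng.normal(size=(n_subjects, n_features)) + 0.2 * group[:, None]

dist, cov_diff = pairwise_outcomes(features, group)
beta = fit_pairwise_linear(dist, cov_diff)
print("number of pairs:", dist.size)
print("intercept, slope:", beta)
```

The sketch yields point estimates only. According to the summary, valid inference in the presence of the correlation among pairs is what the U‐statistics‐based estimating equations are designed to provide.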
AUTHOR CONTRIBUTIONS
JL: methodology, formal analysis, and writing – original draft. XZ: methodology, supervision, and writing – review and editing. TL, RC: conceptualization and writing – review and editing. YZ, TC, TW, CL: software and visualization. AH, TN, EL, DJ: data curation, resources, and writing – review and editing. XT: methodology, resources, supervision, and writing – review and editing. All authors reviewed and approved the final manuscript.
ISSN: 0303-6898, 1467-9469
DOI: 10.1111/sjos.12695