Degree and centrality-based approaches in network-based variable selection: Insights from the Singapore Longitudinal Aging Study

Bibliographic Details
Published in	PLoS ONE, Vol. 14, no. 7, p. e0219186
Main Authors	Valenzuela, Jesus Felix Bayta; Monterola, Christopher; Tong, Victor Joo Chuan; Fülöp, Tamàs; Ng, Tze Pin; Larbi, Anis
Format	Journal Article
Language	English
Published	United States: Public Library of Science (PLoS), 18.07.2019
Subjects	Age; Aging; Aging (Biology); Aging - physiology; Algorithms; Analysis; Bioinformatics; Biology; Biology and Life Sciences; Cardiovascular disease; Clusters; Cognitive ability; Computer and Information Sciences; Datasets; Decision making; Domains; Entrepreneurship; Frailty; Genes; Health aspects; Health policy; Health surveys; Humans; Immunology; Indexing; Laboratories; Learning algorithms; Longitudinal Studies; Machine learning; Medicine; Medicine and Health Sciences; Mental health; Methods; Middle Aged; Model accuracy; Nodes; People and Places; Performance prediction; Population; ROC Curve; Singapore; Social networks; Variables; Singapore; Canada; Philippines; Quebec Canada
Summary:	We describe a network-based method to obtain a subset of representative variables from clinical data of subjects of the second Singapore Longitudinal Aging Study (SLAS-2), while largely preserving the predictive performance of the full set with regard to a multi-faceted index of successful aging, SAGE. To examine differences in predictive performance between high-degree nodes ("hubs") and high-centrality nodes ("cores"), we implement four subsetting strategies (two degree-based, two centrality-based) and obtain four surrogate sets of variables, which we use as input features for machine learning models that predict the SAGE index of subjects. All four models have variables belonging to the physical, cardiovascular, cognitive and immunological domains among their fifteen most important predictors. A fifth domain (leisure-time activities, LTA) is also present in some form. Comparing the surrogate sets by size and predictive performance, a centrality-based approach (selection of the most central variable-node within each cluster) yielded the smallest surrogate set while retaining high prediction accuracy (measured by its model's area under the curve, AUC) relative to the analogous degree-based strategy (selection of the highest-degree node per cluster). Including the next most central variables yielded negligible changes in predictive performance while more than doubling the surrogate set size. The centrality-based approach thus yields a surrogate set that offers a good balance between number of variables and prediction performance, and can act as a representative subset of the SLAS-2 clinical dataset.
Bibliography:	Competing Interests: The authors have declared that no competing interests exist. Equal senior authors for this work.
ISSN:	1932-6203
DOI:	10.1371/journal.pone.0219186
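
The Summary contrasts two ways of picking one representative variable per cluster: the highest-degree node ("hub") and the most central node ("core"). The sketch below illustrates that contrast on a toy graph only. This record does not state how the authors built the variable network, which clustering algorithm they used, or which centrality measure they chose, so the edge list, the cluster assignment, the node names, the use of closeness centrality, and the helper names (`degree`, `closeness`, `select_representatives`) are all assumptions made for illustration.

```python
from collections import deque

def degree(adj, node, members):
    """Number of neighbours of `node` that belong to the same cluster."""
    return sum(1 for nb in adj[node] if nb in members)

def closeness(adj, node, members):
    """Closeness centrality of `node` inside the subgraph induced by `members`.

    Assumption: closeness stands in for whatever centrality the paper used.
    """
    dist = {node: 0}
    queue = deque([node])
    while queue:  # breadth-first search restricted to the cluster
        cur = queue.popleft()
        for nb in adj[cur]:
            if nb in members and nb not in dist:
                dist[nb] = dist[cur] + 1
                queue.append(nb)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total else 0.0

def select_representatives(adj, clusters, score, k=1):
    """Return the k top-scoring nodes of each cluster (ties broken by name)."""
    chosen = []
    for members in clusters:
        members = set(members)
        ranked = sorted(members, key=lambda n: (-score(adj, n, members), n))
        chosen.extend(ranked[:k])
    return chosen

# Hypothetical toy network: "h" has many leaf neighbours, "m" sits mid-path.
edges = [("h", "l1"), ("h", "l2"), ("h", "l3"), ("h", "m"), ("m", "n"),
         ("n", "p"), ("p", "q"), ("q", "r"),
         ("x", "y"), ("x", "z"), ("r", "y")]
adj = {}
for a, b in edges:
    adj.setdefault(a, set()).add(b)
    adj.setdefault(b, set()).add(a)

# Hypothetical cluster assignment (the real one would come from clustering).
clusters = [
    {"h", "l1", "l2", "l3", "m", "n", "p", "q", "r"},
    {"x", "y", "z"},
]

hubs = select_representatives(adj, clusters, degree)      # degree-based
cores = select_representatives(adj, clusters, closeness)  # centrality-based
print(hubs)   # ['h', 'x']
print(cores)  # ['m', 'x']
```

In the first toy cluster the two strategies disagree: `h` has the most connections, but `m` is closer on average to every other node. Passing `k=2` would add the next-ranked node from each cluster, which mirrors the larger surrogate sets the Summary mentions.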