Identification and epidemiological characterization of Type-2 diabetes sub-population using an unsupervised machine learning approach

Background Studies on Type-2 Diabetes Mellitus (T2DM) have revealed heterogeneous sub-populations in terms of underlying pathologies. However, the identification of sub-populations in epidemiological datasets remains unexplored. We here focus on the detection of T2DM clusters in epidemiological data...

Full description

Saved in:
Bibliographic Details
Published inNutrition & diabetes Vol. 12; no. 1; pp. 27 - 11
Main Authors Bej, Saptarshi, Sarkar, Jit, Biswas, Saikat, Mitra, Pabitra, Chakrabarti, Partha, Wolkenhauer, Olaf
Format Journal Article
LanguageEnglish
Published London Nature Publishing Group UK 27.05.2022
Nature Publishing Group
Subjects
Online AccessGet full text

Cover

Loading…
More Information
Summary:Background Studies on Type-2 Diabetes Mellitus (T2DM) have revealed heterogeneous sub-populations in terms of underlying pathologies. However, the identification of sub-populations in epidemiological datasets remains unexplored. We here focus on the detection of T2DM clusters in epidemiological data, specifically analysing the National Family Health Survey-4 (NFHS-4) dataset from India containing a wide spectrum of features, including medical history, dietary and addiction habits, socio-economic and lifestyle patterns of 10,125 T2DM patients. Methods Epidemiological data provide challenges for analysis due to the diverse types of features in it. In this case, applying the state-of-the-art dimension reduction tool UMAP conventionally was found to be ineffective for the NFHS-4 dataset, which contains diverse feature types. We implemented a distributed clustering workflow combining different similarity measure settings of UMAP, for clustering continuous, ordinal and nominal features separately. We integrated the reduced dimensions from each feature-type-distributed clustering to obtain interpretable and unbiased clustering of the data. Results Our analysis reveals four significant clusters, with two of them comprising mainly of non-obese T2DM patients. These non-obese clusters have lower mean age and majorly comprises of rural residents. Surprisingly, one of the obese clusters had 90% of the T2DM patients practising a non-vegetarian diet though they did not show an increased intake of plant-based protein-rich foods. Conclusions From a methodological perspective, we show that for diverse data types, frequent in epidemiological datasets, feature-type-distributed clustering using UMAP is effective as opposed to the conventional use of the UMAP algorithm. The application of UMAP-based clustering workflow for this type of dataset is novel in itself. Our findings demonstrate the presence of heterogeneity among Indian T2DM patients with regard to socio-demography and dietary patterns. From our analysis, we conclude that the existence of significant non-obese T2DM sub-populations characterized by younger age groups and economic disadvantage raises the need for different screening criteria for T2DM among rural Indian residents.
Bibliography:ObjectType-Article-1
SourceType-Scholarly Journals-1
ObjectType-Feature-2
content type line 23
ISSN:2044-4052
2044-4052
DOI:10.1038/s41387-022-00206-2