Identifying Cases of Type 2 Diabetes in Heterogeneous Data Sources: Strategy from the EMIF Project

Bibliographic Details
Published in	PloS one Vol. 11; no. 8; p. e0160648
Main Authors	Roberto, Giuseppe, Leal, Ingrid, Sattar, Naveed, Loomis, A. Katrina, Avillach, Paul, Egger, Peter, van Wijngaarden, Rients, Ansell, David, Reisberg, Sulev, Tammesoo, Mari-Liis, Alavere, Helene, Pasqua, Alessandro, Pedersen, Lars, Cunningham, James, Tramontan, Lara, Mayer, Miguel A., Herings, Ron, Coloma, Preciosa, Lapi, Francesco, Sturkenboom, Miriam, van der Lei, Johan, Schuemie, Martijn J., Rijnbeek, Peter, Gini, Rosa
Format	Journal Article
Language	English
Published	United States: Public Library of Science (PLoS), 31.08.2016
Subjects	Age; Algorithms; Biology and Life Sciences; Building components; Causes of; Chronic illnesses; Collaboration; Data Mining - methods; Data sources; Databases, Factual; Derivation; Diabetes; Diabetes mellitus; Diabetes mellitus (non-insulin dependent); Diabetes Mellitus, Type 2 - epidemiology; Diabetis; Diagnostic systems; Documentation; Drugs; Electronic health records; Epidemiology; Europe - epidemiology; Female; Genomes; Health care; Health informatics; Heart; Heterogeneity; Hospitals; Humans; Information systems; Male; Medical diagnosis; Medical informatics; Medical records; Medical research; Medical screening; Medicine and Health Sciences; Mortality; Physical Sciences; Physicians; Primary care; Protocols clínics; Research and Analysis Methods; Source studies; Standard data; Type 2 diabetes; United States; New York; Netherlands; United Kingdom > UK; United States > US; Denmark; Italy; Aarhus Denmark; Estonia; Spain

More Information
Summary:	Due to the heterogeneity of existing European sources of observational healthcare data, data source-tailored choices are needed to execute multi-data source, multi-national epidemiological studies. This makes transparent documentation paramount. In this proof-of-concept study, a novel standard data derivation procedure was tested in a set of heterogeneous data sources. Identification of subjects with type 2 diabetes (T2DM) was the test case. We included three primary care data sources (PCDs), three record linkage of administrative and/or registry data sources (RLDs), one hospital and one biobank. Overall, data from 12 million subjects from six European countries were extracted. Based on a shared event definition, sixteen standard algorithms (components) useful to identify T2DM cases were generated through a top-down/bottom-up iterative approach. Each component was based on one single data domain among diagnoses, drugs, diagnostic test utilization and laboratory results. Diagnoses-based components were subclassified considering the healthcare setting (primary, secondary, inpatient care). The Unified Medical Language System was used for semantic harmonization within data domains. Individual components were extracted and the proportion of the population identified was compared across data sources. Drug-based components performed similarly in RLDs and PCDs, unlike diagnoses-based components. Using components as building blocks, logical combinations with AND, OR, AND NOT were tested and local experts recommended their preferred data source-tailored combination. The population identified per data source by the resulting algorithms varied from 3.5% to 15.7%; however, age-specific results were fairly comparable. The impact of individual components was assessed: diagnoses-based components identified the majority of cases in PCDs (93-100%), while drug-based components were the main contributors in RLDs (81-100%).
The proposed data derivation procedure allowed the generation of data source-tailored case-finding algorithms in a standardized fashion, facilitated transparent documentation of the process and benchmarking of data sources, and provided bases for interpretation of possible inter-data source inconsistency of findings in future studies.
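As an illustration only, the building-block idea in the summary (components combined with AND, OR, AND NOT, then compared by proportion of population identified) can be sketched in Python with sets of subject identifiers. The component names, subject IDs and the particular combination below are invented for this sketch and are not taken from the study.

```python
# Hypothetical sketch: component names, subject IDs and the chosen
# combination are assumptions for illustration, not the study's data.
population = set(range(1, 21))  # 20 subjects in an imaginary data source

# Each component is the set of subjects it identifies, drawn from one data domain.
components = {
    "dx_primary_care": {1, 2, 3, 4},
    "dx_inpatient": {3, 5},
    "drug_antidiabetic": {2, 3, 6, 7},
    "drug_insulin_only": {7},
}


def proportion(cases, population):
    """Share of the population identified by a set of cases."""
    return len(cases & population) / len(population)


# OR is set union, AND is intersection, AND NOT is set difference.
any_dx = components["dx_primary_care"] | components["dx_inpatient"]               # OR
drug_not_insulin = components["drug_antidiabetic"] - components["drug_insulin_only"]  # AND NOT

# A data source-tailored combination: any diagnosis OR a non-insulin-only drug record.
cases = any_dx | drug_not_insulin

# Impact of each domain: share of final cases that the domain alone would find.
dx_share = len(cases & any_dx) / len(cases)
drug_share = len(cases & drug_not_insulin) / len(cases)

print(f"proportion identified: {proportion(cases, population):.1%}")
print(f"found by diagnoses: {dx_share:.0%}, found by drugs: {drug_share:.0%}")
```

Representing components as sets keeps each logical operator a single expression, so a different data source can swap in its own preferred combination without changing the component definitions.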
Bibliography:	Conceptualization: RG GR IL PR MJS. Data curation: GR RG PR. Formal analysis: RG GR. Investigation: RG IL PA RvW DA SR AP LP LT MAM PC PR. Methodology: RG GR IL PR. Software: RG PR. Supervision: RG. Visualization: GR RG. Writing – original draft: GR RG IL. Writing – review & editing: GR IL NS AKL PA PE RvW DA SR M-LT HA AP LP JC LT MAM RH PC FL MS JvdL MJS PR RG. Competing Interests: The authors AKL, PE, DA and MJS are employed by Pfizer, GlaxoSmithKline, Cegedim and Janssen, which are commercial companies. However, this did not have any influence on the reporting or discussion of the results presented in this manuscript and does not alter our adherence to PLOS ONE policies on sharing data and materials.
ISSN:	1932-6203
DOI:	10.1371/journal.pone.0160648