DIR - A semantic information resource for healthcare datasets

Bibliographic Details
Published in: 2017 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pp. 805-810
Main Authors: Jingyi Shi, Mingna Zheng, Lixia Yao, Yaorong Ge
Format: Conference Proceeding
Language: English
Published: IEEE, 01.11.2017
Summary: It is important for data scientists to have a good understanding of the availability of relevant datasets as well as the content, structure, and existing analyses of these datasets. While a number of efforts are underway to integrate the large number and variety of datasets, there is a lack of information resources that focus on the specific learning needs of targeted audiences. To address this gap, we have been developing a semantic Dataset Information Resource (DIR) framework that specifically addresses the challenges entry-level data scientists face in learning to identify, understand, and analyze major datasets, with an initial focus on healthcare. The DIR does not contain actual data from the datasets but aims to provide comprehensive knowledge about the datasets and their analyses. The framework leverages Semantic Web technologies and the W3C Dataset Description Standard for knowledge integration and representation, and includes natural language processing (NLP)-based methods to enable knowledge extraction and question answering. The prototype DIR implementation includes four major components: dataset metadata and related knowledge, search modules, question answering for frequently asked questions, and blogs. The DIR currently includes information on three commonly used large and complex healthcare datasets: HCUP, MarketScan, and MIMIC. An initial usage evaluation with health informatics students is encouraging. Further development is underway.
DOI: 10.1109/BIBM.2017.8217758