The Semantic Data Dictionary – An Approach for Describing and Annotating Data

Bibliographic Details
Published in: Data Intelligence, Vol. 2, no. 4, pp. 443-486
Main Authors: Rashid, Sabbir M., McCusker, James P., Pinheiro, Paulo, Bax, Marcello P., Santos, Henrique O., Stingone, Jeanette A., Das, Amar K., McGuinness, Deborah L.
Format: Journal Article
Language: English
Published: MIT Press, One Rogers Street, Cambridge, MA 02142-1209, USA, 01.10.2020
MIT Press Journals, The
Summary: It is common practice for data providers to include text descriptions for each column when publishing data sets in the form of data dictionaries. While these documents are useful in helping an end-user properly interpret the meaning of a column in a data set, existing data dictionaries typically are not machine-readable and do not follow a common specification standard. We introduce the Semantic Data Dictionary, a specification that formalizes the assignment of a semantic representation of data, enabling standardization and harmonization across diverse data sets. In this paper, we present our Semantic Data Dictionary work in the context of our work with biomedical data; however, the approach can be, and has been, used in a wide range of domains. The rendition of data in this form helps promote improved discovery, interoperability, reuse, traceability, and reproducibility. We present the associated research and describe how the Semantic Data Dictionary can help address existing limitations in the related literature. We discuss our approach, present an example by annotating portions of the publicly available National Health and Nutrition Examination Survey data set, present modeling challenges, and describe the use of this approach in sponsored research, including our work on a large National Institutes of Health (NIH)-funded exposure and health data portal and in the RPI-IBM collaborative Health Empowerment by Analytics, Learning, and Semantics project. We evaluate this work in comparison with traditional data dictionaries, mapping languages, and data integration tools.
Bibliography: Fall, 2020
ORCID: https://orcid.org/0000-0002-2110-6416
ORCID: https://orcid.org/0000-0003-3556-0844
ORCID: https://orcid.org/0000-0001-7037-4567
ORCID: https://orcid.org/0000-0001-8469-4043
ORCID: https://orcid.org/0000-0003-3508-8260
ORCID: https://orcid.org/0000-0003-0503-3031
ORCID: https://orcid.org/0000-0003-1085-6059
ISSN: 2641-435X
DOI: 10.1162/dint_a_00058