The Semantic Data Dictionary – An Approach for Describing and Annotating Data
Published in | Data Intelligence, Vol. 2, no. 4, pp. 443-486 |
---|---|
Main Authors | , , , , , , , |
Format | Journal Article |
Language | English |
Published | One Rogers Street, Cambridge, MA 02142-1209, USA; MIT Press; 01.10.2020; MIT Press Journals, The |
Summary: | It is common practice for data providers to include text descriptions for each column when publishing data sets in the form of data dictionaries. While these documents are useful in helping an end-user properly interpret the meaning of a column in a data set, existing data dictionaries typically are not machine-readable and do not follow a common specification standard. We introduce the Semantic Data Dictionary, a specification that formalizes the assignment of a semantic representation of data, enabling standardization and harmonization across diverse data sets. In this paper, we present the Semantic Data Dictionary in the context of our work with biomedical data; however, the approach can be, and has been, used in a wide range of domains. The rendition of data in this form helps promote improved discovery, interoperability, reuse, traceability, and reproducibility. We present the associated research and describe how the Semantic Data Dictionary can help address existing limitations in the related literature. We discuss our approach, present an example by annotating portions of the publicly available National Health and Nutrition Examination Survey data set, present modeling challenges, and describe the use of this approach in sponsored research, including our work on a large National Institutes of Health (NIH)-funded exposure and health data portal and in the RPI-IBM collaborative Health Empowerment by Analytics, Learning, and Semantics project. We evaluate this work in comparison with traditional data dictionaries, mapping languages, and data integration tools. |
---|---|
Bibliography: | Fall, 2020. ORCID: https://orcid.org/0000-0002-2110-6416; https://orcid.org/0000-0003-3556-0844; https://orcid.org/0000-0001-7037-4567; https://orcid.org/0000-0001-8469-4043; https://orcid.org/0000-0003-3508-8260; https://orcid.org/0000-0003-0503-3031; https://orcid.org/0000-0003-1085-6059 |
ISSN: | 2641-435X |
DOI: | 10.1162/dint_a_00058 |