Dataset Reuse: Toward Translating Principles to Practice
| Published in | Patterns (New York, N.Y.) Vol. 1; no. 8; p. 100136 |
|---|---|
| Main Authors | , , , |
| Format | Journal Article |
| Language | English |
| Published | Elsevier Inc, 13.11.2020 |
Summary: The web provides access to millions of datasets that can have additional impact when used beyond their original context. We have little empirical insight into what makes one dataset more reusable than others and which of the existing guidelines and frameworks, if any, make a difference. In this paper, we explore potential reuse features through a literature review and present a case study on datasets on GitHub, a popular open platform for sharing code and data. We describe a corpus of more than 1.4 million data files, from over 65,000 repositories. Using GitHub's engagement metrics as proxies for dataset reuse, we relate them to reuse features from the literature and devise an initial model, using deep neural networks, to predict a dataset's reusability. This work demonstrates the practical gap between principles and actionable insights that would allow data publishers and tool designers to implement functionalities that provably facilitate reuse.
• A compilation of reusability features of datasets from the literature
• A corpus of 1.47 million datasets from 65,537 repositories sourced from GitHub
• A case study on GitHub using a five-step approach to understand projected data reuse
• A machine learning model that helps predict dataset reuse in the case of GitHub
There is plenty of advice on how to make a dataset easier to reuse, including technical standards, legal frameworks, and guidelines. This paper begins to address the gap between this advice and practice. To do so, we present a compilation of reuse features from the literature. To understand what these features look like in data projects, we carried out a case study of datasets published and shared on GitHub, a large online platform for sharing code and data.
Bibliography: | ObjectType-Article-1; SourceType-Scholarly Journals-1; ObjectType-Feature-2; Lead Contact |
ISSN: | 2666-3899 |
DOI: | 10.1016/j.patter.2020.100136 |