Dataset Reuse: Toward Translating Principles to Practice
| Published in | Patterns (New York, N.Y.) Vol. 1; no. 8; p. 100136 |
|---|---|
| Main Authors | , , , |
| Format | Journal Article |
| Language | English |
| Published | Elsevier Inc, 13.11.2020 |
Summary: The web provides access to millions of datasets that can have additional impact when used beyond their original context. We have little empirical insight into what makes one dataset more reusable than others and which of the existing guidelines and frameworks, if any, make a difference. In this paper, we explore potential reuse features through a literature review and present a case study on datasets on GitHub, a popular open platform for sharing code and data. We describe a corpus of more than 1.4 million data files, from over 65,000 repositories. Using GitHub's engagement metrics as proxies for dataset reuse, we relate them to reuse features from the literature and devise an initial model, using deep neural networks, to predict a dataset's reusability. This work demonstrates the practical gap between principles and actionable insights that would allow data publishers and tool designers to implement functionalities that provably facilitate reuse.
• A compilation of reusability features of datasets from the literature
• A corpus of 1.47 million datasets from 65,537 repositories sourced from GitHub
• A case study on GitHub using a five-step approach to understand projected data reuse
• A machine learning model that helps predict dataset reuse in the case of GitHub
There is plenty of advice on how to make a dataset easier to reuse, including technical standards, legal frameworks, and guidelines. This paper begins to address the gap between this advice and practice. To do so, we present a compilation of reuse features from the literature. To understand what these features look like in data projects, we carried out a case study of datasets published and shared on GitHub, a large online platform for sharing code and data.
Bibliography: | ObjectType-Article-1; SourceType-Scholarly Journals-1; ObjectType-Feature-2; Lead Contact |
ISSN: | 2666-3899 |
DOI: | 10.1016/j.patter.2020.100136 |