Curated Email-Based Code Reviews Datasets

Bibliographic Details
Published in: 2024 IEEE/ACM 21st International Conference on Mining Software Repositories (MSR), pp. 294-298
Main Authors: Liang, Mingzhao; Charoenwet, Wachiraphan; Thongtanunam, Patanamon
Format: Conference Proceeding
Language: English
Published: ACM, 15.04.2024
Summary: Code review is an important practice that improves the overall quality of a proposed patch (i.e. code changes). While much research has focused on tool-based code reviews (e.g. a Gerrit code review tool, GitHub), many traditional open-source software (OSS) projects still conduct code reviews through emails. However, due to the nature of unstructured email-based data, it can be challenging to mine email-based code reviews, hindering researchers from delving into the code review practice of such long-standing OSS projects. Therefore, this paper presents large-scale datasets of email-based code reviews of 167 projects across three OSS communities (i.e. Linux Kernel, OzLabs, and FFmpeg). We mined the data from Patchwork, a web-based patch-tracking system for email-based code review, and curated the data by grouping a submitted patch and its revised versions and grouping email aliases. Our datasets include a total of 4.2M patches with 2.1M patch groups and 169K email addresses belonging to 141K individuals. Our published artefacts include the datasets as well as a tool suite to crawl, curate, and store Patchwork data. With our datasets, future work can directly delve into an email-based code review practice of large OSS projects without additional effort in data collection and curation.
ISSN: 2574-3864
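
Illustrative note: the summary mentions grouping a submitted patch with its revised versions. The record does not describe how the authors do this, so the following is only a minimal sketch of one plausible heuristic, not the published tool suite. It assumes that revisions share a subject line once leading tags such as "[PATCH v2 1/3]" are stripped, and that they come from the same submitter. The function names (normalize_subject, patch_version, group_patches) and the sample data are hypothetical.

```python
import re
from collections import defaultdict

# Leading bracketed tags such as "[PATCH v2 1/3]" or "[RFC]" (assumed convention).
_TAGS = re.compile(r"^\s*(\[[^\]]*\]\s*)+")
# A revision marker such as "v2" inside those tags.
_VERSION = re.compile(r"\bv(\d+)\b", re.IGNORECASE)


def normalize_subject(subject):
    """Strip leading bracketed tags, lowercase, and collapse whitespace."""
    stripped = _TAGS.sub("", subject)
    return " ".join(stripped.lower().split())


def patch_version(subject):
    """Return the revision number found in the leading tags, or 1 if none."""
    tags = _TAGS.match(subject)
    if tags:
        version = _VERSION.search(tags.group(0))
        if version:
            return int(version.group(1))
    return 1


def group_patches(patches):
    """Group (subject, submitter) pairs by normalized subject and submitter.

    Each group lists the original subjects ordered by revision number.
    """
    groups = defaultdict(list)
    for subject, submitter in patches:
        groups[(normalize_subject(subject), submitter)].append(subject)
    for subjects in groups.values():
        subjects.sort(key=patch_version)
    return dict(groups)


# Hypothetical sample data, not taken from the datasets.
sample = [
    ("[PATCH v2] net: fix leak", "a@example.org"),
    ("[PATCH] net: fix leak", "a@example.org"),
    ("[PATCH] fs: add check", "b@example.org"),
]
groups = group_patches(sample)
```

A real curation pipeline would likely need more signals than the subject line (for example, renamed patches or changed submitter addresses), which is presumably why the datasets also group email aliases.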