A Standardized Project Gutenberg Corpus for Statistical Analysis of Natural Language and Quantitative Linguistics

Bibliographic Details
Published in	Entropy (Basel, Switzerland) Vol. 22; no. 1; p. 126
Main Authors	Gerlach, Martin; Font-Clos, Francesc
Format	Journal Article
Language	English
Published	Switzerland, MDPI AG, 20.01.2020
Subjects	Collaboration; Corpus analysis; Corpus linguistics; Datasets; Information retrieval; Jensen–Shannon divergence; Language; Linguistics; Machine learning; Metadata; Natural language; Natural language processing; Project Gutenberg; Quantitative analysis; Quantitative linguistics; Reproducibility; Statistical analysis; Tagging; Words (language)
ISSN	1099-4300
DOI	10.3390/e22010126

More Information
Summary:	The use of Project Gutenberg (PG) as a text corpus has been extremely popular in statistical analysis of language for more than 25 years. However, in contrast to other major linguistic datasets of similar importance, no consensual full version of PG exists to date. In fact, most PG studies so far either consider only a small number of manually selected books, leading to potentially biased subsets, or employ vastly different pre-processing strategies (often specified in insufficient detail), raising concerns regarding the reproducibility of published results. To address these shortcomings, we present the Standardized Project Gutenberg Corpus (SPGC), an open science approach to a curated version of the complete PG data containing more than 50,000 books and more than 3 × 10^9 word-tokens. Using different sources of annotated metadata, we not only provide a broad characterization of the content of PG, but also show different examples highlighting the potential of SPGC for investigating language variability across time, subjects, and authors. We publish our methodology in detail, the code to download and process the data, and the resulting corpus itself at three levels of granularity (raw text, time series of word tokens, and counts of words). In this way, we provide a reproducible, pre-processed, full-size version of Project Gutenberg as a new scientific resource for corpus linguistics, natural language processing, and information retrieval.
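
The three levels of granularity named in the summary can be illustrated with a minimal Python sketch. It is not the SPGC code: the function names (tokenize, count_words, jensen_shannon_divergence) are hypothetical, and the tokenizer is an assumed simplification (lowercase ASCII letters only). The Jensen–Shannon divergence, which appears among the subject terms of this record, is shown as one possible way to compare the word counts of two texts.

```python
import math
import re
from collections import Counter


def tokenize(raw_text):
    """Raw text -> sequence of word tokens.

    Assumption: keep lowercase ASCII-letter runs only; the tokenizer
    actually used for the SPGC may differ.
    """
    return re.findall(r"[a-z]+", raw_text.lower())


def count_words(tokens):
    """Sequence of word tokens -> word counts."""
    return Counter(tokens)


def jensen_shannon_divergence(counts_a, counts_b):
    """Jensen-Shannon divergence (in bits) between two count tables.

    Counts are normalized to relative frequencies p and q; the result
    is 0.5 * KL(p || m) + 0.5 * KL(q || m) with m = (p + q) / 2.
    """
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    if total_a == 0 or total_b == 0:
        raise ValueError("both count tables must be non-empty")

    divergence = 0.0
    for word in set(counts_a) | set(counts_b):
        p = counts_a.get(word, 0) / total_a
        q = counts_b.get(word, 0) / total_b
        m = 0.5 * (p + q)
        # Terms with zero probability contribute nothing to the sum.
        if p > 0:
            divergence += 0.5 * p * math.log2(p / m)
        if q > 0:
            divergence += 0.5 * q * math.log2(q / m)
    return divergence


# Toy strings stand in for two books (illustrative data only).
text_a = "The whale swam. The whale dived."
text_b = "The garden bloomed. The garden faded."

counts_a = count_words(tokenize(text_a))
counts_b = count_words(tokenize(text_b))
divergence = jensen_shannon_divergence(counts_a, counts_b)
```

With base-2 logarithms the divergence lies between 0 (identical word distributions) and 1 (no shared vocabulary), so values from texts of different lengths remain comparable.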