Crawling by Readability Level

Bibliographic Details
Published in	Computational Processing of the Portuguese Language, pp. 306-318
Main Authors	Filho, Jorge A. Wagner; Wilkens, Rodrigo; Zilio, Leonardo; Idiart, Marco; Villavicencio, Aline
Format	Book Chapter
Language	English
Published	Cham: Springer International Publishing
Series	Lecture Notes in Computer Science
Subjects	Focused crawling; Readability assessment; Web as a corpus

More Information
Summary:	The availability of annotated corpora for research on Readability Assessment is still very limited. Meanwhile, researchers increasingly use the Web as a source of written content for building very large and rich corpora, as part of the Web as Corpus (WaC) initiative. This paper proposes a framework for the automatic generation of large corpora classified by readability. It uses a supervised learning method to incorporate into a crawler a readability filter based on features with low computational cost, so that the crawler collects texts targeted at a specific reading level. We evaluate this framework by comparing a readability-assessed web-crawled corpus to a reference corpus (both corpora are available at http://www.inf.ufrgs.br/pln/resource/CrawlingByReadabilityLevel.zip). The results indicate that these features are good at separating level 1 texts (initial grades) from texts at other levels. This work produced two Portuguese corpora: the Wikilivros Readability Corpus, classified by grade level, and a crawled WaC classified by readability level.
ISBN:	9783319415512, 3319415514
ISSN:	0302-9743, 1611-3349
DOI:	10.1007/978-3-319-41552-9_31