Crawling by Readability Level
The availability of annotated corpora for research in the area of Readability Assessment is still very limited. On the other hand, the Web is increasingly being used by researchers as a source of written content to build very large and rich corpora, in the Web as Corpus (WaC) initiative. This paper...
Saved in:
Published in | Computational Processing of the Portuguese Language pp. 306 - 318 |
---|---|
Main Authors | , , , , |
Format | Book Chapter |
Language | English |
Published |
Cham
Springer International Publishing
|
Series | Lecture Notes in Computer Science |
Subjects | |
Online Access | Get full text |
Cover
Loading…
Summary: | The availability of annotated corpora for research in the area of Readability Assessment is still very limited. On the other hand, the Web is increasingly being used by researchers as a source of written content to build very large and rich corpora, in the Web as Corpus (WaC) initiative. This paper proposes a framework for automatic generation of large corpora classified by readability. It adopts a supervised learning method to incorporate a readability filter based in features with low computational cost to a crawler, to collect texts targeted at a specific reading level. We evaluate this framework by comparing a readability-assessed web crawled corpus to a reference corpus (Both corpora are available in http://www.inf.ufrgs.br/pln/resource/CrawlingByReadabilityLevel.zip.). The results obtained indicate that these features are good at separating texts from level 1 (initial grades) from other levels. As a result of this work two Portuguese corpora were constructed: the Wikilivros Readability Corpus, classified by grade level, and a crawled WaC classified by readability level. |
---|---|
ISBN: | 9783319415512 3319415514 |
ISSN: | 0302-9743 1611-3349 |
DOI: | 10.1007/978-3-319-41552-9_31 |