Domain-specific language models and lexicons for tagging
Accurate and reliable part-of-speech tagging is useful for many Natural Language Processing (NLP) tasks that form the foundation of NLP-based approaches to information retrieval and data mining. In general, large annotated corpora are necessary to achieve desired part-of-speech tagger accuracy. We s...
Saved in:
Published in | Journal of biomedical informatics Vol. 38; no. 6; pp. 422 - 430 |
---|---|
Main Authors | , , , , |
Format | Journal Article |
Language | English |
Published |
United States
Elsevier Inc
01.12.2005
|
Subjects | |
Online Access | Get full text |
Cover
Loading…
Summary: | Accurate and reliable part-of-speech tagging is useful for many Natural Language Processing (NLP) tasks that form the foundation of NLP-based approaches to information retrieval and data mining. In general, large annotated corpora are necessary to achieve desired part-of-speech tagger accuracy. We show that a large annotated general-English corpus is not sufficient for building a part-of-speech tagger model adequate for tagging documents from the medical domain. However, adding a quite small domain-specific corpus to a large general-English one boosts performance to over 92% accuracy from 87% in our studies. We also suggest a number of characteristics to quantify the similarities between a training corpus and the test data. These results give guidance for creating an appropriate corpus for building a part-of-speech tagger model that gives satisfactory accuracy results on a new domain at a relatively small cost. |
---|---|
Bibliography: | ObjectType-Article-1 SourceType-Scholarly Journals-1 ObjectType-Feature-2 content type line 23 |
ISSN: | 1532-0464 1532-0480 |
DOI: | 10.1016/j.jbi.2005.02.009 |