A universal information theoretic approach to the identification of stopwords
Published in | Nature machine intelligence Vol. 1; no. 12; pp. 606 - 612 |
---|---|
Main Authors | , , |
Format | Journal Article |
Language | English |
Published | London; Nature Publishing Group UK; 01.12.2019; Nature Publishing Group |
Subjects | |
Summary: | One of the most widely used approaches in natural language processing and information retrieval is the so-called bag-of-words model. A common component of such methods is the removal of uninformative words, commonly referred to as stopwords. Currently, most practitioners use manually curated stopword lists. This approach is problematic because it cannot be readily generalized across knowledge domains or languages. As a result of the difficulty in rigorously defining stopwords, there have been few systematic studies on the effect of stopword removal on algorithm performance, which is reflected in the ongoing debate on whether to keep or remove stopwords. Here we address this challenge by formulating an information theoretic framework that automatically identifies uninformative words in a corpus. We show that our framework not only outperforms other stopword heuristics, but also allows for a substantial reduction of document size in applications of topic modelling. Our findings can be readily generalized to other bag-of-words-type approaches beyond language such as in the statistical analysis of transcriptomics, audio or image corpora.
To better extract meaning from natural language, some less informative words can be removed before a model is trained, which is usually done by using manually curated lists of stopwords. A new information theoretic approach can identify uninformative words automatically and more accurately. |
---|---|
ISSN: | 2522-5839 |
DOI: | 10.1038/s42256-019-0112-6 |
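
The summary describes scoring words by how informative they are about the documents in a corpus, without giving the formula. The sketch below is only one plausible illustration of that idea and should not be read as the article's actual method: it assumes a word's score is the entropy of its distribution over documents under a random-shuffle null model minus its observed entropy, so words spread evenly across documents score near zero. The function names `word_entropy` and `information_scores`, the shuffle count and the toy corpus are all hypothetical.

```python
import math
import random
from collections import Counter, defaultdict


def word_entropy(docs):
    """Entropy (in bits) of each word's distribution over documents."""
    per_word = defaultdict(Counter)  # word -> {document index: occurrences}
    for i, doc in enumerate(docs):
        for w in doc:
            per_word[w][i] += 1
    entropy = {}
    for w, per_doc in per_word.items():
        total = sum(per_doc.values())
        entropy[w] = -sum(
            (n / total) * math.log2(n / total) for n in per_doc.values()
        )
    return entropy


def information_scores(docs, n_shuffles=20, seed=0):
    """Assumed score: mean entropy under shuffling minus observed entropy.

    The null model (an assumption of this sketch) shuffles all tokens
    across documents while keeping every document's length fixed.
    """
    rng = random.Random(seed)
    observed = word_entropy(docs)
    lengths = [len(d) for d in docs]
    tokens = [w for d in docs for w in d]
    null_sum = defaultdict(float)
    for _ in range(n_shuffles):
        rng.shuffle(tokens)
        shuffled, start = [], 0
        for n in lengths:
            shuffled.append(tokens[start:start + n])
            start += n
        for w, h in word_entropy(shuffled).items():
            null_sum[w] += h
    return {w: null_sum[w] / n_shuffles - observed[w] for w in observed}


if __name__ == "__main__":
    # Toy corpus: "the" occurs evenly everywhere, the nouns are document-specific.
    corpus = [
        ["the", "cat"] * 3,
        ["the", "dog"] * 3,
        ["the", "fish"] * 3,
        ["the", "bird"] * 3,
    ]
    scores = information_scores(corpus)
    # Lowest-scoring words are the stopword candidates under this assumption.
    for word, score in sorted(scores.items(), key=lambda kv: kv[1]):
        print(f"{word}\t{score:.3f}")
```

Under these assumptions, words with the lowest scores would be the candidates for removal; where to place the cut-off is a separate choice that this sketch does not address.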