A universal information theoretic approach to the identification of stopwords

Bibliographic Details
Published in	Nature Machine Intelligence, Vol. 1, no. 12, pp. 606-612
Main Authors	Gerlach, Martin; Shi, Hanyu; Amaral, Luís A. Nunes
Format	Journal Article
Language	English
Published	London: Nature Publishing Group UK, 01.12.2019
Subjects	4014/4009; 639/705/1041; 639/705/117; 639/766/259; Algorithms; Engineering; Entropy; Information retrieval; Information theory; Natural language processing; Statistical analysis; Words (language)
Summary:	One of the most widely used approaches in natural language processing and information retrieval is the so-called bag-of-words model. A common component of such methods is the removal of uninformative words, commonly referred to as stopwords. Currently, most practitioners use manually curated stopword lists. This approach is problematic because it cannot be readily generalized across knowledge domains or languages. As a result of the difficulty in rigorously defining stopwords, there have been few systematic studies on the effect of stopword removal on algorithm performance, which is reflected in the ongoing debate on whether to keep or remove stopwords. Here we address this challenge by formulating an information theoretic framework that automatically identifies uninformative words in a corpus. We show that our framework not only outperforms other stopword heuristics, but also allows for a substantial reduction of document size in applications of topic modelling. Our findings can be readily generalized to other bag-of-words-type approaches beyond language such as in the statistical analysis of transcriptomics, audio or image corpora. To better extract meaning from natural language, some less informative words can be removed before a model is trained, which is usually done by using manually curated lists of stopwords. A new information theoretic approach can identify uninformative words automatically and more accurately.
ISSN:	2522-5839
DOI:	10.1038/s42256-019-0112-6
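
Illustrative sketch (not taken from the record): the summary describes an information theoretic framework that flags uninformative words automatically, but it does not state the measure used. The Python sketch below assumes one plausible entropy-based score: the entropy of a word's distribution over documents, compared with its average entropy after word tokens are shuffled across documents. The function names (doc_entropy, word_entropies, information_scores), the Monte Carlo shuffling, and the toy corpus are all hypothetical choices made for illustration, not the authors' implementation.

```python
import math
import random
from collections import Counter, defaultdict


def doc_entropy(counts_per_doc):
    """Shannon entropy (bits) of a word's occurrence counts over documents."""
    total = sum(counts_per_doc)
    return -sum((c / total) * math.log2(c / total) for c in counts_per_doc if c > 0)


def word_entropies(docs):
    """Map each word to the entropy of its distribution across documents."""
    per_word = defaultdict(Counter)
    for doc_id, tokens in enumerate(docs):
        for tok in tokens:
            per_word[tok][doc_id] += 1
    return {w: doc_entropy(list(c.values())) for w, c in per_word.items()}


def information_scores(docs, n_shuffles=20, seed=0):
    """Hypothetical score: mean entropy under shuffling minus observed entropy.

    Assumption: a word spread over documents as evenly as chance would spread it
    scores near (or below) zero and is a stopword candidate; a word concentrated
    in few documents scores higher.
    """
    rng = random.Random(seed)
    observed = word_entropies(docs)
    lengths = [len(d) for d in docs]
    pool = [tok for d in docs for tok in d]
    null_sum = defaultdict(float)
    for _ in range(n_shuffles):
        rng.shuffle(pool)
        # Re-cut the shuffled token pool into documents of the original lengths.
        shuffled, start = [], 0
        for n in lengths:
            shuffled.append(pool[start:start + n])
            start += n
        for w, h in word_entropies(shuffled).items():
            null_sum[w] += h
    return {w: null_sum[w] / n_shuffles - observed[w] for w in observed}


# Toy corpus (made up): "the" occurs once per document, "genome" in one only.
docs = [
    "the genome genome genome was sequenced".split(),
    "the match ended in a draw".split(),
    "the senate passed a new bill".split(),
    "the storm hit during early morning".split(),
]
scores = information_scores(docs)
for word in sorted(scores, key=scores.get)[:5]:
    print(word, round(scores[word], 3))  # lowest scores = stopword candidates
```

Under these assumptions, words at the bottom of the ranking would be removed before training a bag-of-words model; where to put the cut-off is a separate choice that this sketch does not address.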