A universal information theoretic approach to the identification of stopwords

Bibliographic Details
Published in Nature Machine Intelligence Vol. 1; no. 12; pp. 606-612
Main Authors Gerlach, Martin; Shi, Hanyu; Amaral, Luís A. Nunes
Format Journal Article
Language English
Published London Nature Publishing Group UK 01.12.2019
Nature Publishing Group
More Information
Summary: One of the most widely used approaches in natural language processing and information retrieval is the so-called bag-of-words model. A common component of such methods is the removal of uninformative words, commonly referred to as stopwords. Currently, most practitioners use manually curated stopword lists. This approach is problematic because it cannot be readily generalized across knowledge domains or languages. As a result of the difficulty in rigorously defining stopwords, there have been few systematic studies on the effect of stopword removal on algorithm performance, which is reflected in the ongoing debate on whether to keep or remove stopwords. Here we address this challenge by formulating an information theoretic framework that automatically identifies uninformative words in a corpus. We show that our framework not only outperforms other stopword heuristics, but also allows for a substantial reduction of document size in applications of topic modelling. Our findings can be readily generalized to other bag-of-words-type approaches beyond language such as in the statistical analysis of transcriptomics, audio or image corpora. To better extract meaning from natural language, some less informative words can be removed before a model is trained, which is usually done by using manually curated lists of stopwords. A new information theoretic approach can identify uninformative words automatically and more accurately.
ISSN: 2522-5839
DOI: 10.1038/s42256-019-0112-6