Theory-Driven Analysis of Large Corpora: Semisupervised Topic Classification of the UN Speeches

Bibliographic Details
Published in	Social science computer review Vol. 40; no. 2; pp. 346 - 366
Main Authors	Watanabe, Kohei; Zhou, Yuan
Format	Journal Article
Language	English
Published	Los Angeles, CA: SAGE Publications, Inc., 01.04.2022
Subjects	Classification; Community relations; Dictionaries; Dirichlet problem; Entropy; Experiments; International relations; Machine learning; Quantitative analysis; Sentences; Speeches; international relations; dictionary making; text analysis; semisupervised learning; United Nations

More Information
Summary:	There is growing interest in quantitative analysis of large corpora among international relations (IR) scholars, but many of them find it difficult to perform analysis consistent with existing theoretical frameworks using unsupervised machine learning models to further develop the field. To solve this problem, we created a set of techniques that utilize a semisupervised model that allows researchers to classify documents into predefined categories efficiently. We propose a dictionary-making procedure that avoids the inclusion of words that are likely to confuse the model and deteriorate its classification accuracy, using a new entropy-based diagnostic tool. In our experiments, we classify sentences of the United Nations General Assembly speeches into six predefined categories using seeded latent Dirichlet allocation and Newsmap, which were trained with a small “seed word dictionary” that we created following the procedure. The results show that, while a keyword dictionary can only classify 25% of sentences, Newsmap can classify over 60% of them correctly; its accuracy exceeds 70% when contextual information is taken into consideration by kernel smoothing of topic likelihoods. We argue that once seed word dictionaries are created by the international relations community, semisupervised models would become more useful than unsupervised models for theory-driven text analysis.
ISSN:	0894-4393, 1552-8286
DOI:	10.1177/0894439320907027