How to prepare data for the automatic classification of politically related beliefs expressed on Twitter? The consequences of researchers’ decisions on the number of coders, the algorithm learning procedure, and the pre-processing steps on the performance of supervised models

Bibliographic Details
Published in	Quality & Quantity, Vol. 57, no. 1, pp. 301-321
Main Author	Matuszewski, Paweł
Format	Journal Article
Language	English
Published	Dordrecht, Springer Netherlands, 01.02.2023; Springer; Springer Nature B.V.
Subjects	Accuracy; Algorithms; Automatic classification; Big Data; Classification; Cognitive style; Computational linguistics; Current events; Datasets; Decisions; Deep learning; Digital archives; Language; Language processing; Machine learning; Mathematical functions; Methodology of the Social Sciences; Methods; Natural language interfaces; Natural language processing; Opinion leaders; Political activity; Political aspects; Political attitudes; Political factors; Regression analysis; Research methodology; Researcher subject relations; Researchers; Science; Scientists; Social networks; Social research; Social science research; Social Sciences; Social scientists; Text analysis; Text categorization; Text classification; Content analysis
Summary:	Owing to recent advances in natural language processing, social scientists increasingly use automatic text classification methods. The article asks how researchers’ subjective decisions affect the performance of supervised deep learning models. The aim is to deliver practical advice for researchers concerning: (1) whether it is more efficient to monitor coders’ work to ensure a high-quality training dataset or to have every document coded once and obtain a larger dataset instead; (2) whether lemmatisation improves model performance; (3) whether it is better to apply passive learning or active learning approaches; and (4) whether the answers depend on the models’ classification tasks. The models were trained to detect whether a tweet is about current affairs or political issues, the tweet’s subject matter, and the tweet author’s stance on it. The study uses a sample of 200,000 manually coded tweets published by Polish political opinion leaders in 2019. The consequences of decisions under different conditions were checked by simulating 52,800 results using the fastText algorithm (dependent variable: F1-score). Linear regression analysis suggests that the researchers’ choices not only strongly affect model performance but may also lead, in the worst-case scenario, to a waste of funds.
ISSN:	0033-5177, 1573-7845
DOI:	10.1007/s11135-022-01372-2
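
The summary names two ideas that a short sketch may make more concrete: the F1-score used as the dependent variable, and the contrast between passive and active learning. The Python below is an illustration only and is not taken from the article. The function names, the binary label "political", random sampling as the passive strategy, and least-confidence sampling as the active strategy are all assumptions made for this example; the article may use different averaging and a different selection rule.

```python
import random


def f1_score(y_true, y_pred, positive="political"):
    """Binary F1: harmonic mean of precision and recall for one label.

    The label name "political" is a hypothetical placeholder.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision or recall is zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def pick_passive(pool_size, k, seed=0):
    """Passive learning (assumed here): code k randomly chosen documents."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(pool_size), k))


def pick_active(confidences, k):
    """Active learning (assumed here): code the k documents for which the
    current model's top-label confidence is lowest."""
    ranked = sorted(range(len(confidences)), key=lambda i: confidences[i])
    return sorted(ranked[:k])


if __name__ == "__main__":
    y_true = ["political", "political", "other", "other"]
    y_pred = ["political", "other", "political", "other"]
    print(f1_score(y_true, y_pred))              # 0.5
    print(pick_active([0.9, 0.55, 0.99, 0.6], 2))  # [1, 3]
```

In this sketch the active strategy spends the coding budget on the tweets the model is least sure about, while the passive strategy ignores the model entirely; which of the two pays off is the kind of question the summary says the study examines.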