Using natural language processing and machine learning to replace human content coders

Content analysis is a common and flexible technique to quantify and make sense of qualitative data in psychological research. However, the practical implementation of content analysis is extremely labor-intensive and subject to human coder errors. Applying natural language processing (NLP) technique...

Bibliographic Details
Published in Psychological methods Vol. 29; no. 6; p. 1148
Main Authors Wang, Yilei; Tian, Jingyuan; Yazar, Yagizhan; Ones, Deniz S; Landers, Richard N
Format Journal Article
Language English
Published United States 01.12.2024
Abstract Content analysis is a common and flexible technique to quantify and make sense of qualitative data in psychological research. However, the practical implementation of content analysis is extremely labor-intensive and subject to human coder errors. Applying natural language processing (NLP) techniques can help address these limitations. We explain and illustrate these techniques to psychological researchers. For this purpose, we first present a study exploring the creation of psychometrically meaningful predictions of human content codes. Using an existing database of human content codes, we build an NLP algorithm to validly predict those codes, at generally acceptable standards. We then conduct a Monte-Carlo simulation to model how four dataset characteristics (i.e., sample size, unlabeled proportion of cases, classification base rate, and human coder reliability) influence content classification performance. The simulation indicated that the influence of sample size and unlabeled proportion on model classification performance tended to be curvilinear. In addition, base rate and human coder reliability had a strong effect on classification performance. Finally, using these results, we offer practical recommendations to psychologists on the necessary dataset characteristics to achieve valid prediction of content codes to guide researchers on the use of NLP models to replace human coders in content analysis research. (PsycInfo Database Record (c) 2024 APA, all rights reserved).
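
The abstract reports that classification base rate and human coder reliability strongly affect classification performance. The following is a minimal Monte-Carlo sketch of that kind of experiment, not the authors' simulation design. Everything in it is an illustrative assumption: the function name `simulate_f1`, the single Gaussian "text feature", the midpoint-threshold classifier standing in for an NLP model, and the use of a label-flip probability (`coder_error`) as a stand-in for imperfect coder reliability. The unlabeled proportion of cases, which the study also varies, is not modeled here.

```python
import random
import statistics


def simulate_f1(n_labeled, base_rate, coder_error, n_test=2000, reps=50, seed=0):
    """Mean F1 of a toy threshold classifier, scored against simulated human codes.

    All parameters are hypothetical stand-ins for the dataset characteristics
    named in the abstract (sample size, base rate, coder reliability).
    """
    rng = random.Random(seed)

    def draw(n):
        rows = []
        for _ in range(n):
            truth = 1 if rng.random() < base_rate else 0
            # One synthetic feature: class means at -1 and +1, unit variance.
            x = rng.gauss(1.0 if truth else -1.0, 1.0)
            # The human code flips the true category with probability coder_error.
            code = 1 - truth if rng.random() < coder_error else truth
            rows.append((x, code))
        return rows

    scores = []
    for _ in range(reps):
        train, test = draw(n_labeled), draw(n_test)
        pos = [x for x, c in train if c == 1]
        neg = [x for x, c in train if c == 0]
        # Threshold halfway between the coded-class means; 0.0 if a class is absent.
        if pos and neg:
            cut = (statistics.fmean(pos) + statistics.fmean(neg)) / 2
        else:
            cut = 0.0
        tp = sum(1 for x, c in test if x > cut and c == 1)
        fp = sum(1 for x, c in test if x > cut and c == 0)
        fn = sum(1 for x, c in test if x <= cut and c == 1)
        scores.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    return statistics.fmean(scores)


if __name__ == "__main__":
    # Vary coder error at a balanced base rate, then vary base rate with perfect coders.
    for error in (0.0, 0.1, 0.3):
        print("coder_error", error, round(simulate_f1(200, 0.5, error), 3))
    for rate in (0.5, 0.25, 0.1):
        print("base_rate", rate, round(simulate_f1(200, rate, 0.0), 3))
```

Because the held-out cases are scored against the simulated human codes rather than the true categories, coder error lowers the attainable F1 even when the classifier itself is unchanged, which is one reason human reliability can bound apparent model performance.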
Author Landers, Richard N
Tian, Jingyuan
Yazar, Yagizhan
Wang, Yilei
Ones, Deniz S
Author_xml – sequence: 1
  givenname: Yilei
  orcidid: 0000-0002-3082-3038
  surname: Wang
  fullname: Wang, Yilei
  organization: Department of Psychology, University of Minnesota at Twin Cities
– sequence: 2
  givenname: Jingyuan
  orcidid: 0000-0001-5012-0797
  surname: Tian
  fullname: Tian, Jingyuan
  organization: Department of Psychology, University of Minnesota at Twin Cities
– sequence: 3
  givenname: Yagizhan
  orcidid: 0000-0001-6040-3969
  surname: Yazar
  fullname: Yazar, Yagizhan
  organization: Department of Psychology, University of Minnesota at Twin Cities
– sequence: 4
  givenname: Deniz S
  orcidid: 0000-0003-1739-8951
  surname: Ones
  fullname: Ones, Deniz S
  organization: Department of Psychology, University of Minnesota at Twin Cities
– sequence: 5
  givenname: Richard N
  orcidid: 0000-0001-5611-2923
  surname: Landers
  fullname: Landers, Richard N
  organization: Department of Psychology, University of Minnesota at Twin Cities
BackLink https://www.ncbi.nlm.nih.gov/pubmed/36006759$$D View this record in MEDLINE/PubMed
CitedBy_id crossref_primary_10_1177_1932202X231211633
crossref_primary_10_1007_s10648_024_09862_5
crossref_primary_10_1016_j_cresp_2023_100164
crossref_primary_10_1177_10944281241264027
crossref_primary_10_1108_BPMJ_11_2023_0876
crossref_primary_10_1177_25152459241296401
crossref_primary_10_1177_20413866241245314
crossref_primary_10_3758_s13428_024_02381_9
ContentType Journal Article
DBID CGR
CUY
CVF
ECM
EIF
NPM
DOI 10.1037/met0000518
DatabaseName Medline
MEDLINE
MEDLINE (Ovid)
MEDLINE
MEDLINE
PubMed
DatabaseTitle MEDLINE
Medline Complete
MEDLINE with Full Text
PubMed
MEDLINE (Ovid)
DatabaseTitleList MEDLINE
Database_xml – sequence: 1
  dbid: NPM
  name: PubMed
  url: https://proxy.k.utb.cz/login?url=http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=PubMed
  sourceTypes: Index Database
– sequence: 2
  dbid: EIF
  name: MEDLINE
  url: https://proxy.k.utb.cz/login?url=https://www.webofscience.com/wos/medline/basic-search
  sourceTypes: Index Database
DeliveryMethod no_fulltext_linktorsrc
Discipline Psychology
EISSN 1939-1463
ExternalDocumentID 36006759
Genre Journal Article
IngestDate Thu Jan 02 22:23:06 EST 2025
IsPeerReviewed true
IsScholarly true
Issue 6
Language English
LinkModel OpenURL
ORCID 0000-0003-1739-8951
0000-0001-5611-2923
0000-0001-5012-0797
0000-0001-6040-3969
0000-0002-3082-3038
PMID 36006759
ParticipantIDs pubmed_primary_36006759
PublicationCentury 2000
PublicationDate 2024-Dec
PublicationDateYYYYMMDD 2024-12-01
PublicationDate_xml – month: 12
  year: 2024
  text: 2024-Dec
PublicationDecade 2020
PublicationPlace United States
PublicationPlace_xml – name: United States
PublicationTitle Psychological methods
PublicationTitleAlternate Psychol Methods
PublicationYear 2024
SSID ssj0014384
Score 2.526594
Snippet Content analysis is a common and flexible technique to quantify and make sense of qualitative data in psychological research. However, the practical...
SourceID pubmed
SourceType Index Database
StartPage 1148
SubjectTerms Humans
Machine Learning
Natural Language Processing
Psychology - methods
Qualitative Research
Reproducibility of Results
Title Using natural language processing and machine learning to replace human content coders
URI https://www.ncbi.nlm.nih.gov/pubmed/36006759
Volume 29
linkProvider National Library of Medicine
openUrl ctx_ver=Z39.88-2004&ctx_enc=info%3Aofi%2Fenc%3AUTF-8&rfr_id=info%3Asid%2Fsummon.serialssolutions.com&rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Ajournal&rft.genre=article&rft.atitle=Using+natural+language+processing+and+machine+learning+to+replace+human+content+coders&rft.jtitle=Psychological+methods&rft.au=Wang%2C+Yilei&rft.au=Tian%2C+Jingyuan&rft.au=Yazar%2C+Yagizhan&rft.au=Ones%2C+Deniz+S&rft.date=2024-12-01&rft.eissn=1939-1463&rft.volume=29&rft.issue=6&rft.spage=1148&rft_id=info:doi/10.1037%2Fmet0000518&rft_id=info%3Apmid%2F36006759&rft_id=info%3Apmid%2F36006759&rft.externalDocID=36006759