Generative artificial intelligence and machine learning methods to screen social media content

Bibliographic Details
Published in	PeerJ. Computer science Vol. 11; p. e2710
Main Authors	Sharp, Kellen; Ouellette, Rachel R.; Singh, Rujula Singh Rajendra; DeVito, Elise E.; Kamdar, Neil; de la Noval, Amanda; Murthy, Dhiraj; Kong, Grace
Format	Journal Article
Language	English
Published	United States: PeerJ. Ltd; PeerJ Inc, 14.03.2025
Subjects	Analysis; Artificial Intelligence; ChatGPT; Computational linguistics; Computer Vision; Data Mining and Machine Learning; e-cigarette; Electronic cigarettes; ENDS; Generative AI; Language processing; Machine learning; Methods; Multimedia; Natural language interfaces; Network Science and Online Social Networks; Social media; Social networks; United States; Pregnancy; Vaping; TikTok
Summary:	Social media research is confronted by the expansive and constantly evolving nature of social media data. Hashtags and keywords are frequently used to identify content related to a specific topic, but these search strategies often return large numbers of irrelevant results. Therefore, methods are needed to quickly screen social media content based on a specific research question. The primary objective of this article is to present generative artificial intelligence (AI; e.g., ChatGPT) and machine learning methods to screen content from social media platforms. As a proof of concept, we apply these methods to identify TikTok content related to e-cigarette use during pregnancy. We searched TikTok for pregnancy and vaping content using 70 hashtag pairs related to "pregnancy" and "vaping" (e.g., #pregnancytok and #ecigarette) to obtain 11,673 distinct posts. We extracted post videos, descriptions, and metadata using Zeeschuimer and the PykTok library. To enhance textual analysis, we employed the Whisper automatic speech recognition system to transcribe verbal content from each video. Next, we used the OpenCV library to extract frames from the videos, followed by object and text detection analysis using Oracle Cloud Vision. Finally, we merged all text data to create a consolidated dataset and entered this dataset into ChatGPT-4 to determine which posts are related to vaping and pregnancy. To refine the ChatGPT prompt used to screen for content, a human coder cross-checked ChatGPT-4's outputs for 10 out of every 100 metadata entries, with errors used to inform the final prompt. The final prompt was evaluated through human review: a reviewer checked posts for "pregnancy" and "vape" content, and these determinations were compared with those made by ChatGPT. Our results indicated that ChatGPT-4 classified 44.86% of the videos as exclusively related to pregnancy, 36.91% to vaping, and 8.91% as containing both topics. A human reviewer confirmed vaping and pregnancy content in 45.38% of the TikTok posts identified by ChatGPT as containing relevant content. Human review of 10% of the posts screened out by ChatGPT found a 99.06% agreement rate for excluded posts. ChatGPT has mixed capacity to screen social media content that has been converted into text data using machine learning techniques such as object detection. In the current case example, ChatGPT's sensitivity was lower than that of a human coder, but ChatGPT demonstrated power for screening out irrelevant content and can be used as an initial pass at screening content. Future studies should explore ways to enhance ChatGPT's sensitivity.
ISSN:	2376-5992
DOI:	10.7717/peerj-cs.2710
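
Illustrative sketch (not from the article): the summary describes pairing pregnancy and vaping hashtags, merging each post's text sources into one consolidated record, and comparing ChatGPT's determinations with a human reviewer's. The Python below is a minimal sketch of those three steps under stated assumptions. The `Post` fields, the function names, and the way the hashtag pairs are formed (a full cross of two lists) are all hypothetical; the article's actual data schema, pairing procedure, and prompt are not given in this record.

```python
from dataclasses import dataclass
from itertools import product


@dataclass
class Post:
    """Text extracted from one post. All field names are hypothetical."""
    post_id: str
    description: str = ""
    transcript: str = ""     # e.g. output of speech recognition
    frame_text: str = ""     # e.g. text detected in video frames
    object_labels: str = ""  # e.g. objects detected in video frames


def hashtag_pairs(pregnancy_tags, vaping_tags):
    """Cross every pregnancy hashtag with every vaping hashtag (assumed pairing)."""
    return list(product(pregnancy_tags, vaping_tags))


def merge_text(post):
    """Join the non-empty text fields into one labeled string for screening."""
    parts = [
        ("description", post.description),
        ("transcript", post.transcript),
        ("frame_text", post.frame_text),
        ("object_labels", post.object_labels),
    ]
    return "\n".join(
        f"{name}: {text.strip()}" for name, text in parts if text.strip()
    )


def agreement_rate(model_labels, human_labels):
    """Share of posts on which the model and the human reviewer agree."""
    if len(model_labels) != len(human_labels) or not model_labels:
        raise ValueError("label lists must be non-empty and of equal length")
    matches = sum(m == h for m, h in zip(model_labels, human_labels))
    return matches / len(model_labels)
```

In use, `merge_text` would be applied to every post before the consolidated text is passed to the screening model, and `agreement_rate` would be computed on the human-reviewed subsample only (for example, a 10% sample of excluded posts), not on the full dataset.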