NATURAL LANGUAGE PROCESSING OF SOCIAL MEDIA TEXT DATA USING BERT AND XGBOOST

Bibliographic Details
Published in	Radìoelektronika, informatika, upravlìnnâ no. 2; pp. 154 - 167
Main Authors	Batiuk, T., Dosyn, D.
Format	Journal Article
Language	English, Ukrainian
Published	29.06.2025
Summary:	Context. The growth of text data in social networks calls for sentiment analysis methods that account for both lexical and contextual dependencies. Traditional text processing approaches are limited in capturing semantic relationships between words, which reduces classification accuracy. Combining deep neural networks for text vectorization with ensemble machine learning algorithms and interpretation methods can improve the quality of sentiment analysis.
Objective. The aim of the study is to develop and evaluate a new approach to sentiment classification of text messages that combines Sentence-BERT for deep semantic vectorization, XGBoost for high-accuracy classification, SHAP for explaining feature contributions, sentence embedding similarity for assessing semantic similarity, and λ-regularization for improving the generalization ability of the model. The study analyzes the impact of these methods on classification quality, identifies the most significant features, and optimizes parameters.
Method. The study uses Sentence-BERT to map text data into a vector space that preserves deep semantic relationships. XGBoost is used for sentiment classification and provides high accuracy and stability even on imbalanced datasets. SHAP is used to explain feature contributions, which makes it possible to determine which factors have the greatest impact on a prediction. Additionally, sentence embedding similarity is used to compare texts.
Results. The proposed approach demonstrates high efficiency in sentiment classification tasks. The ROC-AUC value confirms the ability of the model to accurately distinguish between sentiment classes. SHAP makes the results interpretable by explaining the influence of each feature on the classification. Sentence embedding similarity confirms the effectiveness of Sentence-BERT in detecting semantically similar texts, and λ-regularization improves the generalization ability of the model.
Conclusions. The scientific novelty of the study lies in the comprehensive combination of Sentence-BERT, XGBoost, SHAP, sentence embedding similarity, and λ-regularization to improve the accuracy and interpretability of sentiment analysis. The results confirm the effectiveness of the proposed approach, which makes it promising for public opinion monitoring, automated content moderation, and personalized recommendation systems. Further research can be aimed at adapting the model to specific domains and improving interpretation methods.
ISSN:	1607-3274, 2313-688X
DOI:	10.15588/1607-3274-2025-2-14
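
Note: two of the ingredients named in the summary can be illustrated with a few lines of standard-library Python. This is a minimal sketch under stated assumptions, not the authors' implementation. It assumes that "sentence embedding similarity" is measured by cosine similarity, and that "λ-regularization" refers to the L2 penalty on leaf weights in XGBoost (the `lambda` / `reg_lambda` parameter), under which the optimal leaf weight is w* = -G / (H + λ), where G and H are the sums of first and second derivatives of the loss over the samples in the leaf. The function names and the toy vectors are hypothetical; real Sentence-BERT embeddings have hundreds of dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def regularized_leaf_weight(grad_sum, hess_sum, lam):
    """Optimal leaf weight under an L2 penalty: w* = -G / (H + lambda)."""
    return -grad_sum / (hess_sum + lam)


if __name__ == "__main__":
    # Hypothetical 3-dimensional "embeddings" used only for illustration.
    text_a = [1.0, 0.0, 1.0]
    text_b = [2.0, 0.0, 2.0]   # same direction as text_a
    text_c = [0.0, 1.0, 0.0]   # orthogonal to text_a
    print(cosine_similarity(text_a, text_b))
    print(cosine_similarity(text_a, text_c))

    # A larger lambda shrinks the leaf weight toward zero.
    for lam in (0.0, 2.0, 10.0):
        print(lam, regularized_leaf_weight(-4.0, 2.0, lam))
```

The second function shows why a larger λ can help generalization: the same gradient statistics produce a smaller leaf weight, so each tree contributes a more conservative correction.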