Multilingual SMS Spam Detection using BERT and LSTM


Bibliographic Details
Published in: 2024 International Conference on Innovations and Challenges in Emerging Technologies (ICICET), pp. 1-6
Main Authors: Nayak, Amlan; Kumari, Rina; Pal, Debapam; Jana, Sudatta; Bhardwaj, Aniket; Dasude, Pratim Mangaldas
Format: Conference Proceeding
Language: English
Published: IEEE, 07.06.2024

Summary: As digital communication has grown, the fight against spam has intensified, which underscores how important spam identification is to systems such as social media moderation, email filtering, and comment spam prevention. Machine learning algorithms must be continually improved to stay ahead of newly developed spamming techniques and to provide a safe online environment. This study uses a Kaggle dataset originally intended for spam detection. To support multilingual spam detection in French, German, and English, the data required some transformations and transitions. Thorough preparation, including stop-word removal, tokenization, and categorization by language, makes the dataset more flexible for investigating intricate spam patterns in multilingual settings. Several machine learning models were applied: Multinomial Naive Bayes (Multinomial NB), XGBoost, LSTM, and BERT. Among the models tested, Multinomial Naive Bayes performed best, with a combined accuracy of 98.1%, making it a reliable choice for spam detection. Built on rigorous data cleaning, exploration, and model evaluation, the work offers useful insights for spam detection across datasets in multiple languages.
DOI: 10.1109/ICICET59348.2024.10616322
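
Illustrative sketch: the summary describes language-aware preprocessing (tokenization, stop-word removal) followed by a Multinomial Naive Bayes classifier. The Python below is one minimal way such a pipeline might look, not the authors' implementation. The stop-word lists, the toy messages, and the names `tokenize`, `MultinomialNB`, and `TRAIN` are all assumptions made up for this example; the paper's Kaggle data and actual settings are not reproduced here.

```python
import math
import re
from collections import Counter, defaultdict

# Tiny hypothetical stop-word lists; the paper's actual lists are not given.
STOP_WORDS = {
    "en": {"the", "a", "to", "you", "is", "at", "for", "your", "are", "we"},
    "fr": {"le", "la", "un", "une", "de", "à", "vous", "est", "pour", "votre"},
    "de": {"der", "die", "das", "ein", "eine", "zu", "sie", "ist", "für", "ihr"},
}


def tokenize(text, lang):
    """Lowercase, keep runs of letters, and drop stop words for `lang`."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS.get(lang, set())]


class MultinomialNB:
    """Multinomial Naive Bayes over token counts with additive smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # Laplace smoothing strength

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
        self.vocab = {w for c in self.word_counts.values() for w in c}
        self.total = {c: sum(self.word_counts[c].values())
                      for c in self.class_counts}
        return self

    def predict(self, tokens):
        n_docs = sum(self.class_counts.values())
        v = len(self.vocab)
        scores = {}
        for c in self.class_counts:
            # log prior plus sum of smoothed log likelihoods
            score = math.log(self.class_counts[c] / n_docs)
            for t in tokens:
                if t not in self.vocab:
                    continue  # ignore words never seen in training
                score += math.log((self.word_counts[c][t] + self.alpha)
                                  / (self.total[c] + self.alpha * v))
            scores[c] = score
        return max(scores, key=scores.get)


# Invented toy messages (language, text, label); not taken from the dataset.
TRAIN = [
    ("en", "Win a free prize now, claim your cash", "spam"),
    ("en", "Are we still meeting for lunch today", "ham"),
    ("fr", "Gagnez un prix gratuit maintenant", "spam"),
    ("fr", "On se voit pour le déjeuner demain", "ham"),
    ("de", "Gewinnen Sie jetzt einen gratis Preis", "spam"),
    ("de", "Treffen wir uns morgen zum Mittagessen", "ham"),
]

model = MultinomialNB().fit(
    [tokenize(text, lang) for lang, text, _ in TRAIN],
    [label for _, _, label in TRAIN],
)

print(model.predict(tokenize("Claim your free prize", "en")))
print(model.predict(tokenize("Déjeuner demain", "fr")))
```

A single model is trained on all three languages here, with only the stop-word list depending on the language tag. Whether the authors trained one combined model or one per language is not stated in the summary, so that choice is also an assumption.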