Domain-adaptive entity recognition: unveiling the potential of CSER in cybersecurity and beyond

Bibliographic Details
Published in International journal of machine learning and cybernetics Vol. 16; no. 5; pp. 2849 - 2867
Main Authors Marjan, Md. Abu; Amagasa, Toshiyuki
Format Journal Article
Language English
Published Berlin/Heidelberg Springer Berlin Heidelberg 01.06.2025
Springer Nature B.V
ISSN 1868-8071
1868-808X
DOI 10.1007/s13042-024-02424-9

Summary: In the dynamic field of cybersecurity, precise recognition and identification of cybersecurity-related entities in textual data have become crucial. Existing studies on Named Entity Recognition (NER) in the cybersecurity domain often overlook challenges posed by data sparsity and the substantial presence of Out-of-Vocabulary (OOV) tokens in Cyber Threat Intelligence (CTI) reports. To tackle these challenges, we introduce the Cybersecurity Entity Recognition (CSER) model, a comprehensive approach crafted to handle CTI data complexities and similar intricacies across other domains. The CSER model integrates the outputs of contextual, semantic, and morphological encoders to form a robust feature vector, capturing nuanced patterns, buzzwords, and structural attributes specific to cybersecurity entities. In particular, we employ various deep-learning approaches to capture morphological and contextual features, while pre-trained embeddings are utilized to capture semantic features. Additionally, a Conditional Random Field (CRF) is employed as a sequential decoder, enhancing the effectiveness of cybersecurity entity identification. Extensive experiments on genuine cybersecurity datasets reveal that the proposed CSER model surpasses contemporary state-of-the-art methods, demonstrating superior predictive performance. To validate the effectiveness of this model, experiments are extended to datasets from the biomedical and material science domains, providing comprehensive insights into the model’s adaptability across diverse domains. Our research demonstrates that the CSER model excels in domains with frequent OOV tokens, particularly cybersecurity, addressing data sparsity effectively. Its capability to manage a substantial volume of OOV tokens enhances performance where traditional models struggle.
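
The summary describes two generic steps: concatenating the outputs of three encoders into one feature vector per token, and decoding the label sequence with a CRF. The following is a minimal sketch of those two steps in plain Python, not the authors' implementation. The function names, the label set, and every score below are hypothetical and chosen only for illustration; in a trained model the emission and transition scores would be learned, and the record does not state how the encoders are built.

```python
from typing import List, Sequence

Vector = Sequence[float]


def concat_features(contextual: Sequence[Vector],
                    semantic: Sequence[Vector],
                    morphological: Sequence[Vector]) -> List[List[float]]:
    """Join the three per-token encoder outputs into one vector per token."""
    if not (len(contextual) == len(semantic) == len(morphological)):
        raise ValueError("all encoders must cover the same tokens")
    return [list(c) + list(s) + list(m)
            for c, s, m in zip(contextual, semantic, morphological)]


def viterbi_decode(emissions: Sequence[Vector],
                   transitions: Sequence[Vector],
                   labels: Sequence[str]) -> List[str]:
    """Highest-scoring label sequence for a linear-chain CRF.

    emissions[t][y]   : score of label y at token position t
    transitions[p][y] : score of moving from label p to label y
    """
    if not emissions:
        return []
    n = len(labels)
    score = list(emissions[0])   # best score of any path ending in each label
    backpointers = []            # best previous label, per step and label
    for t in range(1, len(emissions)):
        new_score, pointers = [], []
        for y in range(n):
            best_prev = max(range(n), key=lambda p: score[p] + transitions[p][y])
            new_score.append(score[best_prev] + transitions[best_prev][y]
                             + emissions[t][y])
            pointers.append(best_prev)
        score = new_score
        backpointers.append(pointers)
    # Trace the best path backwards from the best final label.
    best = max(range(n), key=lambda y: score[y])
    path = [best]
    for pointers in reversed(backpointers):
        best = pointers[best]
        path.append(best)
    path.reverse()
    return [labels[i] for i in path]


# Hypothetical label set and scores for a three-token sentence.
labels = ["O", "B-Malware", "I-Malware"]
emissions = [[2.0, 0.5, 0.0],
             [0.2, 1.5, 1.0],
             [0.5, 0.1, 1.2]]
transitions = [[0.5, 0.0, -100.0],   # from O: "O -> I-Malware" is penalised
               [0.0, -1.0, 1.0],     # from B-Malware
               [0.0, -1.0, 0.5]]     # from I-Malware

print(concat_features([[1.0, 2.0]], [[3.0]], [[4.0, 5.0]]))
print(viterbi_decode(emissions, transitions, labels))
```

The large negative transition score shows why a sequential decoder can help: an inside tag directly after an outside tag is effectively ruled out, which a per-token classifier cannot enforce on its own.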