Domain-adaptive entity recognition: unveiling the potential of CSER in cybersecurity and beyond
Published in | International journal of machine learning and cybernetics Vol. 16; no. 5; pp. 2849 - 2867 |
---|---|
Main Authors | , |
Format | Journal Article |
Language | English |
Published | Berlin/Heidelberg: Springer Berlin Heidelberg, 01.06.2025; Springer Nature B.V |
ISSN | 1868-8071 1868-808X |
DOI | 10.1007/s13042-024-02424-9 |
Summary: | In the dynamic field of cybersecurity, precise recognition and identification of cybersecurity-related entities in textual data have become crucial. Existing studies on Named Entity Recognition (NER) in the cybersecurity domain often overlook the challenges posed by data sparsity and the substantial presence of Out-of-Vocabulary (OOV) tokens in Cyber Threat Intelligence (CTI) reports. To tackle these challenges, we introduce the Cybersecurity Entity Recognition (CSER) model, a comprehensive approach crafted to handle the complexities of CTI data and similar intricacies in other domains. The CSER model integrates the outputs of contextual, semantic, and morphological encoders to form a robust feature vector, capturing nuanced patterns, buzzwords, and structural attributes specific to cybersecurity entities. In particular, we employ various deep-learning approaches to capture morphological and contextual features, while pre-trained embeddings are used to capture semantic features. Additionally, a Conditional Random Field (CRF) is employed as a sequential decoder, enhancing the effectiveness of cybersecurity entity identification. Extensive experiments on genuine cybersecurity datasets reveal that the proposed CSER model surpasses contemporary state-of-the-art methods, demonstrating superior predictive performance. To validate the effectiveness of this model, experiments are extended to datasets from the biomedical and material science domains, providing comprehensive insights into the model’s adaptability across diverse domains. Our research demonstrates that the CSER model excels in domains with frequent OOV tokens, particularly cybersecurity, and addresses data sparsity effectively. Its capability to manage a substantial volume of OOV tokens enhances performance where traditional models struggle. |
---|---|
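
The summary describes two mechanical steps: joining the outputs of three encoders into one feature vector per token, and decoding the tag sequence with a CRF. The sketch below illustrates only those two steps, using plain Python and toy numbers. It is not the authors' implementation: the function names, the linear scoring layer, the tag set, and all weights are assumptions made for illustration, and the encoders themselves are replaced by hand-written vectors.

```python
from typing import List, Sequence

Vector = Sequence[float]


def concat_features(contextual: Sequence[Vector],
                    semantic: Sequence[Vector],
                    morphological: Sequence[Vector]) -> List[List[float]]:
    """Join the three per-token encoder outputs into one vector per token."""
    return [list(c) + list(s) + list(m)
            for c, s, m in zip(contextual, semantic, morphological)]


def emission_scores(features: Sequence[Vector],
                    weights: Sequence[Vector]) -> List[List[float]]:
    """Assumed linear layer: weights[tag][dim] gives one score per tag per token."""
    return [[sum(w * x for w, x in zip(row, feat)) for row in weights]
            for feat in features]


def viterbi_decode(emissions: Sequence[Vector],
                   transitions: Sequence[Vector]) -> List[int]:
    """Best tag path under linear-chain CRF scoring (emission + transition sums)."""
    if not emissions:
        return []
    n_tags = len(transitions)
    score = list(emissions[0])
    backpointers: List[List[int]] = []
    for emit in emissions[1:]:
        new_score, pointers = [], []
        for j in range(n_tags):
            # Best previous tag i for arriving at tag j.
            best_i = max(range(n_tags), key=lambda i: score[i] + transitions[i][j])
            pointers.append(best_i)
            new_score.append(score[best_i] + transitions[best_i][j] + emit[j])
        score = new_score
        backpointers.append(pointers)
    # Trace the best path back from the best final tag.
    best = max(range(n_tags), key=lambda j: score[j])
    path = [best]
    for pointers in reversed(backpointers):
        best = pointers[best]
        path.append(best)
    path.reverse()
    return path


# Toy usage: three tokens, hypothetical tag set and hand-picked numbers.
TAGS = ["O", "B-MALWARE", "I-MALWARE"]
contextual = [[0.9, 0.1], [0.2, 0.8], [0.1, 0.7]]   # stand-in for a contextual encoder
semantic = [[0.0], [1.0], [1.0]]                    # stand-in for pre-trained embeddings
morphological = [[0.0], [1.0], [0.5]]               # stand-in for a character-level encoder
features = concat_features(contextual, semantic, morphological)
weights = [[1.0, 0.0, 0.0, 0.0],    # O
           [0.0, 1.0, 0.5, 1.0],    # B-MALWARE
           [0.0, 1.0, 0.5, 0.0]]    # I-MALWARE
transitions = [[0.0, 0.0, -5.0],    # from O: discourage jumping straight to I-
               [0.0, -1.0, 1.0],    # from B-: favour continuing with I-
               [0.0, 0.0, 0.5]]     # from I-
predicted = [TAGS[t] for t in viterbi_decode(emission_scores(features, weights), transitions)]
```

In a trained model the weights and transition scores would be learned jointly; here they are fixed by hand so that the decoding step can be followed on paper.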