Double-weight LDA extracting keywords for financial fraud detection system

Bibliographic Details
Published in	Multimedia tools and applications Vol. 83; no. 17; pp. 50757 - 50781
Main Authors	Cheng, Ching-Hsue, Cai, Wen-Hong
Format	Journal Article
Language	English
Published	New York: Springer US, 01.05.2024; Springer Nature B.V.
Subjects	Computer Communication Networks; Computer Science; Data Structures and Information Theory; Datasets; Dirichlet problem; Early warning systems; Electronic mail; Fraud; Fraud prevention; Graphical representations; Keywords; Multimedia Information Systems; Performance evaluation; Special Purpose and Application-Based Systems; Latent Dirichlet allocation; Natural language processing; Imbalanced classes; Visual information representation; Financial fraud detection model
Summary:	The impact of financial fraud is widespread, from everyday life to the financial industry; it reduces confidence in the industry and destabilizes a country’s economy. It is therefore important to develop an intelligent financial fraud detection system for early warning and prevention. This study proposes a double-weight latent Dirichlet allocation (DW-LDA) method to extract keywords from financial fraud data and then uses five intelligent classifiers to build an intelligent text fraud detection model. Financial fraud datasets usually contain more non-fraud cases than fraud cases, which makes them imbalanced; hence, this study uses the synthetic minority oversampling technique (SMOTE) and random undersampling to handle the imbalance. For verification, this study collected the Enron email and MD&A datasets to compare the performance of related topic models and weighted LDA variants (TFIDF + LDA and PMI + LDA) with the proposed DW-LDA after SMOTE handling. Model performance is evaluated with accuracy, recall, precision, F-score, and AUC, and the results show that the proposed DW-LDA (TFIDF + PMI + LDA) performs better than the listed topic models. For visual information representation, visual graphs show the important results, such as word clouds of the fraudulent emails and keywords. The research results and the resulting intelligent text fraud detection model can be provided to investors and stakeholders for reference.
ISSN:	1380-7501 1573-7721
DOI:	10.1007/s11042-023-17334-1
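
The summary mentions SMOTE and random undersampling for handling the imbalance between fraud and non-fraud cases. The following is a minimal sketch of those two ideas in plain Python, not the authors' implementation. The function names (`smote_sketch`, `random_undersample`), the toy two-dimensional feature rows, the neighbour count `k`, and the way the two steps are combined are all assumptions made for illustration; the record does not state the parameters or the library the study used.

```python
import math
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating between a minority row
    and one of its k nearest minority neighbours (the core SMOTE idea)."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Candidate neighbours: every other minority row, nearest first.
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        partner = rng.choice(neighbours)
        gap = rng.random()  # position on the segment from base to partner
        synthetic.append([b + gap * (p - b) for b, p in zip(base, partner)])
    return synthetic


def random_undersample(majority, n_keep, seed=0):
    """Keep a random subset of the majority class, without replacement."""
    return random.Random(seed).sample(majority, n_keep)


# Hypothetical feature rows (for example, topic weights per document).
fraud = [[0.9, 0.1], [0.8, 0.3], [0.7, 0.2]]
legit = [[0.1, 0.9], [0.2, 0.8], [0.3, 0.7],
         [0.1, 0.6], [0.2, 0.5], [0.0, 0.8]]

balanced_fraud = fraud + smote_sketch(fraud, n_new=1, k=2)  # 4 fraud rows
balanced_legit = random_undersample(legit, n_keep=4)        # 4 non-fraud rows
print(len(balanced_fraud), len(balanced_legit))
```

Because each synthetic row lies on a line segment between two real minority rows, oversampling adds variety without leaving the region the fraud cases already occupy. Resampling of this kind is normally applied to the training split only, so that evaluation metrics are computed on untouched data.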