Medical Provider Embeddings for Healthcare Fraud Detection

Bibliographic Details
Published in: SN Computer Science, Vol. 2, no. 4, p. 276
Main Authors: Johnson, Justin M.; Khoshgoftaar, Taghi M.
Format: Journal Article
Language: English
Published: Singapore: Springer Singapore, 01.07.2021
Springer Nature B.V.
Summary: Advances in data mining and machine learning continue to transform the healthcare industry and provide value to medical professionals and patients. In this study, we address the problem of encoding medical provider types and present four techniques for learning dense, semantic embeddings that capture provider specialty similarities. The first two methods (GloVe and Med-W2V) use pre-trained word embeddings to convert provider specialty descriptions to phrase embeddings. Next, HcpsVec and RxVec embeddings are constructed from publicly available big data using specialty-procedure and specialty-drug occurrence matrices, respectively. We evaluate the learned provider type embeddings on two real-world Medicare fraud classification problems using logistic regression (LR), random forest (RF), gradient boosted tree (GBT), and multilayer perceptron (MLP) learners. Through repetition, statistical analysis, and feature importance measures, we confirm that semantic embeddings for provider types significantly improve fraud classification results. Finally, t-SNE visualizations are used to show that the learned provider type embeddings capture meaningful specialty characteristics and provider type similarities. Our primary contributions are two novel methods for encoding medical specialties using procedure-level statistics and the evaluation of four encoding techniques on two large-scale healthcare fraud classification tasks. Since all data sources are publicly available, these encoding techniques can be readily adopted and applied in future machine learning applications in the healthcare industry.
ISSN: 2662-995X, 2661-8907
DOI: 10.1007/s42979-021-00656-y
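
Illustrative sketch: The summary names two families of provider type embeddings: phrase embeddings built from pre-trained word vectors, and embeddings built from specialty-procedure or specialty-drug occurrence matrices. The record does not say how either is computed, so the Python below is only a minimal sketch under stated assumptions: mean pooling of word vectors for the phrase embedding, and row normalization followed by truncated SVD for the occurrence matrix. The function names, the toy vectors, and the toy counts are all hypothetical and are not taken from the article.

```python
import numpy as np


def phrase_embedding(description, word_vectors):
    """Average the vectors of the known words in a specialty description.

    Mean pooling is an assumption; the article may combine words differently.
    """
    tokens = description.lower().split()
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        # No known words: fall back to a zero vector of the right size.
        dim = len(next(iter(word_vectors.values())))
        return np.zeros(dim)
    return np.mean(known, axis=0)


def occurrence_embeddings(counts, dim):
    """Turn a specialty-by-procedure count matrix into dense vectors.

    Rows are normalized to relative frequencies, then reduced with a
    truncated SVD. Both steps are assumptions made for illustration.
    """
    counts = np.asarray(counts, dtype=float)
    row_sums = counts.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0  # avoid dividing by zero for empty rows
    freqs = counts / row_sums
    u, s, _ = np.linalg.svd(freqs, full_matrices=False)
    return u[:, :dim] * s[:dim]


# Hypothetical 3-dimensional "pre-trained" word vectors.
toy_vectors = {
    "internal": np.array([1.0, 0.0, 0.0]),
    "medicine": np.array([0.0, 1.0, 0.0]),
    "family": np.array([0.0, 0.0, 1.0]),
}
internal_medicine = phrase_embedding("Internal Medicine", toy_vectors)

# Hypothetical counts: 3 specialties (rows) by 3 procedure codes (columns).
# The first two specialties bill similar procedures; the third does not.
toy_counts = [[10, 0, 5], [8, 1, 6], [0, 12, 0]]
specialty_vectors = occurrence_embeddings(toy_counts, dim=2)
```

In this toy setup the first two rows of `specialty_vectors` end up closer to each other than either is to the third, which is the kind of specialty similarity the summary says the learned embeddings capture.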