ADVANCING MEDICARE FRAUD DETECTION VIA MACHINE LEARNING AND SMOTE-ENN FOR IMBALANCED DATA

Shaik Abdullah; K Muddu Swamy

doi:10.62643/ijerst.2025.v21.i2.pp408-419

Abstract

Healthcare fraud detection is a constantly evolving field with many open challenges, especially when the data are imbalanced. Prior research concentrated mostly on conventional machine learning (ML) methods, which often struggle with imbalanced data. The difficulty shows up differently in each common resampling method: Random Oversampling (ROS) raises the risk of overfitting, the Synthetic Minority Oversampling Technique (SMOTE) can introduce noise, and Random Undersampling (RUS) may discard important information. Improving evaluation metrics, exploring hybrid resampling strategies, and optimising model performance therefore remain essential to achieving higher accuracy on imbalanced datasets. In this study we address the problem of imbalanced datasets in healthcare fraud detection with a novel approach, focusing on the Medicare Part B dataset. We first extract the categorical feature "Provider Type" from the dataset and use it to create new synthetic instances by randomly replicating existing provider types, which increases diversity within the minority class. Next, we apply a hybrid resampling technique called SMOTE-ENN, which combines SMOTE with Edited Nearest Neighbours (ENN). By generating synthetic samples and removing noisy ones, this technique aims to balance the dataset and improve the models' accuracy. We classify the instances using six ML models and assess performance with standard measures such as accuracy, F1 score, recall, precision, and the AUC-ROC curve. We also emphasise the importance of the Area Under the Precision-Recall Curve (AUPRC) for evaluating performance on imbalanced datasets. The experiments show that Decision Trees (DT) outperformed all other classifiers, scoring 0.99 on every metric.
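To make the resampling step concrete, the following is a minimal standard-library sketch of the SMOTE-ENN idea on toy two-dimensional data. It is not the authors' implementation: the function names (`nearest_indices`, `smote`, `enn`), the toy data, the choice of k = 3, the binary-label assumption, and the majority-vote form of ENN are all assumptions made for illustration. In practice a library implementation such as `SMOTEENN` from imbalanced-learn would normally be used instead.

```python
import math
import random
from collections import Counter


def nearest_indices(point, points, k, skip=None):
    """Indices of the k points closest to `point` (Euclidean), ignoring index `skip`."""
    order = sorted(
        (i for i in range(len(points)) if i != skip),
        key=lambda i: math.dist(point, points[i]),
    )
    return order[:k]


def smote(X, y, minority, k=3, rng=None):
    """Add interpolated minority samples until both classes have equal size (binary labels assumed)."""
    rng = rng or random.Random(0)
    minority_pts = [x for x, label in zip(X, y) if label == minority]
    n_needed = (len(X) - len(minority_pts)) - len(minority_pts)
    X_new, y_new = list(X), list(y)
    for _ in range(max(n_needed, 0)):
        i = rng.randrange(len(minority_pts))
        j = rng.choice(nearest_indices(minority_pts[i], minority_pts, k, skip=i))
        gap = rng.random()  # position along the segment between the two minority points
        base, neigh = minority_pts[i], minority_pts[j]
        X_new.append(tuple(b + gap * (n - b) for b, n in zip(base, neigh)))
        y_new.append(minority)
    return X_new, y_new


def enn(X, y, k=3):
    """Keep only samples whose label matches the majority vote of their k nearest neighbours."""
    keep = []
    for i, (x, label) in enumerate(zip(X, y)):
        votes = Counter(y[j] for j in nearest_indices(x, X, k, skip=i))
        if votes.most_common(1)[0][0] == label:
            keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]


# Hypothetical toy data: 40 "legitimate" points (label 0) and 6 "fraud" points (label 1).
rng = random.Random(42)
majority = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(40)]
minority = [(rng.gauss(2.5, 0.7), rng.gauss(2.5, 0.7)) for _ in range(6)]
X = majority + minority
y = [0] * len(majority) + [1] * len(minority)

X_sm, y_sm = smote(X, y, minority=1, k=3, rng=rng)  # step 1: oversample the minority class
X_res, y_res = enn(X_sm, y_sm, k=3)                 # step 2: remove samples that look like noise

print("original:", Counter(y))
print("after SMOTE:", Counter(y_sm))
print("after SMOTE-ENN:", Counter(y_res))
```

SMOTE alone leaves the two classes the same size. The ENN pass may then remove samples from either class, so the final counts need not be exactly equal.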