DATA CLEANING AND PREPROCESSING AUTOMATION IN PYTHON: AN EFFICIENT PIPELINE APPROACH

Y. Suresh Babu; K. Vijayakumar

doi:10.62643/

Keywords:

Data preprocessing, Data cleaning, Automation, Python, Machine learning pipeline, Missing value imputation, Feature scaling, Data encoding, Outlier detection, Imbalanced data handling, Adaptive preprocessing, Automated data preparation.

Abstract

Data preprocessing is a critical stage in the machine learning workflow and directly influences model accuracy and reliability. Traditional preprocessing, however, is often repetitive, time-consuming, and prone to human error. This paper presents an automated data cleaning and preprocessing pipeline developed in Python, designed to minimize manual intervention while improving data quality. The proposed system detects data quality issues such as missing values, outliers, and imbalanced class distributions, and applies suitable treatments, including imputation, scaling, encoding, and feature optimization. By integrating adaptive selection mechanisms and dynamic decision rules, the framework tailors preprocessing to the characteristics of each dataset. Experimental evaluations across multiple datasets demonstrate that the automated pipeline significantly reduces preparation time and enhances model performance compared with manual preprocessing. The approach offers a scalable, reliable, and user-friendly solution for data scientists and analysts seeking streamlined machine learning workflows.
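
The abstract does not specify the decision rules themselves, so the following is only a minimal sketch of what "detect, then choose a treatment" could look like for a single numeric column and a label column. All function names, the 1.5 x IQR outlier rule, and the mean/median switch are illustrative assumptions rather than the authors' method, and the sketch uses only the Python standard library; a production pipeline would more likely build on pandas and scikit-learn.

```python
from collections import Counter
from statistics import mean, median, quantiles


def profile_numeric(values):
    """Summarise one numeric column; None marks a missing entry."""
    present = [v for v in values if v is not None]
    q1, _, q3 = quantiles(present, n=4)  # quartiles of the observed values
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # assumed outlier fences
    outliers = [v for v in present if v < low or v > high]
    return {
        "missing_ratio": 1 - len(present) / len(values),
        "outlier_count": len(outliers),
        "mean": mean(present),
        "median": median(present),
    }


def choose_steps(profile):
    """Hypothetical decision rule: prefer robust statistics when outliers exist."""
    robust = profile["outlier_count"] > 0
    return {
        "impute_with": "median" if robust else "mean",
        "scale_with": "robust" if robust else "standard",
    }


def impute(values, profile, steps):
    """Fill missing entries with the statistic selected by choose_steps."""
    fill = profile[steps["impute_with"]]
    return [fill if v is None else v for v in values]


def imbalance_ratio(labels):
    """Ratio of the largest class count to the smallest class count."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


# Usage on made-up data: one missing entry and one extreme value.
column = [10, 12, None, 11, 13, 12, 300, 11]
profile = profile_numeric(column)
steps = choose_steps(profile)
cleaned = impute(column, profile, steps)
```

Separating profiling from the decision rule keeps each rule easy to inspect and replace, which is one plausible way to realise the adaptive selection the abstract describes.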