The Data System Gap: Identifying Inconsistencies in Information Pipelines
DOI: https://doi.org/10.62643/
Keywords: Data Quality, Data Profiling, Pipeline Debugging, Data Repair, Machine Learning, Django
Abstract
Failures in data-driven systems often originate from inconsistencies in the data rather than from errors in software logic. This paper implements a DataPrism-inspired system for detecting and repairing dataset issues in machine learning (ML) pipelines. The system compares a reference (passing) dataset with a problematic (failing) dataset to identify discriminative data profiles, including missing values, domain drift, population bias, and schema inconsistencies. Detected issues are corrected through automated data imputation and normalization. A Django web interface supports dataset analysis, issue visualization, repair, and download of the repaired data. Evaluation of five ML models (Random Forest, Gradient Boosting, SVM, AdaBoost, Logistic Regression) on original versus repaired datasets shows that data repair improves average model accuracy by 28.5%. The system thus provides an automated approach to detecting data-system mismatches and improving the reliability of data-driven systems.
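The comparison of passing and failing datasets described in the abstract can be illustrated with a minimal sketch using pandas. This is not the paper's implementation: the function names (`profile`, `discriminative_profiles`, `repair`), the `missing_tol` threshold, the toy data, and the choice of median imputation plus clipping to the reference range (a simple stand-in for the normalization step) are all assumptions made for illustration. Population bias detection is omitted.

```python
import pandas as pd


def profile(df):
    """Summarize each column: coarse type, missing-value rate, value domain."""
    out = {}
    for col in df.columns:
        s = df[col]
        numeric = pd.api.types.is_numeric_dtype(s)
        out[col] = {
            "dtype": "numeric" if numeric else "categorical",
            "missing_rate": float(s.isna().mean()),
            # Numeric domain is a (min, max) range; categorical is a value set.
            "domain": (float(s.min()), float(s.max())) if numeric
                      else set(s.dropna().unique()),
        }
    return out


def discriminative_profiles(passing, failing, missing_tol=0.05):
    """Return (column, issue_kind, detail) where failing deviates from passing."""
    issues = []
    ref, cur = profile(passing), profile(failing)
    for col, r in ref.items():
        if col not in cur:
            issues.append((col, "schema", "column absent in failing data"))
            continue
        c = cur[col]
        if c["dtype"] != r["dtype"]:
            issues.append((col, "schema", f"{r['dtype']} -> {c['dtype']}"))
            continue
        if c["missing_rate"] - r["missing_rate"] > missing_tol:
            issues.append((col, "missing_values", c["missing_rate"]))
        if r["dtype"] == "numeric":
            (lo, hi), (clo, chi) = r["domain"], c["domain"]
            if clo < lo or chi > hi:
                issues.append((col, "domain_drift", (clo, chi)))
        else:
            unseen = c["domain"] - r["domain"]
            if unseen:
                issues.append((col, "domain_drift", unseen))
    return issues


def repair(passing, failing, issues):
    """Impute with reference statistics and clip numeric drift (assumed policy)."""
    fixed = failing.copy()
    for col, kind, _ in issues:
        numeric = pd.api.types.is_numeric_dtype(passing[col])
        if kind == "missing_values":
            fill = passing[col].median() if numeric else passing[col].mode().iloc[0]
            fixed[col] = fixed[col].fillna(fill)
        elif kind == "domain_drift" and numeric:
            fixed[col] = fixed[col].clip(passing[col].min(), passing[col].max())
        # Categorical drift and schema issues are reported but left unrepaired.
    return fixed


# Hypothetical toy data: the failing set has gaps, an outlier, and a new category.
passing = pd.DataFrame({"age": [25, 32, 47, 51], "city": ["A", "B", "A", "B"]})
failing = pd.DataFrame({"age": [25, None, 130, None], "city": ["A", "B", "C", "A"]})

issues = discriminative_profiles(passing, failing)
fixed = repair(passing, failing, issues)
```

In this sketch, each issue is tied to a single column and a single profile type, so the repair step can be applied selectively. A fuller system would also need to decide how to handle unseen categories and shifted group proportions, which are only flagged or ignored here.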
License

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.













