Optimizing Performance with Parallel K-Means in Tunnel Monitoring Data Clustering Algorithm for Cloud Computing

Vijaykumar Mamidala

Keywords:

K-means clustering, Parallel computing, MapReduce, Scalability, Fault tolerance, Dynamic load balancing, Real-time processing, High-dimensional data

Abstract

This work introduces a parallel K-means clustering approach intended to maximize cloud computing
performance in tunnel monitoring data analysis. Traditional sequential K-means has a high
processing complexity that makes it poorly suited to large-scale datasets. Parallel K-means, which
makes use of distributed computing frameworks such as MapReduce, mitigates this difficulty by
dividing processing jobs among several nodes. The technique creates a centroid representation
for each cluster in the dataset and updates the centroids iteratively until convergence. Scalability,
performance optimization, effective data management, and fault tolerance are the key goals, all of
which are essential for cloud-based data processing pipelines. Despite progress in these areas,
research gaps remain in dynamic load balancing, parameter selection, real-time processing, energy
efficiency, and the handling of high-dimensional data. The primary issue addressed is the
inefficiency of sequential K-means on big datasets, which is made worse by the growing volume,
variety, and velocity of modern data. By leveraging MapReduce, the parallel K-means technique
addresses the drawbacks of the sequential approach and clusters large datasets effectively. The
methodology covers data preprocessing, MapReduce-based algorithm execution, system architecture,
and metrics for performance assessment. The experimental design varies the number of clusters, the
size of the dataset, and the number of iterations in order to evaluate execution time, speed,
scalability, and cluster quality. The results show notable performance gains, which makes parallel
K-means valuable for contemporary data analytics, especially in cloud settings. Ongoing research
aims to improve real-time processing, parameter selection, and load balancing in order to increase
the algorithm's efficiency and suitability for big data applications.
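
As a minimal illustration of the iteration described in the abstract, the following Python sketch simulates the map and reduce phases of K-means in a single process. It is not the implementation evaluated in this work and does not use a real MapReduce framework. The function names (map_phase, reduce_phase, parallel_kmeans), the tol and max_iter parameters, and the representation of each node's data partition as an in-memory list are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import chain
from math import dist


def map_phase(points, centroids):
    """Map step for one data split (hypothetical helper).

    Emits (index of nearest centroid, (point, 1)) for every point.
    """
    for point in points:
        nearest = min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))
        yield nearest, (point, 1)


def reduce_phase(pairs, centroids):
    """Reduce step (hypothetical helper).

    Averages the points assigned to each centroid. A centroid that
    received no points keeps its previous position.
    """
    dims = len(centroids[0])
    sums = defaultdict(lambda: [0.0] * dims)
    counts = defaultdict(int)
    for idx, (point, n) in pairs:
        for d in range(dims):
            sums[idx][d] += point[d]
        counts[idx] += n
    return [
        tuple(s / counts[i] for s in sums[i]) if counts[i] else tuple(centroids[i])
        for i in range(len(centroids))
    ]


def parallel_kmeans(splits, centroids, max_iter=100, tol=1e-6):
    """Repeat map and reduce until no centroid moves by tol or more.

    `splits` stands in for the partitions that would live on separate
    nodes; here they are processed one after another (an assumption).
    """
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iter):
        pairs = chain.from_iterable(map_phase(s, centroids) for s in splits)
        updated = reduce_phase(pairs, centroids)
        shift = max(dist(a, b) for a, b in zip(centroids, updated))
        centroids = updated
        if shift < tol:
            break
    return centroids
```

In a distributed setting each call to map_phase would run on the node holding that split, and only the (index, point) pairs or their partial sums would travel to the reducers, which is what lets the per-iteration cost be divided among nodes.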