K-Means groups data into clusters and works on data that has no category labels. The K-Means clustering algorithm is a non-hierarchical (partitional) method.
A clustering algorithm divides data into groups so that the data in one group share similar characteristics and differ from the data in other groups.
Clustering should not be confused with cluster sampling, a sampling technique in which existing groups of population units, called ‘clusters’, are randomly selected. Clustering is a problem solved with unsupervised learning techniques.
K-Means clustering is a data analysis or data mining method that performs unsupervised learning by partitioning the data into groups.
K-Means clustering aims to minimize an objective function defined for the clustering process: it minimizes the variation within each cluster, which in turn maximizes the variation between clusters.
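The objective described above is usually written as the within-cluster sum of squares (WCSS). The notation below is an assumption introduced for illustration and does not come from the text: $k$ is the number of clusters, $C_j$ is the set of points in cluster $j$, and $\mu_j$ is the centroid of that cluster.

```latex
J = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i
```

Because the total variation of a dataset is fixed, making $J$ smaller makes the variation between clusters larger.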
The basic K-Means clustering algorithm is implemented as follows:
- Determine the number of clusters.
- Randomly distribute the data among the clusters.
- Calculate the average (centroid) of the data in each cluster.
- Calculate the distance between each data point and each centroid, and move each point to the cluster with the nearest centroid.
- Repeat the averaging and reassignment steps until the change falls below a threshold value.
- Any distance measure can be used to calculate the distance between data and centroids. The Manhattan (city block) distance is one often-cited example, although standard K-Means uses the Euclidean distance.
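To make the distance step concrete, here is a minimal sketch of the two distance measures. The function names and the sample point and centroid are hypothetical, chosen only for illustration.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """City block distance: the sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

point = (1.0, 2.0)      # hypothetical data point
centroid = (4.0, 6.0)   # hypothetical centroid

print(euclidean(point, centroid))  # 5.0
print(manhattan(point, centroid))  # 7.0
```

The two measures can rank centroids differently, so the choice of distance affects which cluster a point joins.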
K-Means Clustering Stages
Clustering with K-Means involves the following steps:
1. Determine the value of k, the number of clusters to create.
2. Initialize the centroid values randomly. A centroid is the centre value of a cluster. Suppose we set k = 3; then centroids C1, C2, and C3 are chosen randomly.
3. Assign each data point to the nearest centroid. This stage calculates the distance from each data point to each centroid using the Euclidean distance.
4. Recalculate the centroid of each newly formed cluster. This is done by calculating the mean of the data points in the cluster.
5. Repeat steps 3 and 4 until the stopping criterion is met, for example when the assignments no longer change.
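The stages above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a reference implementation: the function name `kmeans`, the `seed` and `max_iter` parameters, and the sample points are all hypothetical, and the initial centroids are taken to be randomly chosen data points.

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Basic K-Means on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    # Initialization: pick k distinct data points as starting centroids.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment: each point joins its nearest centroid (Euclidean).
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Stopping criterion: assignments did not change.
        if new_labels == labels:
            break
        labels = new_labels
        # Update: move each centroid to the mean of its assigned points.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, labels

# Hypothetical data: two visibly separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, labels = kmeans(points, k=2)
print(centroids)
print(labels)
```

The cluster numbers in `labels` are arbitrary and depend on the random start; only the grouping itself is meaningful.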
When to Use K-Means Clustering
We can use K-means clustering when we want to group data based on specific variables and no output class is available (unsupervised learning).
We can use this algorithm when the problem calls for segmentation or grouping into specific subgroups, such as market and customer analysis.
We can also use this algorithm for a first, exploratory look at newly acquired data.
Terms and assumptions
To be able to run K-means clustering, several assumptions must be met, including:
- There are no outliers in our data.
- Our sample represents the entire population.
- There is no multicollinearity between variables/features.
Advantages
The following are the advantages of the K-means clustering algorithm:
- Simple and easy-to-understand algorithm
- Fast processing
- Available in various tools or software
- Easy to apply
- Always delivers results, regardless of the data.
Weaknesses and disadvantages
The weaknesses of the K-means clustering algorithm are:
1. The results are sensitive to the number of clusters (K). To choose K, you can use the Elbow Method, which calculates and compares the within-cluster sum of squares (WCSS) for different values of K. See the explanation below.
2. Sensitive to “seed” initialization. To reduce this problem, we can apply the K-means++ algorithm.
3. Sensitive to outliers. The solution to this problem is to remove outliers with care and consideration.
4. Sensitive to variables that have different scales. We should therefore usually scale or standardize the data before doing k-means clustering. One case in which we should not normalise the data is when one variable is meant to carry a higher weight or importance than the others.
5. Assumes that each cluster is roughly round (spherical), so it struggles when the groups have other shapes.
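The scaling advice in the list above can be illustrated with z-score standardization, which rescales every variable to mean 0 and standard deviation 1. This is a minimal sketch; the function name `standardize` and the customer data are hypothetical.

```python
import statistics

def standardize(rows):
    """Rescale each column to mean 0 and standard deviation 1 (z-scores)."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    # Population standard deviation; a constant column falls back to 1.0
    # so that the division below never divides by zero.
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return [
        tuple((value - m) / s for value, m, s in zip(row, means, stds))
        for row in rows
    ]

# Hypothetical customers: (age in years, yearly income).
customers = [(25, 30000), (35, 60000), (45, 90000)]
scaled = standardize(customers)
print(scaled)
```

Without scaling, the income column would dominate every distance calculation simply because its numbers are larger. After scaling, both columns contribute on the same footing.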
Problems in K-Means:
1. Different Initialization
The first problem comes from the way cluster members or centroids are initialized. Because the initialization is random, different runs can start from different positions and end with different clusters.
A random initialization may sometimes reach a better result, although convergence can then be slower.
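K-means++, mentioned earlier as a remedy for seed sensitivity, replaces the purely random start with a seeding step that spreads the initial centroids apart. The sketch below shows only that seeding step, under stated assumptions: the function name `kmeans_pp_init`, its `seed` parameter, and the sample points are hypothetical, and the data points are assumed to be distinct.

```python
import math
import random

def kmeans_pp_init(points, k, seed=0):
    """Choose k initial centroids using K-means++ style seeding."""
    rng = random.Random(seed)
    # The first centroid is a data point chosen uniformly at random.
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        # Squared distance from each point to its nearest chosen centroid.
        d2 = [min(math.dist(p, c) ** 2 for c in centroids) for p in points]
        # Draw the next centroid with probability proportional to d2, so
        # far-away points are more likely to be picked. This assumes at
        # least k distinct points; otherwise all weights would be zero.
        centroids.append(rng.choices(points, weights=d2, k=1)[0])
    return centroids

# Hypothetical data: three separate pairs of points.
points = [(0.0, 0.0), (0.5, 0.0), (10.0, 10.0), (10.5, 10.0), (20.0, 0.0), (20.5, 0.0)]
print(kmeans_pp_init(points, k=3))
```

The chosen centroids would then be passed to the usual assignment and update loop in place of a uniformly random start.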
2. Latent Problems
A latent problem in the clustering process is choosing the number of clusters. Several approaches can help determine it, such as Partition Entropy and the Gap Statistic.
3. Convergence Failure
The algorithm may fail to converge. This is more likely for hard K-Means, because each data point in the dataset is allocated explicitly to exactly one cluster.
4. Data Modeling Problems
Some common problems arise from the way K-Means models the data.
The K-Means method does not take the actual shape of each cluster into account; it implicitly treats every cluster as round.
Overlapping clusters are often overlooked because overlap is difficult to detect. K-Means has no feature for detecting whether one cluster hides another cluster inside it.
Applications of K-Means Clustering
K-Means clustering is used in a variety of real-life business cases, such as:
- Academic performance
- Diagnostic systems
- Search engines
- Wireless sensor networks
How Does K-Means Clustering Work?
The K-Means algorithm is a method for identifying clusters in a given dataset. There are a few ways to determine the appropriate number of clusters, or “K”.
One method is to experiment with different values for K and assess which one results in the best cluster divisions.
Another approach is to use the Elbow technique, which involves plotting the sum of squared distances from each point to its cluster centroid against K and selecting the value of K at which the decrease in this metric starts to level off.
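The Elbow technique can be sketched by running K-Means for several values of K and printing the sum of squared distances (WCSS) for each. This is a minimal, self-contained illustration: the function names, the `restarts` parameter, and the sample points are hypothetical, and several random starts are used per K because a single start can get stuck in a poor solution.

```python
import math
import random

def kmeans_wcss(points, k, seed=0, max_iter=100):
    """Run a basic K-Means and return the resulting WCSS."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(max_iter):
        # Group the points by their nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[j].append(p)
        # Recompute centroids; an empty cluster keeps its old centroid.
        updated = [
            tuple(sum(c) / len(members) for c in zip(*members)) if members else old
            for members, old in zip(clusters, centroids)
        ]
        if updated == centroids:
            break
        centroids = updated
    # Sum of squared distances from each point to its cluster centroid.
    return sum(
        math.dist(p, centroid) ** 2
        for members, centroid in zip(clusters, centroids)
        for p in members
    )

def best_wcss(points, k, restarts=10):
    """Keep the lowest WCSS found over several random starts."""
    return min(kmeans_wcss(points, k, seed=s) for s in range(restarts))

# Hypothetical data constructed with three groups.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0),
          (8.0, 8.0), (8.5, 9.0), (9.0, 8.0),
          (1.0, 8.0), (1.5, 9.0), (2.0, 8.0)]

for k in range(1, 7):
    print(k, round(best_wcss(points, k), 2))
```

Since this toy data was built from three groups, the printed values are expected to drop steeply up to K = 3 and change little afterwards. On real data the elbow is often less clear and needs judgment.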
Once the value of K has been determined, the algorithm assigns each data point to the cluster centroid that is closest to it. The position of the centroids is then recalculated based on the data points assigned to them.
This process is repeated until the centroids no longer move, indicating that the clustering has converged.
To better understand this concept, consider an example of a grocery store dataset. By running K-Means with different values of K and applying the Elbow technique, we can estimate how many clusters the data should be divided into.
For example, we might find that the data is best divided into three clusters: one for fresh produce, one for packaged goods, and one for dairy products.