K-Means Clustering
Assume that the examples are d-dimensional vectors
Goal: group the examples into k clusters
Cluster $i$ will be defined by a point in $\mathbb{R}^d$ called $m_i$ (this is also called the cluster representative)
Assign each example to the cluster with the closest representative using Euclidean distance (if two cluster representatives are equally close, break the tie by choosing the lower-indexed one)
Recompute cluster representatives (by computing the mean of all the examples in the cluster i.e. computing the centroid)
Repeat the previous two steps until convergence, i.e. until the representatives stop changing (this is guaranteed to happen in a finite number of steps: there are only finitely many ways, at most $k^n$, to assign $n$ points to $k$ clusters, and the error never increases from one iteration to the next, so no assignment is revisited)
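A minimal sketch of these steps in NumPy is given below; the function name `kmeans`, the use of `argmin` for the lower-index tie-break, and the handling of empty clusters are assumptions of this sketch, not part of the algorithm statement above.

```python
import numpy as np

def kmeans(X, k, initial_reps):
    """Cluster n d-dimensional examples X (shape n x d) into k clusters.

    initial_reps: k x d array of initial cluster representatives.
    Returns (representatives, assignments).
    """
    reps = initial_reps.copy()
    while True:
        # Step 1: assign each example to the closest representative
        # (Euclidean distance); argmin breaks ties by picking the
        # lower-indexed cluster.
        dists = np.linalg.norm(X[:, None, :] - reps[None, :, :], axis=2)  # n x k
        assign = dists.argmin(axis=1)

        # Step 2: recompute each representative as the centroid (mean)
        # of the examples assigned to it (keep the old one if a cluster
        # happens to be empty).
        new_reps = np.array([
            X[assign == i].mean(axis=0) if np.any(assign == i) else reps[i]
            for i in range(k)
        ])

        # Step 3: stop once the representatives no longer change.
        if np.allclose(new_reps, reps):
            return new_reps, assign
        reps = new_reps
```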
Intuitively, a clustering is good if all the points are near their cluster representatives. The aim is to minimize the intra-cluster distance and to maximize the inter-cluster distance.
The error function is as follows:

$$E = \sum_{j=1}^{n} \lVert x_j - m_{c(j)} \rVert^2$$

where $m_{c(j)}$ is the cluster representative of the cluster to which $x_j$ belongs.
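As a quick sanity check of the formula, the error can be computed directly from the data, the representatives, and the assignments; the names `X`, `reps`, and `assign` follow the sketch above and are illustrative only.

```python
import numpy as np

def clustering_error(X, reps, assign):
    # Sum of squared Euclidean distances from each example x_j to the
    # representative m_{c(j)} of the cluster it is assigned to.
    return np.sum(np.linalg.norm(X - reps[assign], axis=1) ** 2)
```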
Initially, we do not know the number of clusters. So, we can start with 2 clusters and keep increasing the number of clusters until the clustering seems satisfactory. Finding the clustering that minimizes the error (the optimal k cluster representatives) is an NP-hard problem. The K-Means algorithm is therefore a heuristic: it is not guaranteed to find the optimal k cluster representatives, but it works well in practice.
Note that we can always get $0$ error by choosing $k = n$, but this is not our aim. We must determine a suitable number of clusters for our data ourselves.
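One illustrative way to explore this trade-off is to run the clustering for increasing values of $k$ and watch how the error falls, then pick a $k$ beyond which the drop levels off. The snippet below uses scikit-learn's `KMeans` on synthetic data purely as a sketch; the range of $k$ values and the toy dataset are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))  # toy 2-dimensional examples

# The error (inertia) keeps decreasing as k grows and reaches 0 at k = n,
# so the lowest error alone cannot pick k; look for where the drop flattens.
for k in range(2, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, km.inertia_)
```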
The final converged clustering may change depending on the initial k cluster representatives. So, how do we choose the initial k cluster representatives? Some approaches:
choose k random points from the dataset
choose k random points from the domain (they need not be from the dataset)
We could re-run the clustering with different initial representatives and choose the run with the lowest error, or the one that best satisfies our purposes.
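A short sketch of that re-run-and-pick-the-best idea, building on the `kmeans` and `clustering_error` sketches above; the restart count, the seed, and drawing the initial representatives as random points from the dataset are assumptions of this sketch.

```python
import numpy as np

def best_of_restarts(X, k, n_restarts=10, seed=0):
    """Run k-means several times from random initial representatives
    drawn from the dataset, and keep the run with the lowest error."""
    rng = np.random.default_rng(seed)
    best_err, best = np.inf, None
    for _ in range(n_restarts):
        # Initial representatives: k distinct random points from the dataset.
        init = X[rng.choice(len(X), size=k, replace=False)]
        reps, assign = kmeans(X, k, init)          # sketch above
        err = clustering_error(X, reps, assign)    # sketch above
        if err < best_err:
            best_err, best = err, (reps, assign)
    return best, best_err
```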