K-Means Clustering

  • Assume that the examples are d-dimensional vectors

  • Goal: group the examples into k clusters

  • Cluster $i$ is defined by a point $m_i \in \mathbb{R}^d$ (this is also called the cluster representative)

  • Assign each example to the cluster with the closest representative, using Euclidean distance (if two cluster representatives are equally close, break the tie in favour of the lower-indexed one)

  • Recompute the cluster representatives: the representative of cluster $i$ becomes the mean of all the examples assigned to it, i.e. its centroid

  • Repeat the previous two steps until convergence, i.e. until the representatives stop changing (convergence in a finite number of steps is guaranteed: the error never increases at either step, and there are only finitely many ways to assign $n$ points to $k$ clusters, namely $k^n$, so the algorithm cannot cycle; a minimal code sketch follows this list)
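
As a concrete illustration, here is a minimal sketch of this loop in Python with NumPy. The function name `kmeans` and its parameters are our own choices for illustration, not part of the original notes:

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=None):
    """A minimal k-means sketch. X is an (n, d) array of examples."""
    rng = np.random.default_rng(seed)
    # Initialization: pick k random points from the dataset as representatives.
    reps = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(max_iters):
        # Assignment step: each example joins the cluster with the closest
        # representative (argmin breaks ties toward the lower index).
        dists = np.linalg.norm(X[:, None, :] - reps[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # Update step: recompute each representative as its cluster's centroid
        # (keep the old representative if a cluster ends up empty).
        new_reps = np.array([X[assign == i].mean(axis=0) if np.any(assign == i)
                             else reps[i] for i in range(k)])
        if np.allclose(new_reps, reps):
            break  # converged: representatives stopped changing
        reps = new_reps
    return reps, assign
```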

Intuitively, a clustering is good if all the points are near their cluster representatives. The aim is to minimize the intra-cluster distance and to maximize the inter-cluster distance.

The error function is as follows:

$\sum_t ||x^t - \rho(x^t)||^2$, where $\rho(x^t)$ is the representative of the cluster to which $x^t$ belongs.
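
In code, this error can be computed directly from the assignments returned by the `kmeans` sketch above (again, these names are our own):

```python
def kmeans_error(X, reps, assign):
    # sum over t of ||x^t - rho(x^t)||^2, where reps[assign[t]]
    # plays the role of rho(x^t).
    return np.sum(np.linalg.norm(X - reps[assign], axis=1) ** 2)
```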

Initially, we do not know the number of clusters, so we can start with 2 clusters and keep increasing $k$ until the clustering seems satisfactory. Minimizing this error exactly, i.e. finding the optimal $k$ cluster representatives, is an NP-hard problem. Therefore the K-Means clustering algorithm, being a heuristic, is not guaranteed to find them, but it works well in practice.

Note that we can always get 0 error by choosing $k = n$ (every point becomes its own representative), but this defeats the purpose of clustering. We must determine a suitable number of clusters for our data ourselves; one common heuristic is sketched below.
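
A common way to do this is to run the algorithm for increasing $k$ and watch how the error falls. A toy sketch, using the hypothetical `kmeans` and `kmeans_error` functions from above:

```python
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))  # toy 2-dimensional data

for k in range(2, 8):
    reps, assign = kmeans(X, k, seed=0)
    print(k, kmeans_error(X, reps, assign))
# The error always decreases as k grows (reaching 0 at k = n), so we look
# for the point where adding clusters stops helping much (the "elbow").
```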

The final converged clustering may change depending on the initial $k$ cluster representatives. So, how do we choose the initial $k$ cluster representatives? Some approaches:

  • choose k random points from the dataset

  • choose k random points from the domain (they need not be from the dataset)

We could re-run the clustering with different initial representatives and choose the run with the lowest error, or the one that best satisfies our purposes, as in the sketch below.
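
A sketch of such restarts, again building on the hypothetical `kmeans` and `kmeans_error` functions above:

```python
def kmeans_restarts(X, k, n_restarts=10):
    """Run k-means from several random initializations; keep the lowest-error run."""
    best_err, best_reps, best_assign = np.inf, None, None
    for seed in range(n_restarts):
        reps, assign = kmeans(X, k, seed=seed)
        err = kmeans_error(X, reps, assign)
        if err < best_err:
            best_err, best_reps, best_assign = err, reps, assign
    return best_reps, best_assign, best_err
```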
