CS-GY 6923: Machine Learning
  • Introduction
  • What is Machine Learning?
  • Types of Machine Learning
    • Supervised Learning
      • Notations
      • Probabilistic Modeling
        • Naive Bayes Classifier
      • Linear Regression
      • Nearest Neighbor
      • Evaluating a Classifier
      • Parametric Estimation
        • Bayesian Approach to Parameter Estimation
        • Parametric Estimation for Simple Linear Regression
        • Parametric Estimation for Multivariate Linear Regression
        • Parametric Estimation for Simple Polynomial Regression
        • Parametric Estimation for Multivariate Polynomial Regression
      • Bias and Variance of an Estimator
      • Bias and Variance of a Regression Algorithm
        • Model Selection
      • Logistic Regression
      • Decision Trees
        • Using Decision Trees for Regression
        • Bias and Variance
      • Dimensionality Reduction
      • Neural Networks
        • Training a Neuron
        • MLP
          • Regression with Multiple Outputs
          • Advice/Tricks and Issues to Train a Neural Network
        • Deep Learning
      • Support Vector Machines
      • Ensemble Learning
    • Unsupervised Learning
      • K-Means Clustering
      • Probabilistic Clustering
    • Reinforcement Learning

K-Means Clustering

  • Assume that the examples are d-dimensional vectors

  • Goal: group the examples into k clusters

  • Cluster $i$ will be defined by a point $m_i \in R^d$ (this is also called the cluster representative)

  • Assign each example to the cluster with the closest representative using Euclidean distance (if two cluster representatives are equally close, break the tie by choosing the one with the lower index)

  • Recompute cluster representatives (by computing the mean of all the examples in the cluster i.e. computing the centroid)

  • Repeat the previous two steps until convergence, i.e., until the representatives stop changing (convergence in a finite number of steps is guaranteed because the error never increases and there are only finitely many ways, $k^n$, to assign $n$ points to $k$ clusters); a sketch of this loop is given after the list
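
A minimal NumPy sketch of this loop, assuming the examples are the rows of an array `X`; the function name `kmeans`, its default arguments, and the choice to initialize the representatives with $k$ random examples from the dataset are illustrative assumptions, not something prescribed by the notes:

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """X: (n, d) array of n d-dimensional examples. Returns (reps, labels)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Initialize the k cluster representatives with k random examples (assumption).
    reps = X[rng.choice(n, size=k, replace=False)].copy()

    for _ in range(max_iter):
        # Assignment step: each example goes to the closest representative
        # (Euclidean distance); argmin breaks ties toward the lower cluster index.
        dists = np.linalg.norm(X[:, None, :] - reps[None, :, :], axis=2)  # (n, k)
        labels = dists.argmin(axis=1)

        # Update step: recompute each representative as the centroid (mean) of
        # the examples assigned to it; an empty cluster keeps its old representative.
        new_reps = np.array([X[labels == i].mean(axis=0) if np.any(labels == i) else reps[i]
                             for i in range(k)])

        # Converged once the representatives stop changing.
        if np.allclose(new_reps, reps):
            break
        reps = new_reps

    return reps, labels
```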

Intuitively, a clustering is good if all the points are near their cluster representatives. The aim is to minimize the intra-cluster distance and to maximize the inter-cluster distance.

The error function is as follows:

$\sum_t ||x^t-\rho(x^t)||^2$, where $\rho(x^t)$ is the cluster representative of the cluster to which $x^t$ belongs.
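
As a small sketch, this error can be computed directly from the output of the `kmeans` function sketched above (`reps[labels]` picks out each example's representative):

```python
def clustering_error(X, reps, labels):
    # Sum of squared Euclidean distances from each example to its representative.
    return float(np.sum(np.linalg.norm(X - reps[labels], axis=1) ** 2))
```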

Initially, we do not know the number of clusters, so we can start with 2 clusters and keep increasing the number of clusters until the clustering seems satisfactory. Even for a fixed $k$, finding the cluster representatives that exactly minimize this error is NP-hard. The K-Means clustering algorithm is therefore a heuristic: it is not guaranteed to find the optimal $k$ cluster representatives, but it works well in practice.

Note that we can always get 0 error by choosing $k=n$ (each example becomes its own cluster), but this is not our aim; we must manually determine a suitable number of clusters for our data.
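
A purely illustrative sketch of this "keep increasing $k$" idea, reusing the hypothetical helpers above on toy data: the error keeps falling as $k$ grows (and would reach 0 at $k=n$), so we stop where the clustering seems satisfactory rather than where the error is smallest:

```python
X = np.random.default_rng(1).normal(size=(200, 2))  # toy 2-D data (assumption)
for k in range(2, 8):
    reps, labels = kmeans(X, k)
    print(f"k={k}  error={clustering_error(X, reps, labels):.2f}")
```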

The final converged clustering may change depending on the choice of the initial $k$ cluster representatives. So, how do we choose the initial $k$ representatives? Some approaches:

  • choose k random points from the dataset

  • choose k random points from the domain (they need not be from the dataset)

We can re-run the clustering with several different sets of initial representatives and choose the run with the lowest error, or the one that best satisfies our purposes (see the sketch below).
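
A sketch of that restart strategy, again using the hypothetical `kmeans` and `clustering_error` helpers, with a different random seed for each run and keeping the lowest-error result:

```python
best_err, best_reps, best_labels = None, None, None
for seed in range(10):
    # Each seed gives a different set of initial representatives.
    reps, labels = kmeans(X, k=3, seed=seed)
    err = clustering_error(X, reps, labels)
    if best_err is None or err < best_err:
        best_err, best_reps, best_labels = err, reps, labels
```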
