K-means algorithm

K-means clustering is an unsupervised learning method for unlabelled data (data without defined categories or groups). Typical applications include image segmentation, clustering gene segmentation data, news article clustering, clustering languages, species clustering and anomaly detection.

The goal of the algorithm is to find groups in the data, with the number of groups given by the variable K. It works iteratively, assigning each data point to one of K groups based on the features that are provided, so that data points are clustered by feature similarity. The algorithm produces two results: the centroids of the K clusters, which can be used to label new data, and labels for the training data (each data point is assigned to a single cluster).
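To illustrate how centroids can be used to label new data, here is a minimal sketch in Python with NumPy. The function name `assign_to_nearest` and the centroid values are made up for the example; they stand in for the output of an earlier K-means run.

```python
import numpy as np

def assign_to_nearest(points, centroids):
    """Return, for each point, the index of its closest centroid (Euclidean distance)."""
    # Pairwise distances, shape (n_points, n_centroids)
    distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return distances.argmin(axis=1)

# Hypothetical centroids, as if produced by an earlier K-means run
centroids = np.array([[0.0, 0.0], [10.0, 10.0]])
new_points = np.array([[1.0, 2.0], [9.0, 8.5], [4.0, 4.0]])

labels = assign_to_nearest(new_points, centroids)
print(labels)  # [0 1 0]
```

The third point, (4, 4), is about 5.7 away from the first centroid and about 8.5 away from the second, so it receives label 0.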

  1. Pick K random items from the dataset and use them as the initial cluster centroids. The value of K is often known in advance; if not, the Elbow Method can be used to choose it.
  2. Assign each remaining item in the dataset to the nearest cluster centroid by calculating its distance to every centroid (typically the Euclidean distance).
  3. Compute each new cluster centre as the average of the points assigned to it (that is, recalculate the clusters' centroids).
  4. Repeat steps 2 and 3 until the cluster assignments no longer change.
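The four steps above can be sketched in Python with NumPy roughly as follows. This is an illustrative implementation, not a reference one: the function name `kmeans`, the `max_iter` safety limit, the fixed `seed` and the handling of empty clusters (the old centroid is kept) are all assumptions made for the example.

```python
import numpy as np

def kmeans(data, k, max_iter=100, seed=0):
    """Basic K-means; returns (centroids, labels)."""
    data = np.asarray(data, dtype=float)
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the initial centroids
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    labels = None
    for _ in range(max_iter):
        # Step 2: assign every point to its nearest centroid (Euclidean distance)
        distances = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = distances.argmin(axis=1)
        # Step 4: stop as soon as the assignments no longer change
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 3: move each centroid to the mean of its assigned points
        for j in range(k):
            members = data[labels == j]
            if len(members) > 0:  # assumption: an empty cluster keeps its old centroid
                centroids[j] = members.mean(axis=0)
    return centroids, labels

# Synthetic example data: two well-separated blobs of 50 points each
rng = np.random.default_rng(42)
blob_a = rng.normal(loc=[0, 0], scale=0.5, size=(50, 2))
blob_b = rng.normal(loc=[5, 5], scale=0.5, size=(50, 2))
data = np.vstack([blob_a, blob_b])

centroids, labels = kmeans(data, k=2)
print(centroids)            # one centroid near (0, 0), the other near (5, 5)
print(np.bincount(labels))  # number of points in each cluster
```

Because the starting centroids are chosen at random, the result can depend on the seed; in practice the algorithm is often run several times and the best result is kept.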

As a result, each “cluster centre” is the arithmetic mean of all the points belonging to its cluster, and each point is closer to its own cluster centre than to any other cluster centre. An example in Python is here.
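The Elbow Method mentioned in step 1 can be sketched along the same lines: run K-means for several values of k, record the sum of squared distances from each point to its nearest centroid, and look for the k after which this value stops dropping sharply. The helper `kmeans_inertia`, the use of five random restarts and the synthetic three-blob data are assumptions made for this illustration.

```python
import numpy as np

def kmeans_inertia(data, k, max_iter=100, seed=0):
    """Run a basic K-means and return the within-cluster sum of squared distances."""
    rng = np.random.default_rng(seed)
    centroids = data[rng.choice(len(data), size=k, replace=False)].astype(float)
    for _ in range(max_iter):
        distances = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Mean of each cluster; an empty cluster keeps its previous centroid
        new_centroids = np.array([
            data[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    distances = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
    return float((distances.min(axis=1) ** 2).sum())

# Synthetic data: three blobs, so the "elbow" is expected around k = 3
rng = np.random.default_rng(1)
data = np.vstack([
    rng.normal(loc=centre, scale=0.5, size=(50, 2))
    for centre in ([0, 0], [6, 0], [3, 5])
])

inertias = []
for k in range(1, 7):
    # Keep the best of a few random restarts to reduce the effect of a bad start
    best = min(kmeans_inertia(data, k, seed=s) for s in range(5))
    inertias.append(best)
    print(k, round(best, 1))
```

Plotting `inertias` against k gives the curve whose bend (the "elbow") suggests a reasonable number of clusters.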

  • Last modified: 2019/09/08 13:57