K-means Clustering in Machine Learning

Introduction

Machine learning algorithms can be viewed as optimization problems. Given data samples and an objective function, they search for the model that optimizes that objective. In supervised learning, the objective function is built from the labels supplied with the data: we try to minimize the difference between the predictions and the true labels. In unsupervised learning there are no labels, so the objective has to be defined differently.

Clustering algorithms group the data samples into clusters. They aim to minimize the intracluster distances and maximize the intercluster distances. In other words, we want samples in the same cluster to be as similar as possible, and samples from different clusters to be as different as possible. In this post, we will explore one of the simplest and most popular unsupervised machine learning algorithms, known as K-means.

Description

K-means is a centroid-based algorithm, also described as a distance-based algorithm: we calculate distances to assign each point to a cluster. In K-means, each cluster is associated with a centroid. We define a target number k, which is the number of centroids we need in the dataset. A centroid is the imaginary or real location representing the center of a cluster. Each data point is allocated to exactly one cluster in a way that reduces the within-cluster sum of squares.

In other words, the K-means algorithm identifies k centroids and then assigns every data point to the closest one, while keeping the clusters as compact as possible. The “means” in K-means refers to the averaging of the data, which is how each centroid is found.
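As a minimal sketch of the quantity being reduced, the within-cluster sum of squares can be computed by hand with NumPy. The four points and their cluster labels below are made-up toy values chosen so the arithmetic is easy to check; they are not part of the dataset used later in this post.

```python
import numpy as np

# Toy data (hypothetical): two points near x=0 and two points near x=10.
points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
labels = np.array([0, 0, 1, 1])  # assumed cluster assignment

wcss = 0.0
for cluster in np.unique(labels):
    members = points[labels == cluster]
    centroid = members.mean(axis=0)            # the "mean" in K-means
    wcss += ((members - centroid) ** 2).sum()  # squared distances to the centroid

print(wcss)  # 4.0
```

Each point sits exactly one unit away from its cluster's centroid, so the four squared distances add up to 4.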

Working of the K-means algorithm

  • It begins by selecting K random points and setting them as cluster centroids.
  • After that, it assigns every data point to the closest centroid to it to make K clusters.
  • Then, it determines a new centroid for the newly formed clusters.
  • It then goes back to the assignment step and reassigns every data point to the closest of the updated centroids.
  • The assignment and update steps are repeated until the centroids stop moving.
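The steps above can be sketched in a few lines of NumPy. This is an illustrative implementation written for this post, not the code Scikit-learn runs internally; the function name kmeans_sketch, its parameters, and the choice to keep an old centroid when a cluster ends up empty are all assumptions made for the example.

```python
import numpy as np

def kmeans_sketch(X, k, max_iter=100, tol=1e-4, seed=0):
    """Hypothetical helper: plain K-means with random initial centroids."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]

    def assign(centers):
        # Distance from every point to every centroid, then the closest one.
        distances = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        return distances.argmin(axis=1)

    for _ in range(max_iter):
        # Step 2: assign every point to its closest centroid.
        labels = assign(centroids)
        # Step 3: move each centroid to the mean of its points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        # Stop once the centroids have (almost) stopped moving.
        if shift <= tol:
            break

    # Final assignment so the labels match the returned centroids.
    return centroids, assign(centroids)

# Two well-separated groups of 50 points each, similar to the data used later.
rng = np.random.default_rng(1)
X = np.vstack([-2 * rng.random((50, 2)), 1 + 2 * rng.random((50, 2))])
centroids, labels = kmeans_sketch(X, k=2)
print(centroids)
```

In practice the Scikit-learn KMeans class is the better choice, since it adds smarter initialization and several restarts.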

The algorithm is considered to have converged, and we stop, when the centroids no longer move much. As we can see, this is an iterative algorithm: it keeps iterating until it converges. However, we can limit the number of iterations by setting the max_iter hyperparameter. Furthermore, we can choose to tolerate larger centroid movements and stop earlier by setting the tol hyperparameter to a bigger value. Different choices of initial cluster centroids can lead to different results. Setting the algorithm’s init hyperparameter to k-means++ makes sure the initial centroids are distant from each other. This normally leads to better results than random initialization. The number of clusters K is likewise provided through the n_clusters hyperparameter.
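The hyperparameters mentioned above map onto arguments of Scikit-learn's KMeans class. The sketch below sets them explicitly on some made-up data; the specific values, the seeded data, and the random_state setting are illustrative assumptions rather than recommendations.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up data: two groups of 50 points each (seed chosen arbitrarily).
rng = np.random.default_rng(0)
X = np.vstack([-2 * rng.random((50, 2)), 1 + 2 * rng.random((50, 2))])

model = KMeans(
    n_clusters=2,      # K, the number of clusters
    init="k-means++",  # spread the initial centroids apart
    n_init=10,         # number of restarts with different initial centroids
    max_iter=300,      # upper limit on iterations per restart
    tol=1e-4,          # larger values stop earlier
    random_state=0,    # assumption: fixed only to make the run repeatable
)
model.fit(X)

print(model.n_iter_)   # iterations used by the best restart
print(model.inertia_)  # within-cluster sum of squares of the result
```

After fitting, n_iter_ shows how quickly the run converged, which is a simple way to see the effect of changing tol or max_iter.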

Working with the K-means algorithm in Python

To illustrate K-means clustering, we’ll use the Scikit-learn library and some random data.

1. Import libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
%matplotlib inline

As the code above shows, we import the following libraries in our project:

  • Pandas for reading and writing spreadsheets
  • NumPy for efficient numerical computations
  • Matplotlib for data visualization
  • Scikit-learn for the KMeans implementation

2. Create random data

  • Below is the code for creating some random data in a two-dimensional space:
X= -2 * np.random.rand(100,2)
X1 = 1 + 2 * np.random.rand(50,2)
X[50:100, :] = X1
plt.scatter(X[:, 0], X[:, 1], s=50, c='b')
plt.show()
  • A total of 100 data points are created and divided into two groups of 50 points each.
  • The figure below shows how the data looks in two-dimensional space:
[Figure: Create random data]
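Because the data is random, every run produces different points. One possible way to make the example repeatable, sketched here as an assumption rather than something the original code does, is to draw the same two groups from a seeded NumPy generator. The seed value 42 and the names rng, group_a and group_b are arbitrary choices for illustration.

```python
import numpy as np

# Seeded generator: the same points are produced on every run.
rng = np.random.default_rng(42)

# First group: 50 points with both coordinates between -2 and 0.
group_a = -2 * rng.random((50, 2))
# Second group: 50 points with both coordinates between 1 and 3.
group_b = 1 + 2 * rng.random((50, 2))

# Stack the groups so rows 0-49 are group A and rows 50-99 are group B.
X = np.vstack([group_a, group_b])
print(X.shape)  # (100, 2)
```

The resulting array has the same layout as X in the code above, so it can be passed to plt.scatter and KMeans in the same way.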
3. Use Scikit-learn
  • To process the randomly created data, we’ll use some of the functions available in the Scikit-learn library, as shown in the code below.
from sklearn.cluster import KMeans
Kmean = KMeans(n_clusters=2)
Kmean.fit(X)
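  • Here we have given k (n_clusters) an arbitrary value of two.
  • Running the code prints the K-means estimator and its parameters. The output below comes from an older Scikit-learn release; newer releases print a shorter representation and no longer have some of these parameters (for example n_jobs and precompute_distances):
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
n_clusters=2, n_init=10, n_jobs=1, precompute_distances='auto',
random_state=None, tol=0.0001, verbose=0)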

4. Finding the centroid

  • Use the following code to find the center of the clusters:
Kmean.cluster_centers_
  • The centroid values would look like this (the exact numbers depend on the random data):
array([[-0.94665068, -0.97138368],
 [ 2.01559419, 2.02597093]])
  • Display the cluster centroids with green and red colors.
plt.scatter(X[:, 0], X[:, 1], s=50, c='b')
plt.scatter(-0.94665068, -0.97138368, s=200, c='g', marker='s')
plt.scatter(2.01559419, 2.02597093, s=200, c='r', marker='s')
plt.show()

Output

[Figure: Finding the centroid]
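To connect the centroids back to the “mean” in K-means, the sketch below checks that each fitted centroid matches the average of the points assigned to its cluster. The seeded data and the random_state value are assumptions made so the example is repeatable; for well-separated data like this the two should agree up to floating-point error.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up data: two groups of 50 points each (seed chosen arbitrarily).
rng = np.random.default_rng(0)
X = np.vstack([-2 * rng.random((50, 2)), 1 + 2 * rng.random((50, 2))])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
centers = model.cluster_centers_

# Average the points of each cluster by hand.
manual_means = np.array([
    X[model.labels_ == j].mean(axis=0) for j in range(2)
])

print(np.allclose(centers, manual_means, atol=1e-6))
```

Reading the centroids from cluster_centers_ also avoids typing their coordinates by hand: for example, plt.scatter(centers[:, 0], centers[:, 1], s=200, marker='s') plots whatever centroids the current run produced.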

5. Algorithm test

Kmean.labels_
  • Running the code above gives the following result:
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
  • As shown above, 50 data points belong to cluster 0, and the remaining 50 points belong to cluster 1.
  • Use the code below to predict the cluster of a data point:
sample_test=np.array([-3.0,-3.0])
second_test=sample_test.reshape(1, -1)
Kmean.predict(second_test)
  • The result would be:
array([0])
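Prediction in K-means amounts to finding the closest centroid. The sketch below, which uses seeded data and a random_state value assumed for repeatability, compares the result of predict with a manual nearest-centroid lookup and counts how many points fall in each cluster.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up data: two groups of 50 points each (seed chosen arbitrarily).
rng = np.random.default_rng(0)
X = np.vstack([-2 * rng.random((50, 2)), 1 + 2 * rng.random((50, 2))])
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# The test point from above, already shaped as one row with two features.
sample = np.array([[-3.0, -3.0]])

# Manual prediction: index of the centroid closest to the sample.
distances = np.linalg.norm(model.cluster_centers_ - sample, axis=1)
nearest = int(distances.argmin())
predicted = int(model.predict(sample)[0])
print(nearest == predicted)

# Number of training points per cluster label.
counts = np.bincount(model.labels_)
print(counts)
```

Note that the label numbers themselves are arbitrary: the group around (-1, -1) may be called cluster 0 in one run and cluster 1 in another, so the predicted label is only meaningful relative to the labels of the same fitted model.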

K-means clustering algorithm code in Python

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
%matplotlib inline

# Create 100 random points in two groups of 50
X = -2 * np.random.rand(100, 2)
X1 = 1 + 2 * np.random.rand(50, 2)
X[50:100, :] = X1
plt.scatter(X[:, 0], X[:, 1], s=50, c='b')
plt.show()

# Fit K-means with two clusters
Kmean = KMeans(n_clusters=2)
Kmean.fit(X)

# Plot the data with the fitted centroids (green and red squares)
centers = Kmean.cluster_centers_
plt.scatter(X[:, 0], X[:, 1], s=50, c='b')
plt.scatter(centers[0, 0], centers[0, 1], s=200, c='g', marker='s')
plt.scatter(centers[1, 0], centers[1, 1], s=200, c='r', marker='s')
plt.show()

# Cluster label of every training point
Kmean.labels_

# Predict the cluster of a new point
sample_test = np.array([-3.0, -3.0])
second_test = sample_test.reshape(1, -1)
Kmean.predict(second_test)