Thursday 29 June 2017

K-Means scikit implementation on Gaussian mixture sample data


Hello there,

In this post, I've applied a simple K-Means algorithm (in Python) to a dataset sampled from a Gaussian mixture of, say, n components. A very good introduction to K-Means is available on its Wikipedia page, and the official scikit-learn documentation has a well-documented implementation of K-Means on the Iris dataset, along with the code (in Python).

Here, I've modified the code from the scikit-learn documentation/tutorial section and used a dataset sampled from n Gaussian components, each with different parameters.

Code begins


1. Importing necessary libraries and initial settings

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans 

2. Initialization of parameters

In this example, we consider data points drawn from three different Gaussian models, each with two features. Since there are only two features, the data live in 2D and can be visualised clearly; every data point has the form (x,y). If you try with, say, 4 features, the data points would instead be of the form (x,y,z,r), and so on. Let us first try the three-cluster case.

Each Gaussian is defined by its mean, which acts as the true centre of one cluster, and its covariance matrix. Let us consider the following manual mean and covariance settings

mu1 = [0,-2]
sig1 = [[2,0],[0,3]]
mu2 = [5,0]
sig2 = [[3,0],[0,1]]
mu3 = [3,0]
sig3 = [[1, 0],[0, 4]]

Note that sig here is the covariance matrix of the Gaussian and mu is its mean. The covariance matrices above are diagonal, meaning the two features are uncorrelated within each cluster. In general, a covariance matrix must be symmetric and positive semi-definite; in particular, the variances on its diagonal cannot be negative. If a negative variance were used, the matrix would not define a valid distribution, and NumPy would warn that the covariance is not positive semi-definite.
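As a quick sanity check, here is a minimal sketch (a hypothetical helper, not part of the original code) that verifies a covariance matrix is symmetric and positive semi-definite before sampling from it:

def is_valid_covariance(sig, tol=1e-10):
    # Hypothetical helper: validate a covariance matrix before sampling.
    sig = np.asarray(sig, dtype=float)
    # Must be symmetric ...
    if not np.allclose(sig, sig.T):
        return False
    # ... and positive semi-definite: all eigenvalues >= 0 (up to tolerance).
    return bool(np.all(np.linalg.eigvalsh(sig) >= -tol))

print(is_valid_covariance(sig1))               # True
print(is_valid_covariance([[1, 0], [0, -4]]))  # False: negative variance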


3. Sampling data from the Gaussian models

We draw 100 data points from each Gaussian distribution. The code for this looks like

X1,y1 = np.random.multivariate_normal(mu1,sig1,100).T
X2,y2 = np.random.multivariate_normal(mu2,sig2,100).T
X3,y3 = np.random.multivariate_normal(mu3,sig3,100).T  

However, the input to K-Means will not be three different sets; it has to be a single dataset, so we club them all together into a single array.

x = np.concatenate((X1,X2,X3))
y = np.concatenate((y1,y2,y3))

x and y each have shape (300,), but for K-Means fitting (or, commonly, for any machine learning algorithm in scikit-learn) we need the data in the form m x n, where m is the number of samples and n is the number of features. So, altering our dataset

x = x.reshape(300, 1)
y = y.reshape(300, 1)

Now that we have the x and y data separately, it's time to join them together into (x,y) pairs

X = np.column_stack((x,y))
labels = ([1]*100)+([2]*100)+([3]*100)

labels above will be used in the scatter plots to visualise the data in different colours: the first 100 points are labelled 1, the next 100 are labelled 2, and the last 100 are labelled 3.

4. Visualising the actual dataset

Before implementing any ML algorithm, it's always good to visualise the data. Let us do this

fig = plt.figure()
plt.scatter(x, y, c = labels)
plt.show()

Here, c = labels colours each point according to its label. Now that you have seen the initial plot, let's implement K-Means using scikit-learn, which is pretty straightforward.

5. KMeans in Scikit-learn

estimator = KMeans(n_clusters=3, max_iter=100)

The snippet above creates estimator, a K-Means model with the number of clusters set to 3 and at most 100 iterations. After this initialisation, we fit the model to the available data and then predict the cluster labels for those data points.

esti_fit = estimator.fit(X)
y_predict = esti_fit.predict(X)
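Once fitted, the estimator exposes the learned centroids and a few useful diagnostics (standard attributes of scikit-learn's KMeans):

print(esti_fit.cluster_centers_)  # learned centroids; compare with mu1, mu2, mu3
print(esti_fit.inertia_)          # sum of squared distances of points to their centroids
print(esti_fit.n_iter_)           # number of iterations actually run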

Internally, scikit-learn does the looping: at every iteration, each data point is reassigned to its nearest centroid and each centroid is recomputed as the mean of its assigned points; the algorithm stops when the centroids move by less than a small tolerance (or when max_iter is reached).
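To make that loop concrete, here is a minimal from-scratch sketch in NumPy (an illustration only, not scikit-learn's actual implementation, which adds k-means++ initialisation and many optimisations):

def kmeans_sketch(X, k=3, max_iter=100, tol=1e-4):
    rng = np.random.default_rng(0)
    # Start from k randomly chosen data points as initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Assignment step: label each point with its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assignments = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points.
        # (This sketch does not handle the rare empty-cluster case.)
        new_centroids = np.array([X[assignments == j].mean(axis=0) for j in range(k)])
        # Stop when the centroids barely move.
        if np.linalg.norm(new_centroids - centroids) < tol:
            break
        centroids = new_centroids
    return centroids, assignments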

6. Visualizing the clustering output

Finally, it's time to see how our K-Means has clustered the sampled data

fig = plt.figure()
plt.scatter(x, y, c = y_predict)
plt.title('Kmean for cluster size = 3')
plt.show()
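To see how well the fit recovered the true means, you can also re-plot the clusters with the learned centroids overlaid (a small optional addition):

fig = plt.figure()
plt.scatter(x, y, c = y_predict)
centres = esti_fit.cluster_centers_
plt.scatter(centres[:, 0], centres[:, 1], c='red', marker='x', s=100)  # learned centroids
plt.title('KMeans clusters with learned centroids')
plt.show()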

As you can see, with the scikit-learn library, implementing machine learning algorithms is pretty easy; you should still understand how these algorithms actually work by going through the theory behind what the library runs. Here, c = y_predict colours each point according to the cluster label K-Means assigned to it.

7. Outputs

The plots below show the results from a few trial runs.

Actual data sampled from three different Gaussian mixtures

Clustered data with n_cluster = 3

Clustered data with n_cluster = 2

Clustered data with n_cluster = 4


Below are a few challenges you can try for learning purposes:
  • Try changing the input settings: the mean and sigma values of the Gaussians
  • Try different numbers of clusters, e.g. k = 2, 4, 5, 8 (see the sketch after this list)
  • Try sampling fewer and/or more data points from the Gaussians (instead of the 300 used in this example)
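For the second challenge, a common way to compare different values of k is to fit K-Means for each k and plot the inertia (within-cluster sum of squares), looking for an 'elbow'; here is a minimal sketch:

inertias = []
ks = range(1, 9)
for k in ks:
    km = KMeans(n_clusters=k, max_iter=100).fit(X)
    inertias.append(km.inertia_)  # within-cluster sum of squared distances

plt.figure()
plt.plot(list(ks), inertias, 'o-')
plt.xlabel('number of clusters k')
plt.ylabel('inertia')
plt.show()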


I hope this post helped you implement a very powerful unsupervised algorithm with just a few lines of code. Happy coding :)

