Hello there,

In this post, I've tried to apply a simple **K-Means algorithm** (in **Python**) to a dataset obtained from **Gaussian mixture models** of, say, n mixtures. A very good introduction to **K-Means** is available on its Wikipedia page, and very good documentation on an actual implementation of the K-Means algorithm on the Iris dataset, along with the code (in Python), is available on the __official scikit-learn page__.

Here, I've modified the code taken from the **scikit-learn** documentation/tutorial section and used a dataset that has been sampled from *n* *Gaussian mixtures* having different parameters for each mixture.

### Code begins

**1.**

__Importing necessary libraries and initial settings__

**import numpy as np**

**import matplotlib.pyplot as plt**

**from sklearn.cluster import KMeans**

**2.**

__Initialization of parameters__
In this example, we consider data points drawn from three different Gaussian models with two features each. Since we are dealing with two features, the plots will be in 2D, which makes the visualisation clear. So, all the data points in this example are in the form (x,y). If you try with, say, 4 features, then the data points would be represented in the form (x,y,z,r), etc. Let us first try the three-cluster case. As an initial setting, we define three candidate centres for the K-Means algorithm (note that the estimator created in step 5 does not use them and relies on scikit-learn's default initialisation).

**centers = [[1,1],[-1,-1],[1,-1]]**

We also need to initialise the parameters for these three Gaussian distributions. Assuming that we give manual inputs, let us consider the following mean and covariance settings:

**mu1 = [0,-2]**

**sig1 = [[2,0],[0,3]]**

**mu2 = [5,0]**

**sig2 = [[3,0],[0,1]]**

**mu3 = [3,0]**

**sig3 = [[1, 0],[0, 4]]**
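The covariance settings above can be sanity-checked before sampling. Below is a minimal sketch, assuming a hypothetical helper named `is_valid_covariance` (it is not part of NumPy or scikit-learn) that tests whether a matrix is square, symmetric and positive semi-definite.

```python
import numpy as np

def is_valid_covariance(sig, tol=1e-10):
    """Hypothetical helper: True if sig is square, symmetric and positive semi-definite."""
    sig = np.asarray(sig, dtype=float)
    if sig.ndim != 2 or sig.shape[0] != sig.shape[1]:
        return False
    if not np.allclose(sig, sig.T):
        return False
    # eigvalsh is intended for symmetric matrices; all eigenvalues must be >= 0
    return bool(np.all(np.linalg.eigvalsh(sig) >= -tol))

sig1 = [[2, 0], [0, 3]]
sig2 = [[3, 0], [0, 1]]
sig3 = [[1, 0], [0, 4]]

checks = [is_valid_covariance(s) for s in (sig1, sig2, sig3)]
print(checks)
```

A matrix with a negative variance on the diagonal, such as [[1, 0], [0, -1]], fails this check, while a non-diagonal matrix such as [[2, -1], [-1, 2]] passes it.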

Note that **sig** here is the covariance matrix of each model. In this example the matrices are diagonal, so the two features are uncorrelated and the diagonal entries are the variances of the features. In general, a covariance matrix must be symmetric and positive semi-definite, so a variance can never be negative. **mu** here defines the mean of the Gaussian distribution.

**3.**

__Sampling data from the Gaussian models__
We consider drawing 100 data points from each Gaussian distribution. The code for this would look like

**X1,y1 = np.random.multivariate_normal(mu1,sig1,100).T**

**X2,y2 = np.random.multivariate_normal(mu2,sig2,100).T**

**X3,y3 = np.random.multivariate_normal(mu3,sig3,100).T**

However, the input to K-Means cannot be three different sets; it has to be a single set, so we combine them all into a single array.

**x = np.concatenate((X1,X2,X3))**

**y = np.concatenate((y1,y2,y3))**

x and y will be one-dimensional arrays of length 300, but for K-Means fitting (or commonly, for any machine learning algorithm), we consider data in the form **m x n**, where **m** is the number of samples and **n** is the number of features. So, altering our data set

**x = x.reshape(300,1)**

**y= y.reshape(300,1)**

Now that we have x and y data separately, it's time to join them together to form (x,y) format

**X = np.column_stack((x,y))**

**labels = ([1]*100)+([2]*100)+([3]*100)**

labels above will be used in the scatter plots to visualise the data in different colors. The first 100 samples are labelled 1, the next 100 are labelled 2, and the last 100 are labelled 3 as an initial setting.
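The sampling, concatenation and reshaping steps can also be written more compactly. This is only a sketch of an alternative, assuming NumPy's newer `default_rng` generator and a seed of 0 picked purely for reproducibility; neither appears in the steps above.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed is an assumption, used only for reproducibility

mus = [[0, -2], [5, 0], [3, 0]]
sigs = [[[2, 0], [0, 3]], [[3, 0], [0, 1]], [[1, 0], [0, 4]]]
n_per_cluster = 100

# Each draw has shape (100, 2); stacking the three draws gives shape (300, 2)
X = np.vstack([rng.multivariate_normal(mu, sig, n_per_cluster)
               for mu, sig in zip(mus, sigs)])

# 100 ones, then 100 twos, then 100 threes
labels = np.repeat([1, 2, 3], n_per_cluster)

# Separate coordinates, handy for the scatter plots
x, y = X[:, 0], X[:, 1]
print(X.shape, labels.shape)
```

Here the data is already in the m x n form, so no reshape or column_stack is needed.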

**4.**

__Visualising the actual dataset__
Before implementing any ML algorithm, it's always good to visualise the data. Let us do this

**fig = plt.figure()**

**plt.scatter(x, y, c = labels)**

**c = labels** here is used to identify the label of each data point and assign a color accordingly.

Now that you have seen the initial plot, let's implement K-Means using scikit-learn, which is pretty straightforward and damn simple.

**5.**

__KMeans in Scikit-learn__

**estimator = KMeans(n_clusters=3, max_iter=100)**

The above snippet says that estimator is a K-Means implementation with the number of clusters set to 3 and at most 100 iterations. After initialising K-Means, we need to fit the model using the available data and then predict the labels for these data points.

**esti_fit = estimator.fit(X)**

**y_predict = esti_fit.predict(X)**
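As a variation on the fit and predict calls above, the centres from step 2 could be handed to the estimator as starting points, and the fitted model can be inspected afterwards. This is a sketch under assumptions: the data is regenerated here with a seeded generator so the block runs on its own, and `init` with `n_init=1` is a choice made for this illustration, not what the post's estimator uses.

```python
import numpy as np
from sklearn.cluster import KMeans

# Regenerate sample data so this block is self-contained (seed is an assumption)
rng = np.random.default_rng(0)
X = np.vstack([
    rng.multivariate_normal([0, -2], [[2, 0], [0, 3]], 100),
    rng.multivariate_normal([5, 0], [[3, 0], [0, 1]], 100),
    rng.multivariate_normal([3, 0], [[1, 0], [0, 4]], 100),
])

# Initial centres from step 2, passed explicitly as the starting centroids
centers = np.array([[1, 1], [-1, -1], [1, -1]], dtype=float)
estimator = KMeans(n_clusters=3, init=centers, n_init=1, max_iter=100)

# fit_predict fits the model and returns the label of each sample in one call
y_predict = estimator.fit_predict(X)

print(estimator.cluster_centers_)  # final centroids, shape (3, 2)
print(estimator.inertia_)          # sum of squared distances to the closest centroid
print(estimator.n_iter_)           # number of iterations actually run
```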

**sklearn** does the looping: in every iteration, the labels of some data points change, and the algorithm stops when the cluster centres move by less than some threshold between the previous and the current iteration (or when max_iter is reached).
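To make the looping concrete, here is a from-scratch sketch of the basic K-Means iteration (often called Lloyd's algorithm) in plain NumPy. It is an illustration only, not scikit-learn's actual implementation; the function name `lloyd_kmeans`, the `tol` value and the tiny dataset are all assumptions made for this example.

```python
import numpy as np

def lloyd_kmeans(X, init_centers, max_iter=100, tol=1e-4):
    """Illustrative K-Means loop: assign points, update centres, stop on small shift."""
    centers = np.asarray(init_centers, dtype=float).copy()
    for n_iter in range(1, max_iter + 1):
        # Assignment step: each point gets the index of its nearest centre
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        assigned = dists.argmin(axis=1)
        # Update step: each centre moves to the mean of its assigned points
        new_centers = centers.copy()
        for k in range(len(centers)):
            members = X[assigned == k]
            if len(members) > 0:  # keep the old centre if a cluster is empty
                new_centers[k] = members.mean(axis=0)
        # Stopping rule: total movement of the centres in this iteration
        shift = np.linalg.norm(new_centers - centers)
        centers = new_centers
        if shift < tol:
            break
    return centers, assigned, n_iter

# Tiny dataset with two obvious groups
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
centers, assigned, n_iter = lloyd_kmeans(X, init_centers=[[1.0, 1.0], [9.0, 9.0]])
print(centers, assigned, n_iter)
```

On this toy data the centres settle at the means of the two groups after the second pass, because the second pass no longer moves them.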

**6.**

__Visualizing the clustering output__
Finally, it's time to see how our K-Means has clustered the assumed sample data

**fig = plt.figure()**

**plt.scatter(x,y, c = y_predict)**

**plt.title('Kmean for cluster size = 3')**

**c = y_predict** here colors the points so that objects with the same predicted label share the same color.

As you can see, using the scikit-learn library, implementing Machine Learning algorithms is pretty easy, except that you need to know how these algorithms actually work by going through the theory that runs inside the library.

**7.**

__Outputs__

Below are the plots from the trial runs

Actual data sampled from three different Gaussian mixtures

Clustered data with n_clusters = 3

Clustered data with n_clusters = 2

Clustered data with n_clusters = 4

Below are a few challenges that you can try for learning purposes

- Try changing the input settings - centroids, mean and sigma values
- Try considering different clusters like k=2,4,5,8 etc
- Try sampling less and/or more data from the Gaussians (instead of 300 as taken in this example)
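As a starting point for the second challenge, the sketch below fits K-Means for several values of k and records the inertia (the sum of squared distances of samples to their closest centroid) for each. The seeded data generation, `n_init=10` and `random_state=0` are assumptions made so the block runs on its own and gives repeatable results.

```python
import numpy as np
from sklearn.cluster import KMeans

# Regenerate sample data so this block is self-contained (seed is an assumption)
rng = np.random.default_rng(0)
X = np.vstack([
    rng.multivariate_normal([0, -2], [[2, 0], [0, 3]], 100),
    rng.multivariate_normal([5, 0], [[3, 0], [0, 1]], 100),
    rng.multivariate_normal([3, 0], [[1, 0], [0, 4]], 100),
])

inertias = {}
for k in (2, 3, 4, 5, 8):
    est = KMeans(n_clusters=k, n_init=10, max_iter=100, random_state=0).fit(X)
    inertias[k] = est.inertia_
    print(k, round(est.inertia_, 1))
```

Inertia generally shrinks as k grows, so it is more useful for comparing the shape of the curve across k than for picking the smallest value.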

Hope this post helped you implement a very powerful unsupervised algorithm with just a few lines of code. Happy coding :)

