Hello there,
In this post, I've implemented unsupervised clustering of the Iris dataset using Gaussian mixture models (GMM) in Python. A detailed introduction to GMMs is available on this Wikipedia page. The original implementation of the code was done by McDickenson (available here on GitHub), considering a two-component Gaussian mixture model as input. Here, I've modified the code to use the Iris data as input in 2D.
Code begins
1. Importing necessary libraries and initial settings
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import norm
import random as rand
from sys import maxsize
from sklearn import datasets
2. Importing data from the Iris dataset
iris = datasets.load_iris()
X = iris.data[:,:2]
Y = iris.target
Y = Y + 1  # shift labels from {0, 1, 2} to {1, 2, 3}
Note that I've considered only two dimensions of the Iris data (which is normally four-dimensional) for easier visualisation. iris.target holds the labels for the data, so Y is simply an array of labels.
Making a dictionary and dataframe out of the available data
data = {'x':X[:,0],'y':X[:,1], 'label':Y}
df = pd.DataFrame(data=data)
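As an optional sanity check (not in the original code), the first few rows of the assembled dataframe can be inspected:
# Peek at the dataframe: one row per flower, with the two
# chosen features and the true species label
print(df.head())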
3. Visualising the actual dataset
Before implementing any ML algorithm, it's always good to visualise the data. Let us do this.
fig = plt.figure()
plt.scatter(data['x'], data['y'], s=34, c=Y)
Here, c=Y assigns a colour to each point according to its label.
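Optionally (my addition, not in the original code), axis labels and an explicit show call make the plot clearer; the feature names come from sklearn's Iris metadata:
plt.xlabel(iris.feature_names[0])  # 'sepal length (cm)'
plt.ylabel(iris.feature_names[1])  # 'sepal width (cm)'
plt.show()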
4. Initialization of parameters
Gaussian mixture models are fitted using the Expectation-Maximisation (EM) algorithm. EM starts from an initial guess of the model parameters; the E-step computes the expected cluster assignments given the current parameters, and the M-step updates the parameters to maximise the likelihood under those assignments. The two steps alternate until convergence.
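To make the E-step concrete, here is a minimal sketch (my own illustration, not part of the code below) of how the posterior probability of a point belonging to each component could be computed with scipy's multivariate normal; mus, sigs and lams are assumed to hold per-component means, covariance matrices and mixture weights:
import numpy as np
from scipy.stats import multivariate_normal

def posterior(point, mus, sigs, lams):
    # Unnormalised: weight_k * N(point; mu_k, sig_k) for each component k
    weighted = np.array([lam * multivariate_normal.pdf(point, mean=mu, cov=sig)
                         for mu, sig, lam in zip(mus, sigs, lams)])
    # Normalise so the entries sum to one: P(component k | point)
    return weighted / weighted.sum()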
So, we first need to supply initial parameters: the mean (mu), the variance (sig), and the mixture weight, i.e. the probability that a point is drawn from each component (lambda). Since we are after three clusters, we sample three data points at random from the input data as the initial means. Choosing a point outside the range of the dataset leads to unexpected behaviour and you will not see clustering happen, so the chosen initial means should lie within the dataset's range.
Choose three random indices from the dataset
k1 = rand.randrange(len(X))
k2 = rand.randrange(len(X))
k3 = rand.randrange(len(X))
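One caveat: the three randrange calls above can occasionally pick the same index twice, giving two clusters identical starting means. A small defensive alternative (my addition, not in the original) samples the indices without replacement:
# Sample three distinct row indices so no two initial means coincide
k1, k2, k3 = np.random.choice(len(X), size=3, replace=False)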
Make initial guesses using the index numbers
guess = {'mu1': [X[k1, 0], X[k1, 1]],
         'sig1': [[1, 0], [0, 1]],
         'mu2': [X[k2, 0], X[k2, 1]],
         'sig2': [[2, 0], [0, 1]],
         'mu3': [X[k3, 0], X[k3, 1]],
         'sig3': [[0.5, 0], [0, 1]],
         'lambda': [0.3, 0.3, 0.4]}
Some other initial settings
shift = maxsize  # start with a huge shift so the loop runs at least once
epsilon = 0.01
iters = 0
df_copy = df.copy()
Initially, assign all data points randomly to one of the clusters
df_copy['label'] = np.random.choice(3, len(df)) + 1  # random label in {1, 2, 3} for every point
Then, store the parameter guesses in a dataframe
params = pd.DataFrame.from_dict(guess, orient = 'index')
params = params.transpose()
5. Loop until convergence
while shift > epsilon:
    iters += 1
    # Expectation step
    updated_labels = expectation(df_copy.copy(), params)
    # Maximization step
    updated_parameters = maximization(updated_labels, params.copy())
    # Update the shift distance
    shift = distance(params, updated_parameters)
    # Finally, carry the new labels and parameters into the next iteration
    df_copy = updated_labels
    params = updated_parameters
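Once the loop exits, it can be handy (an optional addition of mine) to report how many iterations were needed:
print('Converged after', iters, 'iterations; final shift =', shift)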
The distance function here evaluates whether the cluster means are still moving between iterations. In other words, it acts as the convergence criterion: if the change is very small (less than the epsilon threshold), it's time to stop the algorithm. This function looks like
Function: distance()
def distance(old_params, new_params):
    dist = 0
    for param in ['mu1', 'mu2', 'mu3']:
        # Each mean has exactly two coordinates (x and y); iterating over
        # the whole column would also hit the padding entry and fail
        for i in range(2):
            dist += (old_params[param][i] - new_params[param][i]) ** 2
    return dist ** 0.5
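As a quick illustration (numbers invented for the example): if only mu1 moves, from (5.0, 3.4) to (5.1, 3.5), the shift is sqrt(0.1^2 + 0.1^2) ≈ 0.14, which is still above our epsilon of 0.01, so the loop would continue.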
The expectation function assigns each data point the label of the component under which it has the highest weighted probability.
Function: expectation()
def expectation(dataFrame, parameters):
    for i in range(dataFrame.shape[0]):
        x = dataFrame['x'][i]
        y = dataFrame['y'][i]
        # Weighted probability of the point under each cluster
        p_cluster1 = prob([x, y], list(parameters['mu1']), list(parameters['sig1']), parameters['lambda'][0])
        p_cluster2 = prob([x, y], list(parameters['mu2']), list(parameters['sig2']), parameters['lambda'][1])
        p_cluster3 = prob([x, y], list(parameters['mu3']), list(parameters['sig3']), parameters['lambda'][2])
        # Label each point with its most probable cluster; the else branch
        # necessarily means cluster 3 had the highest probability
        if (p_cluster1 >= p_cluster2) & (p_cluster1 >= p_cluster3):
            dataFrame.loc[i, 'label'] = 1
        elif (p_cluster2 >= p_cluster1) & (p_cluster2 >= p_cluster3):
            dataFrame.loc[i, 'label'] = 2
        else:
            dataFrame.loc[i, 'label'] = 3
    return dataFrame
The function prob used above is defined as
Function: prob()
def prob(val, mu, sig, lam):
    p = lam
    # Treat the covariance as diagonal: multiply the 1-D normal densities
    # of each coordinate (sig[i][i] plays the role of the std deviation)
    for i in range(len(val)):
        p *= norm.pdf(val[i], mu[i], sig[i][i])
    return p
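Note that prob assumes an axis-aligned (diagonal) covariance, multiplying independent 1-D densities. If you wanted full covariance matrices, scipy's multivariate normal could be used instead; a sketch (my addition, not part of the original code), assuming mu is a plain length-2 mean and cov a 2x2 covariance matrix without the padding entry used in the dataframe:
from scipy.stats import multivariate_normal

def prob_full_cov(val, mu, cov, lam):
    # Joint density under one 2-D Gaussian, scaled by its mixture weight
    return lam * multivariate_normal.pdf(val, mean=mu, cov=cov)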
Now, the function maximization looks like
Function: maximization()
def maximization(dataFrame, parameters):
    points_assigned_to_cluster1 = dataFrame[dataFrame['label'] == 1]
    points_assigned_to_cluster2 = dataFrame[dataFrame['label'] == 2]
    points_assigned_to_cluster3 = dataFrame[dataFrame['label'] == 3]
    # Mixture weights: fraction of points currently assigned to each cluster
    percent_assigned_to_cluster1 = len(points_assigned_to_cluster1) / float(len(dataFrame))
    percent_assigned_to_cluster2 = len(points_assigned_to_cluster2) / float(len(dataFrame))
    percent_assigned_to_cluster3 = 1 - percent_assigned_to_cluster1 - percent_assigned_to_cluster2
    parameters['lambda'] = [percent_assigned_to_cluster1, percent_assigned_to_cluster2, percent_assigned_to_cluster3]
    # New means and (diagonal) spreads from the points in each cluster;
    # the trailing None pads each list to the parameter dataframe's three rows
    parameters['mu1'] = [points_assigned_to_cluster1['x'].mean(), points_assigned_to_cluster1['y'].mean(), None]
    parameters['mu2'] = [points_assigned_to_cluster2['x'].mean(), points_assigned_to_cluster2['y'].mean(), None]
    parameters['mu3'] = [points_assigned_to_cluster3['x'].mean(), points_assigned_to_cluster3['y'].mean(), None]
    parameters['sig1'] = [[points_assigned_to_cluster1['x'].std(), 0], [0, points_assigned_to_cluster1['y'].std()], None]
    parameters['sig2'] = [[points_assigned_to_cluster2['x'].std(), 0], [0, points_assigned_to_cluster2['y'].std()], None]
    parameters['sig3'] = [[points_assigned_to_cluster3['x'].std(), 0], [0, points_assigned_to_cluster3['y'].std()], None]
    return parameters
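One practical caveat (my observation, not handled by the original code): if a cluster ends up with no points, its mean and std come back as NaN and the run degenerates. A simple guard, placed inside maximization before the new parameters are computed, is to re-seed such a cluster from a random data point, e.g.:
if len(points_assigned_to_cluster1) == 0:
    # Hypothetical guard: restart the empty cluster at a random data point
    k = rand.randrange(len(X))
    parameters['mu1'] = [X[k, 0], X[k, 1], None]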
Now that we have all the functions in place, putting everything together and running the code produces the following results.
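For reference, a minimal sketch of how the clustered output can be plotted, reusing the scatter style from step 3 (my addition; the original post shows the resulting images directly):
plt.scatter(df_copy['x'], df_copy['y'], s=34, c=df_copy['label'])
plt.title('Clusters found by the GMM')
plt.show()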
[Figure: the actual 2D Iris data sample]
[Figure: the three clusters found by the GMM algorithm]
Modifying the same code for two clusters gave the following result after just two iterations.
[Figure: the two clusters found by the GMM algorithm]
Hope this post helped you.
You can download the full code from the GitHub repository here.