K-Means_Clustering — Prediction using Unsupervised ML
Author Name: Jigar Shekhat
Project Topic: Data Science & Business Analytics
Intern at The Sparks Foundation
Task no. 2: Prediction using Unsupervised ML
Download the project:
http://github.com/Jigar710/Prediction_using_Unsupervised_ML
Here we explore unsupervised machine learning in Python: we estimate the optimum number of clusters for the iris dataset and visualize the resulting clusters.
K-Means
K-Means clustering is an unsupervised learning algorithm used to solve clustering problems in machine learning and data science.
It is a centroid-based algorithm in which each cluster is associated with a centroid. Its aim is to minimize the sum of squared distances between each data point and the centroid of its cluster.
The algorithm takes an unlabeled dataset as input, divides it into k clusters, and repeats its assignment and update steps until the clusters stop changing (or an iteration limit is reached). The value of k must be chosen in advance.
The k-means clustering algorithm mainly performs two tasks:
- Determines the best positions for the K center points (centroids) by an iterative process.
- Assigns each data point to its closest centroid. The data points nearest to a given centroid form a cluster.
More info about K-Means Clustering
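The two tasks above can be sketched in a few lines of NumPy. This is a minimal illustration, not the scikit-learn implementation used later in this post: the function name `kmeans_sketch` and the toy points are hypothetical, and it starts from randomly chosen data points rather than `k-means++`.

```python
import numpy as np

def kmeans_sketch(points, k, max_iter=100, seed=0):
    """Minimal k-means loop (assign, then update); illustrative only."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points picked at random (not k-means++)
    start = rng.choice(len(points), size=k, replace=False)
    centroids = points[start].astype(float)
    for _ in range(max_iter):
        # Assignment step: distance from every point to every centroid
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points
        new_centroids = centroids.copy()
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                new_centroids[j] = members.mean(axis=0)
        # Stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Toy data: two well-separated groups of 2-D points (made up for illustration)
toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])
labels, centroids = kmeans_sketch(toy, k=2)
print(labels)
print(centroids)
```

On this toy data the first three points should end up in one cluster and the last three in the other, although which group is numbered 0 depends on the random start.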
1. Importing all the required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
2. Loading the iris Dataset
#from sklearn import datasets
#iris = datasets.load_iris()
#iris_df = pd.DataFrame(iris.data, columns = iris.feature_names)
iris_df = pd.read_csv("iris.csv")
print(iris_df.head()) # See the first 5 rows
3. Finding the optimum number of clusters
x = iris_df.iloc[:, [0, 1, 2, 3]].values
from sklearn.cluster import KMeans
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters = i, init = 'k-means++',
                    max_iter = 300, n_init = 10, random_state = 0)
    kmeans.fit(x)
    wcss.append(kmeans.inertia_)
# Plotting the results onto a line graph,
# allowing us to observe 'The elbow'
plt.plot(range(1, 11), wcss)
plt.title('The elbow method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS') # Within cluster sum of squares
plt.show()
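To make the y-axis of the elbow plot concrete, here is a small sketch of how WCSS can be computed by hand: the sum of squared distances from each point to the centroid of its assigned cluster. The helper name `wcss_by_hand` and the toy numbers are made up for illustration.

```python
import numpy as np

def wcss_by_hand(points, labels, centroids):
    """Sum of squared Euclidean distances from each point to its assigned centroid."""
    diffs = points - centroids[labels]
    return float((diffs ** 2).sum())

# Toy example (made-up numbers): two clusters on a line
points = np.array([[0.0], [2.0], [10.0], [14.0]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([[1.0], [12.0]])
print(wcss_by_hand(points, labels, centroids))  # 1 + 1 + 4 + 4 = 10.0
```

For a fitted scikit-learn model, `wcss_by_hand(x, kmeans.labels_, kmeans.cluster_centers_)` should agree with `kmeans.inertia_` up to floating-point rounding.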
4. Applying k means to the dataset
# Applying k-means to the dataset / creating the k-means model
kmeans = KMeans(n_clusters = 3, init = 'k-means++',
                max_iter = 300, n_init = 10, random_state = 0)
y_kmeans = kmeans.fit_predict(x)
5. Plotting the Clusters
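#Visualising the clusters on the first two feature columns
#(the species names in the labels assume cluster 0 is setosa, 1 versicolour, 2 virginica)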
plt.scatter(x[y_kmeans == 0, 0], x[y_kmeans == 0, 1], s = 100, c = 'purple', label = 'Iris-setosa')
plt.scatter(x[y_kmeans == 1, 0], x[y_kmeans == 1, 1], s = 100, c = 'orange', label = 'Iris-versicolour')
plt.scatter(x[y_kmeans == 2, 0], x[y_kmeans == 2, 1], s = 100, c = 'green', label = 'Iris-virginica')
#Plotting the centroids of the clusters
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:,1], s = 100, c = 'red', label = 'Centroids')
plt.legend()
6. 3D scatterplot using matplotlib
fig = plt.figure(figsize = (15,15))
ax = fig.add_subplot(111, projection='3d')
# Use the first three feature columns as the x, y and z coordinates
ax.scatter(x[y_kmeans == 0, 0], x[y_kmeans == 0, 1], x[y_kmeans == 0, 2], s = 50, c = 'purple', label = 'Iris-setosa')
ax.scatter(x[y_kmeans == 1, 0], x[y_kmeans == 1, 1], x[y_kmeans == 1, 2], s = 50, c = 'orange', label = 'Iris-versicolour')
ax.scatter(x[y_kmeans == 2, 0], x[y_kmeans == 2, 1], x[y_kmeans == 2, 2], s = 50, c = 'green', label = 'Iris-virginica')
#Plotting the centroids of the clusters
ax.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:,1], kmeans.cluster_centers_[:, 2], s = 50, c = 'red', label = 'Centroids')
ax.legend()
plt.show()
7. Labeling the predictions
# Assuming cluster 0 corresponds to 'Iris-setosa',
# 1 to 'Iris-versicolour' and 2 to 'Iris-virginica'
# (k-means numbers its clusters arbitrarily, so check this against the plots)
species_names = np.array(['Iris-setosa', 'Iris-versicolour', 'Iris-virginica'])
y_kmeans = species_names[y_kmeans]  # look up the name for each integer cluster id
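Because k-means assigns cluster numbers arbitrarily, the fixed 0/1/2 mapping above is an assumption. When the true species are available (for example from a species column in `iris.csv`, if it has one), the mapping can instead be derived by majority vote over the integer cluster ids. The sketch below uses a hypothetical helper, `map_clusters_to_species`, and made-up labels.

```python
from collections import Counter

def map_clusters_to_species(cluster_ids, true_species):
    """Map each cluster id to the species that occurs most often inside it."""
    mapping = {}
    for cluster in sorted(set(cluster_ids)):
        members = [s for c, s in zip(cluster_ids, true_species) if c == cluster]
        mapping[cluster] = Counter(members).most_common(1)[0][0]
    return mapping

# Made-up example: the cluster numbering does not follow the species order
cluster_ids = [1, 1, 1, 0, 0, 2, 2, 2, 0]
true_species = (['Iris-setosa'] * 3 + ['Iris-versicolour'] * 3
                + ['Iris-virginica'] * 3)
print(map_clusters_to_species(cluster_ids, true_species))
```

In this example cluster 1 maps to setosa, cluster 0 to versicolour and cluster 2 to virginica, which a hard-coded 0/1/2 assignment would get wrong.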
8. Adding the prediction to the dataset
data_with_clusters = iris_df.copy()
data_with_clusters["Cluster"] = y_kmeans
print(data_with_clusters.head(5))
9. Bar plot: Cluster Distribution
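# Bar plot of the number of points in each cluster
sns.set_style('darkgrid')
# Take names and counts from the same Series so each bar gets the right label
cluster_counts = data_with_clusters["Cluster"].value_counts()
sns.barplot(x = cluster_counts.index,
            y = cluster_counts.values,
            palette=sns.color_palette(["#e74c3c", "#34495e", "#2ecc71"]));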
10. Pair Plot
# hue = 'Cluster' colours each point by its assigned cluster,
# giving 3 colours in the plot
sns.set_style('whitegrid') # Sets grid style
sns.pairplot(data_with_clusters, hue = 'Cluster');
11. PairPlot insights
- Petal length and petal width appear to be positively correlated (the relationship looks roughly linear).
- Iris-Setosa has a smaller petal length and petal width than the other two groups.
- Overall, Iris-Setosa tends to have smaller dimensions than the other flowers, with sepal width as the notable exception.
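The first insight can be checked numerically with a Pearson correlation coefficient. The sketch below uses a hypothetical helper, `pearson_r`, and made-up measurements that only demonstrate the call.

```python
import numpy as np

def pearson_r(a, b):
    """Pearson correlation coefficient between two 1-D sequences."""
    return float(np.corrcoef(a, b)[0, 1])

# Made-up measurements, only to show the call
length = np.array([1.0, 2.0, 3.0, 4.0])
width = np.array([0.5, 1.0, 1.5, 2.0])
print(pearson_r(length, width))
```

For the iris data this could be called as `pearson_r(x[:, 2], x[:, 3])`, assuming columns 2 and 3 of `x` hold petal length and petal width, which depends on the layout of `iris.csv`. A value close to 1 would support the linear relationship seen in the pair plot.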