K-Means (Simplest Clustering Algorithm)

K-Means Algorithm

Introduction

K-Means is an unsupervised machine learning algorithm. It is one of the simplest algorithms for grouping data based on attribute values.

What is clustering?

Clustering is the process of grouping similar elements together.
For example, we can take the salaries of a set of persons and group them as lower middle class, middle class and upper middle class.

Let's assume the following example data set (persons and their salaries).

Input Data set

Person 1	20000
Person 2	6000
Person 3	7000
Person 4	11000
Person 5	15000
Person 6	19000
Person 7	30000
Person 8	25000
Person 9	4000
Person 10	8000

Persons with similar salaries can be grouped as shown below.

Lower Middle Class	Middle Class	Upper Middle Class
Person 2	Person 1	Person 7
Person 3	Person 4	Person 8
Person 9	Person 5	Person 1
Person 10	Person 6

Process of K-Means algorithm

The key concepts in the K-Means algorithm are the center points and the distance of each data value from them.

Let's take the above person salary example and cluster them using K-Means.

Step 1: In this example we need three groups, so let K=3 (K is the number of clusters required).

Step 2: Choose a center point value for every group. Here K is 3, so we need 3 center points. In the initial step, we can take any of the data values as center points.

Let's take the center points as C1=7000, C2=15000, C3=25000.
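The walkthrough picks its three starting values by hand. As a rough sketch of the "take any of the data" idea, the starting points could also be drawn at random from the data. The names salaries, k, rng and initial_centers are made up for this illustration, and the seed value is arbitrary.

```python
import random

# Salaries from the input data set (Person 1 .. Person 10).
salaries = [20000, 6000, 7000, 11000, 15000, 19000, 30000, 25000, 4000, 8000]
k = 3  # number of groups required

# A fixed seed (arbitrary choice) only makes the run repeatable.
rng = random.Random(42)

# sample() picks k distinct positions, so the same person is never chosen twice.
initial_centers = rng.sample(salaries, k)
print(initial_centers)
```

A different seed gives different starting points, and the number of iterations needed afterwards may change with them.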

Step 3: Calculate the distance (absolute difference) between each center point and each data value.

	Distance from C1	Distance from C2	Distance from C3
Person 1	13000	5000	5000
Person 2	1000	9000	19000
Person 3	0	8000	18000
Person 4	4000	4000	14000
Person 5	8000	0	10000
Person 6	12000	4000	6000
Person 7	23000	15000	5000
Person 8	18000	10000	0
Person 9	3000	11000	21000
Person 10	1000	7000	17000
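The distance table above can be reproduced with a few lines of Python. This is only a sketch for one-dimensional data such as salaries; the function name distance_table is chosen here for illustration.

```python
# Salaries from the input data set (Person 1 .. Person 10).
salaries = [20000, 6000, 7000, 11000, 15000, 19000, 30000, 25000, 4000, 8000]
# Initial center points C1, C2, C3 chosen in Step 2.
centers = [7000, 15000, 25000]

def distance_table(values, centers):
    """Return one row per value: the absolute difference to each center."""
    return [[abs(v - c) for c in centers] for v in values]

table = distance_table(salaries, centers)
for person, row in enumerate(table, start=1):
    # Tab-separated output, in the same column order as the table above.
    print(f"Person {person}", *row, sep="\t")
```

For data with more than one attribute per person, the absolute difference would have to be replaced by another distance measure, for example the Euclidean distance.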

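Step 4: Compare the distance values and assign each person to the group with the lowest distance value. When two distances are equal (Person 1 and Person 4 in this iteration), either of the tied groups may be chosen.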

Iteration 1:

	Distance from C1	Distance from C2	Distance from C3	Assigned Group
Person 1	13000	5000	5000	C3
Person 2	1000	9000	19000	C1
Person 3	0	8000	18000	C1
Person 4	4000	4000	14000	C1
Person 5	8000	0	10000	C2
Person 6	12000	4000	6000	C2
Person 7	23000	15000	5000	C3
Person 8	18000	10000	0	C3
Person 9	3000	11000	21000	C1
Person 10	1000	7000	17000	C1
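The assignment step might look like this in Python. Note one assumption in the sketch: ties are always given to the lower-numbered center, so Person 1 (equally far from C2 and C3) lands in C2 here, while the table above puts Person 1 in C3. Both choices are valid for a tie. The function name assign is made up for this example.

```python
# Salaries from the input data set (Person 1 .. Person 10).
salaries = [20000, 6000, 7000, 11000, 15000, 19000, 30000, 25000, 4000, 8000]
# Initial center points C1, C2, C3.
centers = [7000, 15000, 25000]

def assign(values, centers):
    """Return, for each value, the index of the closest center (0 -> C1, 1 -> C2, ...)."""
    # min() returns the first minimum it meets, so a tie goes to the lower index.
    return [
        min(range(len(centers)), key=lambda i: abs(v - centers[i]))
        for v in values
    ]

labels = assign(salaries, centers)
for person, label in enumerate(labels, start=1):
    print(f"Person {person}\tC{label + 1}")
```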

Step 5: Now it is time to adjust the center point values. Initially we selected the center points arbitrarily from the data, so they are unlikely to be the exact middle of their groups. To move each center point to the middle of its group, we calculate the average value of the group members (as assigned in the previous step) and take that value as the new center point.

The new center point values are:

C1=AVG(Person 2, Person 3, Person 4, Person 9, Person 10) = 7200
C2=AVG(Person 5, Person 6) = 17000
C3=AVG(Person 1, Person 7, Person 8) = 25000
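These averages can be checked with a short Python sketch. The labels list below is copied by hand from the Iteration 1 table (including Person 1 in the third group), and the function name update_centers is chosen for illustration only.

```python
# Salaries from the input data set (Person 1 .. Person 10).
salaries = [20000, 6000, 7000, 11000, 15000, 19000, 30000, 25000, 4000, 8000]
# Group of each person after Iteration 1 (0 -> C1, 1 -> C2, 2 -> C3).
labels = [2, 0, 0, 0, 1, 1, 2, 2, 0, 0]

def update_centers(values, labels, k):
    """Return the average of the values assigned to each of the k groups."""
    centers = []
    for group in range(k):
        members = [v for v, label in zip(values, labels) if label == group]
        # Assumes every group has at least one member; an empty group
        # would need separate handling (for example, keeping its old center).
        centers.append(sum(members) / len(members))
    return centers

new_centers = update_centers(salaries, labels, 3)
print(new_centers)  # [7200.0, 17000.0, 25000.0]
```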

Step 6: Calculate the distances again using the new center point values, and reassign each person to the group with the lowest distance value.

Iteration 2:

	Distance from C1	Distance from C2	Distance from C3	Assigned Group
Person 1	12800	3000	5000	C2
Person 2	1200	11000	19000	C1
Person 3	200	10000	18000	C1
Person 4	3800	6000	14000	C1
Person 5	7800	2000	10000	C2
Person 6	11800	2000	6000	C2
Person 7	22800	13000	5000	C3
Person 8	17800	8000	0	C3
Person 9	3200	13000	21000	C1
Person 10	800	9000	17000	C1

Step 7: Repeat steps 5 and 6 until the group assignments in the current iteration are the same as in the previous iteration.

Iteration 3:

C1=AVG(Person 2, Person 3, Person 4, Person 9, Person 10) = 7200

C2=AVG(Person 1, Person 5, Person 6) = 18000

C3=AVG(Person 7, Person 8) = 27500

	Distance from C1	Distance from C2	Distance from C3	Assigned Group
Person 1	12800	2000	7500	C2
Person 2	1200	12000	21500	C1
Person 3	200	11000	20500	C1
Person 4	3800	7000	16500	C1
Person 5	7800	3000	12500	C2
Person 6	11800	1000	8500	C2
Person 7	22800	12000	2500	C3
Person 8	17800	7000	2500	C3
Person 9	3200	14000	23500	C1
Person 10	800	10000	19500	C1

Step 8: The iteration 2 and iteration 3 assignments are the same, so we can stop the process and conclude the clustering.
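Putting the steps together, the whole procedure can be sketched as one small loop. This is a minimal illustration for one-dimensional data, not a production implementation. The function name k_means_1d and the max_iterations safety limit are assumptions added for this sketch. Ties go to the lower-numbered center, so Person 1 joins C2 already in the first pass and the loop settles one iteration earlier than the hand-worked tables, with the same final groups.

```python
def k_means_1d(values, initial_centers, max_iterations=100):
    """Cluster 1-D values; return (labels, centers) once assignments stop changing."""
    centers = list(initial_centers)  # copy so the caller's list is not modified
    labels = None
    for _ in range(max_iterations):
        # Steps 3 and 4: assign every value to its closest center.
        new_labels = [
            min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            for v in values
        ]
        # Steps 7 and 8: stop when nothing changed since the previous pass.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 5: move each center to the average of its group.
        for group in range(len(centers)):
            members = [v for v, label in zip(values, labels) if label == group]
            if members:  # an empty group keeps its old center
                centers[group] = sum(members) / len(members)
    return labels, centers

salaries = [20000, 6000, 7000, 11000, 15000, 19000, 30000, 25000, 4000, 8000]
labels, centers = k_means_1d(salaries, [7000, 15000, 25000])
print(labels)   # [1, 0, 0, 0, 1, 1, 2, 2, 0, 0]
print(centers)  # [7200.0, 18000.0, 27500.0]
```

The label 0 stands for C1, 1 for C2 and 2 for C3, so the printed result matches the final assignment and the final center points of the walkthrough.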

Resulting clusters

Cluster C1

Person 2 (6000)
Person 3 (7000)
Person 4 (11000)
Person 9 (4000)
Person 10 (8000)

Cluster C2

Person 1 (20000)
Person 5 (15000)
Person 6 (19000)

Cluster C3

Person 7 (30000)
Person 8 (25000)

Kavin Duraisamy - Sharing my thoughts
