K-Means Algorithm
Introduction
K-Means is an unsupervised machine learning algorithm. It is one of the simplest algorithms for grouping data based on attribute values.
What is clustering?
Clustering is the process of grouping similar elements together.
As an example, we will take the salaries of a set of persons and group them as lower middle class, middle class and upper middle class.
Let's assume the following example data set (persons and their salaries).
Input data set
Person | Salary |
Person 1 | 20000 |
Person 2 | 6000 |
Person 3 | 7000 |
Person 4 | 11000 |
Person 5 | 15000 |
Person 6 | 19000 |
Person 7 | 30000 |
Person 8 | 25000 |
Person 9 | 4000 |
Person 10 | 8000 |
Persons with similar salaries can be grouped as shown below:
Lower Middle Class | Middle Class | Upper Middle Class |
Person 2 | Person 1 | Person 7 |
Person 3 | Person 4 | Person 8 |
Person 9 | Person 5 | |
Person 10 | Person 6 | |
Process of K-Means algorithm
The key concepts in the K-Means algorithm are the center points and the distance between each center point and each data value.
Let's take the above person salary example and cluster them using K-Means.
Step 1: In this example we need three groups, so let K=3 (K is the number of clusters required).
Step 2: Choose a center point value for every group. Here K is 3, so we need 3 center points. In the initial step, any of the data values can be taken as a center point.
Let's take the center points as C1=7000, C2=15000, C3=25000
Step 3: Calculate the distance (the absolute difference) between each center point and each data value
Person | Distance from C1 | Distance from C2 | Distance from C3 |
Person 1 | 13000 | 5000 | 5000 |
Person 2 | 1000 | 9000 | 19000 |
Person 3 | 0 | 8000 | 18000 |
Person 4 | 4000 | 4000 | 14000 |
Person 5 | 8000 | 0 | 10000 |
Person 6 | 12000 | 4000 | 6000 |
Person 7 | 23000 | 15000 | 5000 |
Person 8 | 18000 | 10000 | 0 |
Person 9 | 3000 | 11000 | 21000 |
Person 10 | 1000 | 7000 | 17000 |
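The distance table above can be reproduced with a few lines of Python. This is only a minimal sketch: the names `salaries`, `centers`, `distances` and `distance_table` are chosen here for illustration and are not part of the original walkthrough.

```python
# Salaries from the input data set, keyed by person number.
salaries = {1: 20000, 2: 6000, 3: 7000, 4: 11000, 5: 15000,
            6: 19000, 7: 30000, 8: 25000, 9: 4000, 10: 8000}

# Initial center points chosen in Step 2.
centers = {"C1": 7000, "C2": 15000, "C3": 25000}


def distances(salary, centers):
    """Return the absolute difference between one salary and every center."""
    return {name: abs(salary - center) for name, center in centers.items()}


# One row per person, one column per center point.
distance_table = {person: distances(salary, centers)
                  for person, salary in salaries.items()}

for person, row in distance_table.items():
    print(f"Person {person}: {row}")
```

Because the data has a single attribute, the distance is simply the absolute difference. With several attributes a measure such as Euclidean distance would typically take its place.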
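Step 4: Compare the distance values and assign each person to the group with the lowest distance value. When two distances are equal (as for Person 1 and Person 4 below), either of the tied groups can be chosen.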
Iteration 1:
Person | Distance from C1 | Distance from C2 | Distance from C3 | Assigned Group |
Person 1 | 13000 | 5000 | 5000 | C3 |
Person 2 | 1000 | 9000 | 19000 | C1 |
Person 3 | 0 | 8000 | 18000 | C1 |
Person 4 | 4000 | 4000 | 14000 | C1 |
Person 5 | 8000 | 0 | 10000 | C2 |
Person 6 | 12000 | 4000 | 6000 | C2 |
Person 7 | 23000 | 15000 | 5000 | C3 |
Person 8 | 18000 | 10000 | 0 | C3 |
Person 9 | 3000 | 11000 | 21000 | C1 |
Person 10 | 1000 | 7000 | 17000 | C1 |
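The assignment step can be sketched in Python as follows. The helper name `nearest_center` is an assumption made for this illustration. Note one difference from the table: `min()` returns the first minimum it finds, so a tie goes to the center listed first. Person 1 (tied between C2 and C3) therefore lands in C2 here, while the table places that person in C3. Both choices are valid for a tie.

```python
# Salaries from the input data set, keyed by person number.
salaries = {1: 20000, 2: 6000, 3: 7000, 4: 11000, 5: 15000,
            6: 19000, 7: 30000, 8: 25000, 9: 4000, 10: 8000}

# Initial center points chosen in Step 2.
centers = {"C1": 7000, "C2": 15000, "C3": 25000}


def nearest_center(salary, centers):
    """Return the name of the center with the smallest absolute distance.

    Ties are broken in favour of the center that appears first in `centers`.
    """
    return min(centers, key=lambda name: abs(salary - centers[name]))


assignments = {person: nearest_center(salary, centers)
               for person, salary in salaries.items()}

for person, group in assignments.items():
    print(f"Person {person} -> {group}")
```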
Step 5: Now it is time to adjust the center point values. The initial center points were picked arbitrarily from the data, so they are not necessarily the middle value of their groups. To make each center point the middle value of its group, compute the average of the values assigned to that group in the previous step and take that average as the new center point.
The new center point values are
C1=AVG(Person2,Person3,Person4,Person9,Person10) = 7200
C2=AVG(Person5,Person6) = 17000
C3=AVG(Person1,Person7,Person8) = 25000
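The averages above can be checked with a short Python sketch. The `groups` dictionary simply restates the Iteration 1 assignments from the table, and the variable names are illustrative assumptions.

```python
# Salaries from the input data set, keyed by person number.
salaries = {1: 20000, 2: 6000, 3: 7000, 4: 11000, 5: 15000,
            6: 19000, 7: 30000, 8: 25000, 9: 4000, 10: 8000}

# Group membership after Iteration 1, as listed in the table.
groups = {"C1": [2, 3, 4, 9, 10], "C2": [5, 6], "C3": [1, 7, 8]}

# The new center of each group is the mean salary of its members.
new_centers = {}
for name, members in groups.items():
    values = [salaries[person] for person in members]
    new_centers[name] = sum(values) / len(values)

print(new_centers)
```

C3 stays at 25000 only by coincidence: the mean of 20000, 30000 and 25000 happens to equal the initial center point.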
Step 6: Calculate the distances again using the new center point values, and reassign each person to the group with the lowest distance value.
Iteration 2:
Person | Distance from C1 | Distance from C2 | Distance from C3 | Assigned Group |
Person 1 | 12800 | 3000 | 5000 | C2 |
Person 2 | 1200 | 11000 | 19000 | C1 |
Person 3 | 200 | 10000 | 18000 | C1 |
Person 4 | 3800 | 6000 | 14000 | C1 |
Person 5 | 7800 | 2000 | 10000 | C2 |
Person 6 | 11800 | 2000 | 6000 | C2 |
Person 7 | 22800 | 13000 | 5000 | C3 |
Person 8 | 17800 | 8000 | 0 | C3 |
Person 9 | 3200 | 13000 | 21000 | C1 |
Person 10 | 800 | 9000 | 17000 | C1 |
Step 7: Repeat steps 5 and 6 until the groups assigned in the current iteration are the same as the groups assigned in the previous iteration.
Iteration 3:
C1=AVG(Person2,Person3,Person4,Person9,Person10) = 7200
C2=AVG(Person1,Person5,Person6) = 18000
C3=AVG(Person7,Person8) = 27500
Person | Distance from C1 | Distance from C2 | Distance from C3 | Assigned Group |
Person 1 | 12800 | 2000 | 7500 | C2 |
Person 2 | 1200 | 12000 | 21500 | C1 |
Person 3 | 200 | 11000 | 20500 | C1 |
Person 4 | 3800 | 7000 | 16500 | C1 |
Person 5 | 7800 | 3000 | 12500 | C2 |
Person 6 | 11800 | 1000 | 8500 | C2 |
Person 7 | 22800 | 12000 | 2500 | C3 |
Person 8 | 17800 | 7000 | 2500 | C3 |
Person 9 | 3200 | 14000 | 23500 | C1 |
Person 10 | 800 | 10000 | 19500 | C1 |
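Putting the steps together, the whole procedure can be sketched as one loop in Python. The function name `k_means_1d` and the `max_iterations` safety limit are assumptions for this illustration, not part of the walkthrough. Ties go to the lower-numbered center, so Person 1 joins C2 in the first pass already. The loop therefore settles one iteration earlier than the hand-worked tables, but on the same groups.

```python
def k_means_1d(values, initial_centers, max_iterations=100):
    """Cluster one-dimensional values; return (centers, assignment).

    assignment[i] is the index of the center assigned to values[i].
    """
    centers = list(initial_centers)
    assignment = None
    for _ in range(max_iterations):
        # Assignment step: pick the nearest center for every value.
        new_assignment = [
            min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            for v in values
        ]
        # Stop when the groups did not change since the last iteration.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Update step: move each center to the mean of its members.
        for i in range(len(centers)):
            members = [v for v, a in zip(values, assignment) if a == i]
            if members:  # an empty group keeps its previous center
                centers[i] = sum(members) / len(members)
    return centers, assignment


# Salaries of Person 1 to Person 10, in order.
salaries = [20000, 6000, 7000, 11000, 15000, 19000, 30000, 25000, 4000, 8000]
centers, assignment = k_means_1d(salaries, [7000, 15000, 25000])

for person, group in enumerate(assignment, start=1):
    print(f"Person {person} -> C{group + 1}")
print("Final centers:", centers)
```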
Resulting clusters
Cluster C1
- Person 2 (6000)
- Person 3 (7000)
- Person 4 (11000)
- Person 9 (4000)
- Person 10 (8000)
Cluster C2
- Person 1 (20000)
- Person 5 (15000)
- Person 6 (19000)
Cluster C3
- Person 7 (30000)
- Person 8 (25000)