
K-Means (Simplest Clustering Algorithm)

K-Means Algorithm

Introduction

K-Means is an unsupervised machine learning algorithm. It is one of the simplest algorithms for grouping data based on attribute values.

What is clustering?

Clustering is the process of grouping similar elements together.
As an example, we will take the salaries of a set of persons and group them into lower middle class, middle class and upper middle class.

Let's assume the following example data set (persons and their salaries).

Input data set

Person       Salary
Person 1     20000
Person 2     6000
Person 3     7000
Person 4     11000
Person 5     15000
Person 6     19000
Person 7     30000
Person 8     25000
Person 9     4000
Person 10    8000

Persons with similar salaries can be grouped as shown below.

Lower Middle Class    Middle Class    Upper Middle Class
Person 2              Person 1        Person 7
Person 3              Person 4        Person 8
Person 9              Person 5
Person 10             Person 6



Process of K-Means algorithm

The key concepts in the K-Means algorithm are the center points of the groups and the distance of each data value from those center points.
Let's take the person salary example above and cluster it using K-Means.

Step 1: In this example we need three groups, so let K=3 (K is the number of clusters we require).

Step 2: Choose a center point value for every group. Here K is 3, so we need 3 center points. In the initial step, we can take any of the data values as center points.

Let's take the center points as C1=7000, C2=15000, C3=25000.
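As a minimal sketch of Step 2 in Python, the snippet below picks K distinct salaries at random as starting center points. The names `salaries` and `pick_initial_centers`, and the fixed seed, are assumptions made for illustration only; the walkthrough itself uses the hand-picked values 7000, 15000 and 25000.

```python
import random

# Salaries of Person 1 .. Person 10 from the input data set.
salaries = [20000, 6000, 7000, 11000, 15000, 19000, 30000, 25000, 4000, 8000]


def pick_initial_centers(values, k, seed=None):
    """Return k distinct data values, sorted, to use as starting centers."""
    rng = random.Random(seed)
    return sorted(rng.sample(values, k))


# Random starting centers (hypothetical seed, only so reruns are repeatable).
random_centers = pick_initial_centers(salaries, k=3, seed=42)

# The centers chosen by hand in this walkthrough.
centers = [7000, 15000, 25000]
```

Different starting centers can lead to a different number of iterations, so the rest of the walkthrough sticks to the hand-picked values.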

Step 3: Calculate the distance (absolute difference) between each center point and each data value.



Person       Distance from C1    Distance from C2    Distance from C3
Person 1     13000               5000                5000
Person 2     1000                9000                19000
Person 3     0                   8000                18000
Person 4     4000                4000                14000
Person 5     8000                0                   10000
Person 6     12000               4000                6000
Person 7     23000               15000               5000
Person 8     18000               10000               0
Person 9     3000                11000               21000
Person 10    1000                7000                17000
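The distance table can be reproduced with a few lines of Python. This is only a sketch: the dictionary `salaries` and the helper `distances_to_centers` are illustrative names, and the distance is taken to be the absolute difference, as in the table.

```python
# Input data set: person -> salary.
salaries = {
    "Person 1": 20000, "Person 2": 6000, "Person 3": 7000,
    "Person 4": 11000, "Person 5": 15000, "Person 6": 19000,
    "Person 7": 30000, "Person 8": 25000, "Person 9": 4000,
    "Person 10": 8000,
}

# Initial center points C1, C2, C3 from Step 2.
centers = [7000, 15000, 25000]


def distances_to_centers(value, centers):
    """Absolute difference between one value and every center."""
    return [abs(value - c) for c in centers]


# One row per person: [distance from C1, distance from C2, distance from C3].
distance_table = {
    name: distances_to_centers(salary, centers)
    for name, salary in salaries.items()
}

for name, row in distance_table.items():
    print(name, row)
```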

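Step 4: Compare the distance values and assign each person to the group with the lowest distance value. When two distances are equal (Person 1 and Person 4 below), either of the tied groups can be chosen.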

Iteration 1:


Person       Distance from C1    Distance from C2    Distance from C3    Assigned Group
Person 1     13000               5000                5000                C3
Person 2     1000                9000                19000               C1
Person 3     0                   8000                18000               C1
Person 4     4000                4000                14000               C1
Person 5     8000                0                   10000               C2
Person 6     12000               4000                6000                C2
Person 7     23000               15000               5000                C3
Person 8     18000               10000               0                   C3
Person 9     3000                11000               21000               C1
Person 10    1000                7000                17000               C1
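The assignment step can be sketched in Python as below. The helper name `nearest_center` and its tie rule are assumptions: on a tie it picks the lowest-numbered group. The table above resolves its two ties by hand (Person 4 to C1, Person 1 to C3), so this sketch agrees with the table for every person except Person 1, which it places in C2.

```python
# Input data set: person -> salary.
salaries = {
    "Person 1": 20000, "Person 2": 6000, "Person 3": 7000,
    "Person 4": 11000, "Person 5": 15000, "Person 6": 19000,
    "Person 7": 30000, "Person 8": 25000, "Person 9": 4000,
    "Person 10": 8000,
}

# Initial center points C1, C2, C3.
centers = [7000, 15000, 25000]


def nearest_center(value, centers):
    """Index of the closest center; on a tie the lowest index wins."""
    dists = [abs(value - c) for c in centers]
    return dists.index(min(dists))


# Person -> group label ("C1", "C2" or "C3").
assignments = {
    name: f"C{nearest_center(salary, centers) + 1}"
    for name, salary in salaries.items()
}
```

Either choice for a tied value is a valid nearest group; the tie rule only affects which intermediate assignments are seen along the way.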

Step 5: Now it is time to adjust the center point values. Initially we picked the center point values arbitrarily from the data, so they will not be the exact middle values of their groups. To make each center point the middle value of its group, we take the average of the values assigned to that group in the previous step and use that average as the new center point.

The new center point values are:

C1=AVG(Person 2, Person 3, Person 4, Person 9, Person 10) = 7200
C2=AVG(Person 5, Person 6) = 17000
C3=AVG(Person 1, Person 7, Person 8) = 25000
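The averages above can be checked with a short Python sketch. The `groups` dictionary simply restates the iteration 1 assignments, with Person 1, Person 7 and Person 8 forming the third group (C3); the variable names are illustrative.

```python
# Input data set: person -> salary.
salaries = {
    "Person 1": 20000, "Person 2": 6000, "Person 3": 7000,
    "Person 4": 11000, "Person 5": 15000, "Person 6": 19000,
    "Person 7": 30000, "Person 8": 25000, "Person 9": 4000,
    "Person 10": 8000,
}

# Group membership after iteration 1.
groups = {
    "C1": ["Person 2", "Person 3", "Person 4", "Person 9", "Person 10"],
    "C2": ["Person 5", "Person 6"],
    "C3": ["Person 1", "Person 7", "Person 8"],
}

# New center of each group = average salary of its members.
new_centers = {
    label: sum(salaries[p] for p in members) / len(members)
    for label, members in groups.items()
}
```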

Step 6: Calculate the distances again using the new center point values, and reassign each person to the group with the lowest distance.

Iteration 2:


Person       Distance from C1    Distance from C2    Distance from C3    Assigned Group
Person 1     12800               3000                5000                C2
Person 2     1200                11000               19000               C1
Person 3     200                 10000               18000               C1
Person 4     3800                6000                14000               C1
Person 5     7800                2000                10000               C2
Person 6     11800               2000                6000                C2
Person 7     22800               13000               5000                C3
Person 8     17800               8000                0                   C3
Person 9     3200                13000               21000               C1
Person 10    800                 9000                17000               C1

Step 7: Repeat steps 5 and 6 until the group assignments of the current iteration are the same as those of the previous iteration.

Iteration 3:

C1=AVG(Person 2, Person 3, Person 4, Person 9, Person 10) = 7200
C2=AVG(Person 1, Person 5, Person 6) = 18000
C3=AVG(Person 7, Person 8) = 27500


Person       Distance from C1    Distance from C2    Distance from C3    Assigned Group
Person 1     12800               2000                7500                C2
Person 2     1200                12000               21500               C1
Person 3     200                 11000               20500               C1
Person 4     3800                7000                16500               C1
Person 5     7800                3000                12500               C2
Person 6     11800               1000                8500                C2
Person 7     22800               12000               2500                C3
Person 8     17800               7000                2500                C3
Person 9     3200                14000               23500               C1
Person 10    800                 10000               19500               C1





Step 8: The iteration 2 and iteration 3 assignments are the same, so we can stop the process and conclude the clustering.

Resulting clusters

Cluster C1
  • Person 2  (6000)
  • Person 3  (7000)
  • Person 4  (11000)
  • Person 9  (4000)
  • Person 10 (8000)
Cluster C2
  • Person 1  (20000)
  • Person 5  (15000)
  • Person 6  (19000)
Cluster C3
  • Person 7  (30000)
  • Person 8  (25000)
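Putting the steps together, the whole procedure can be sketched as one small Python function. This is an illustrative sketch for one-dimensional data, not a general library implementation: the name `k_means_1d`, the `max_iterations` safety limit and the tie rule (lowest-numbered group wins) are all assumptions. Because of that tie rule, the intermediate assignments can differ from the hand-worked tables (Person 1 goes to C2 in the first pass), but for this data set the final clusters are the same as those listed above.

```python
def k_means_1d(values, centers, max_iterations=100):
    """Cluster 1-D values; returns (final centers, group index per value)."""
    centers = list(centers)
    previous = None
    assigned = []
    for _ in range(max_iterations):
        # Assignment step: nearest center, lowest index wins ties.
        assigned = []
        for v in values:
            dists = [abs(v - c) for c in centers]
            assigned.append(dists.index(min(dists)))
        # Stop when the assignments match the previous iteration.
        if assigned == previous:
            break
        previous = assigned
        # Update step: each center becomes the average of its group.
        # A center with no members keeps its old value.
        for i in range(len(centers)):
            members = [v for v, g in zip(values, assigned) if g == i]
            if members:
                centers[i] = sum(members) / len(members)
    return centers, assigned


# Salaries of Person 1 .. Person 10 and the initial centers from Step 2.
salaries = [20000, 6000, 7000, 11000, 15000, 19000, 30000, 25000, 4000, 8000]
final_centers, groups = k_means_1d(salaries, [7000, 15000, 25000])

# Group label -> list of persons in that group.
clusters = {
    f"C{i + 1}": [f"Person {n + 1}" for n, g in enumerate(groups) if g == i]
    for i in range(len(final_centers))
}
print(final_centers)
print(clusters)
```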
