Implementation of the K-Means clustering algorithm with pandas and NumPy

K-Means is a clustering algorithm that seeks natural groupings in unlabeled data. This project is a naive implementation that mimics how K-Means works.

How it works

1. K centroids are initialized randomly from the data, where K is the number of clusters you want to use.
2. The Euclidean distance between each centroid and every record in the dataframe is calculated.
3. Each datapoint is assigned to the cluster of its nearest centroid.
4. Each centroid is recomputed as the mean, over every variable in the data, of the datapoints in its cluster.
5. Steps 2-4 are repeated until the centroids no longer change (convergence) or the maximum number of iterations is reached.
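
The steps above can be sketched in plain NumPy as follows. This is a minimal illustration and not the project's NaiveKMeans class: the function name naive_kmeans, its parameters (max_iter, seed), and the choice to leave an empty cluster's centroid where it was are all assumptions made for the example.

```python
import numpy as np


def naive_kmeans(X, k, max_iter=100, seed=0):
    """Cluster the rows of X into k groups; returns (labels, centroids, n_iterations)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for n_iterations in range(1, max_iter + 1):
        # Step 2: Euclidean distance from every point to every centroid, shape (n, k).
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        # Step 3: assign each point to its nearest centroid.
        labels = distances.argmin(axis=1)
        # Step 4: move each centroid to the mean of its cluster
        # (assumption: an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 5: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids, n_iterations
```

With a dataframe of numeric columns, this could be called as naive_kmeans(df.to_numpy(), k=2).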

How to use it

    kmeans = NaiveKMeans(df=dataframe, k=2)
    clusters = kmeans.predict()
    n_iterations = kmeans.n_iterations
    final_centroids = kmeans.centroids

The predict method returns the columns of the original dataframe together with a clusters column, which holds the cluster each datapoint is assigned to.

Selecting the best K

We use the elbow curve to determine the optimal number of clusters in our data.
First calculate the inertia, which measures how far the datapoints assigned to a cluster are from that cluster's centroid.
Inertia keeps falling as clusters are added, so we want the fewest clusters beyond which it stops dropping sharply.
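
As a rough illustration of what inertia measures, the sketch below uses the common definition: the sum of squared Euclidean distances from each datapoint to its assigned centroid. The standalone function and its arguments are assumptions for the example, and the project's own inertia() method may compute it differently.

```python
import numpy as np


def inertia(X, labels, centroids):
    """Sum of squared Euclidean distances from each point to its assigned centroid."""
    X = np.asarray(X, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    labels = np.asarray(labels)
    # centroids[labels] lines up each point with the centroid of its own cluster.
    diffs = X - centroids[labels]
    return float((diffs ** 2).sum())
```

Tighter clusters give a smaller value, and the value reaches zero when every datapoint sits exactly on its centroid.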

    inertias = []
    # Fit the model for k = 2..10 and record the inertia of each fit
    for k in range(2, 11):
        km = NaiveKMeans(df=df, k=k)
        km.predict()
        inertia = km.inertia()
        inertias.append(inertia)

Plot the inertia against the number of clusters; the point where the elbow begins to form should be the optimal number of clusters.
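
One way to draw such a plot is sketched below with matplotlib. The helper name plot_elbow and the axis labels are assumptions for illustration, and the project's notebook may plot the curve differently.

```python
import matplotlib.pyplot as plt


def plot_elbow(ks, inertias):
    """Plot inertia against the number of clusters and return the figure."""
    fig, ax = plt.subplots()
    ax.plot(list(ks), list(inertias), marker="o")
    ax.set_xlabel("Number of clusters (k)")
    ax.set_ylabel("Inertia")
    ax.set_title("Elbow curve")
    return fig
```

With the inertias list collected for k from 2 to 10, calling plot_elbow(range(2, 11), inertias).savefig("elbow.png") would write the curve to a file.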

In our elbow plot for this data, the best number of clusters is k=5.
In some real-world scenarios, the number of clusters may already be predefined.

Limitations

The algorithm is not optimized for large datasets.

Contributions

Changes can be made to further improve the algorithm.
Contributions are very much welcome.
