
Commit 9c61a5a

Merge pull request #37 from PyDataBlog/experimental
Experimental
2 parents cb79211 + 4866573 commit 9c61a5a

28 files changed (+2664, −1620 lines)

.github/workflows/CompatHelper.yml (+1, −1)

@@ -13,7 +13,7 @@ jobs:
     steps:
       - uses: julia-actions/setup-julia@latest
         with:
-          version: 1.3
+          version: 1.4
       - name: Pkg.add("CompatHelper")
        run: julia -e 'using Pkg; Pkg.add("CompatHelper")'
      - name: CompatHelper.main()

.github/workflows/benchmarks.yml (+1, −1)

@@ -10,7 +10,7 @@ jobs:
       - uses: actions/checkout@v2
       - uses: julia-actions/setup-julia@latest
         with:
-          version: 1.3
+          version: 1.4
       - name: Install dependencies
        run: julia -e 'using Pkg; pkg"add PkgBenchmark Distances StatsBase BenchmarkTools [email protected]"'
      - name: Run benchmarks

.travis.yml (+2, −1)

@@ -5,6 +5,7 @@ os:
   - osx
 julia:
   - 1.3
+  - 1.4
   - nightly
 after_success:
   - julia -e 'using Pkg; Pkg.add("Coverage"); using Coverage; Coveralls.submit(process_folder())'
@@ -14,7 +15,7 @@ jobs:
   fast_finish: true
   include:
     - stage: Documentation
-      julia: 1.3
+      julia: 1.4
       script: julia --project=docs -e '
         using Pkg;
         Pkg.develop(PackageSpec(path=pwd()));

Project.toml (+2, −1)

@@ -13,6 +13,7 @@ julia = "1.3"
 [extras]
 Random = "9a3f8284-a2c9-5f02-9a11-845980a1fd5c"
 Test = "8dfed614-e22c-5e08-85e1-65c5234f0b40"
+Suppressor = "fd094767-a336-5f1f-9728-57cf17d0bbfb"
 
 [targets]
-test = ["Test", "Random"]
+test = ["Test", "Random", "Suppressor"]

README.md (+10, −54)

@@ -10,39 +10,32 @@
 _________________________________________________________________________________________________________
 
 ## Table Of Content
-
-1. [Motivation](#Motivatiion)
+1. [Documentation](#Documentation)
 2. [Installation](#Installation)
 3. [Features](#Features)
-4. [Benchmarks](#Benchmarks)
-5. [Pending Features](#Pending-Features)
-6. [How To Use](#How-To-Use)
-7. [Release History](#Release-History)
-8. [How To Contribute](#How-To-Contribute)
-9. [Credits](#Credits)
-10. [License](#License)
+4. [License](#License)
 
 _________________________________________________________________________________________________________
 
-### Motivation
-It's a funny story actually led to the development of this package.
-What started off as a personal toy project trying to re-construct the K-Means algorithm in native Julia blew up after into a heated discussion on the Julia Discourse forums after I asked for Julia optimizaition tips. Long story short, Julia community is an amazing one! Andrey Oskin offered his help and together, we decided to push the speed limits of Julia with a parallel implementation of the most famous clustering algorithm. The initial results were mind blowing so we have decided to tidy up the implementation and share with the world.
+### Documentation
+- Stable Documentation: [![Stable](https://img.shields.io/badge/docs-stable-blue.svg)](https://PyDataBlog.github.io/ParallelKMeans.jl/stable)
+
+- Experimental Documentation: [![Dev](https://img.shields.io/badge/docs-dev-blue.svg)](https://PyDataBlog.github.io/ParallelKMeans.jl/dev)
 
-Say hello to our baby, `ParallelKMeans`!
 _________________________________________________________________________________________________________
 
 ### Installation
 You can grab the latest stable version of this package by simply running in Julia.
 Don't forget to Julia's package manager with `]`
 
 ```julia
-pkg> add TextAnalysis
+pkg> add ParallelKMeans
 ```
 
 For the few (and selected) brave ones, one can simply grab the current experimental features by simply adding the experimental branch to your development environment after invoking the package manager with `]`:
 
 ```julia
-dev git@github.com:PyDataBlog/ParallelKMeans.jl.git
+pkg> dev git@github.com:PyDataBlog/ParallelKMeans.jl.git
 ```
 
 Don't forget to checkout the experimental branch and you are good to go with bleeding edge features and breaks!
@@ -54,46 +47,9 @@
 ### Features
 
 - Lightening fast implementation of Kmeans clustering algorithm even on a single thread in native Julia.
-- Support for multi-theading implementation of Kmeans clustering algorithm.
+- Support for multi-theading implementation of K-Means clustering algorithm.
 - Kmeans++ initialization for faster and better convergence.
-- Modified version of Elkan's Triangle inequality to speed up K-Means algorithm.
-
-_________________________________________________________________________________________________________
-
-### Benchmarks
-
-_________________________________________________________________________________________________________
-
-### Pending Features
-- [X] Implementation of Triangle inequality based on [Elkan C. (2003) "Using the Triangle Inequality to Accelerate
-K-Means"](https://www.aaai.org/Papers/ICML/2003/ICML03-022.pdf)
-- [ ] Support for DataFrame inputs.
-- [ ] Refactoring and finalizaiton of API desgin.
-- [ ] GPU support.
-- [ ] Even faster Kmeans implementation based on current literature.
-- [ ] Optimization of code base.
-
-_________________________________________________________________________________________________________
-
-### How To Use
-
-```Julia
-
-```
-
-_________________________________________________________________________________________________________
-
-### Release History
-
-- 0.1.0 Initial release
-
-_________________________________________________________________________________________________________
-
-### How To Contribue
-
-_________________________________________________________________________________________________________
-
-### Credits
+- Implementation of all the variants of the K-Means algorithm.
 
 _________________________________________________________________________________________________________
 

benchmark/bench01_distance.jl (−6)

@@ -17,12 +17,6 @@ centroids = rand(10, 2)
 d = Vector{Float64}(undef, 100_000)
 suite["100kx10"] = @benchmarkable ParallelKMeans.colwise!($d, $X, $centroids)
 
-# for reference
-metric = SqEuclidean()
-#suite["100kx10_distances"] = @benchmarkable Distances.colwise!($d, $metric, $X, $centroids)
-dist = Distances.pairwise(metric, X, centroids, dims = 2)
-min = minimum(dist, dims=2)
-suite["100kx10_distances"] = @benchmarkable $d = min
 end # module
 
 BenchDistance.suite

benchmark/bench02_kmeans.jl (+28)

@@ -0,0 +1,28 @@
+module BenchKMeans
+using Random
+using ParallelKMeans
+using BenchmarkTools
+
+suite = BenchmarkGroup()
+
+Random.seed!(2020)
+X = rand(10, 100_000)
+
+centroids3 = ParallelKMeans.smart_init(X, 3, 1, init="kmeans++").centroids
+centroids10 = ParallelKMeans.smart_init(X, 10, 1, init="kmeans++").centroids
+
+suite["10x100_000x3x1 Lloyd"] = @benchmarkable kmeans($X, 3, init = $centroids3, n_threads = 1, verbose = false, tol = 1e-6, max_iters = 1000)
+suite["10x100_000x3x1 Hammerly"] = @benchmarkable kmeans(Hamerly(), $X, 3, init = $centroids3, n_threads = 1, verbose = false, tol = 1e-6, max_iters = 1000)
+
+suite["10x100_000x3x2 Lloyd"] = @benchmarkable kmeans($X, 3, init = $centroids3, n_threads = 2, verbose = false, tol = 1e-6, max_iters = 1000)
+suite["10x100_000x3x2 Hammerly"] = @benchmarkable kmeans(Hamerly(), $X, 3, init = $centroids3, n_threads = 2, verbose = false, tol = 1e-6, max_iters = 1000)
+
+suite["10x100_000x10x1 Lloyd"] = @benchmarkable kmeans($X, 10, init = $centroids10, n_threads = 1, verbose = false, tol = 1e-6, max_iters = 1000)
+suite["10x100_000x10x1 Hammerly"] = @benchmarkable kmeans(Hamerly(), $X, 10, init = $centroids10, n_threads = 1, verbose = false, tol = 1e-6, max_iters = 1000)
+
+suite["10x100_000x10x2 Lloyd"] = @benchmarkable kmeans($X, 10, init = $centroids10, n_threads = 2, verbose = false, tol = 1e-6, max_iters = 1000)
+suite["10x100_000x10x2 Hammerly"] = @benchmarkable kmeans(Hamerly(), $X, 10, init = $centroids10, n_threads = 2, verbose = false, tol = 1e-6, max_iters = 1000)
+
+end # module
+
+BenchKMeans.suite
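Since the new benchmark file ends with `BenchKMeans.suite`, including it returns a `BenchmarkGroup` that plugs straight into the standard BenchmarkTools workflow. A minimal sketch of how such a suite is typically executed (the file path is taken from this diff; `tune!`, `run`, and `median` are stock BenchmarkTools functions, and this assumes ParallelKMeans and its dependencies are installed):

```julia
using BenchmarkTools

# The benchmark file's last expression is `BenchKMeans.suite`,
# so `include` hands the group back to us.
suite = include("benchmark/bench02_kmeans.jl")

# Find reasonable sample/evaluation parameters, then run every entry.
tune!(suite)
results = run(suite, verbose = true)

# Inspect one entry, e.g. the single-threaded Lloyd run with k = 3.
display(median(results["10x100_000x3x1 Lloyd"]))
```

PkgBenchmark (installed by the updated workflow above) can also drive this same suite across commits for regression tracking.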

docs/src/benchmark_image.png (binary file added, 532 KB)

docs/src/index.md (+168, −1)

@@ -1,17 +1,184 @@
-# ParallelKMeans.jl Documentation
+# ParallelKMeans.jl Package
 
 ```@contents
+Depth = 4
 ```
 
+## Motivation
+It's actually a funny story led to the development of this package.
+What started off as a personal toy project trying to re-construct the K-Means algorithm in native Julia blew up after a heated discussion on the Julia Discourse forum when I asked for Julia optimizaition tips. Long story short, Julia community is an amazing one! Andrey offered his help and together, we decided to push the speed limits of Julia with a parallel implementation of the most famous clustering algorithm. The initial results were mind blowing so we have decided to tidy up the implementation and share with the world as a maintained Julia pacakge.
+
+Say hello to `ParallelKMeans`!
+
+This package aims to utilize the speed of Julia and parallelization (both CPU & GPU) to offer an extremely fast implementation of the K-Means clustering algorithm and its variations via a friendly interface for practioners.
+
+In short, we hope this package will eventually mature as the "one stop" shop for everything KMeans on both CPUs and GPUs.
+
+## K-Means Algorithm Implementation Notes
+Since Julia is a column major language, the input (design matrix) expected by the package in the following format;
+
+- Design matrix X of size n×m, the i-th column of X `(X[:, i])` is a single data point in n-dimensional space.
+- Thus, the rows of the design design matrix represents the feature space with the columns representing all the training examples in this feature space.
+
+One of the pitfalls of K-Means algorithm is that it can fall into a local minima.
+This implementation inherits this problem like every implementation does.
+As a result, it is useful in practice to restart it several times to get the correct results.
+
 ## Installation
+You can grab the latest stable version of this package from Julia registries by simply running;
 
+*NB:* Don't forget to Julia's package manager with `]`
+
+```julia
+pkg> add ParallelKMeans
+```
+
+For the few (and selected) brave ones, one can simply grab the current experimental features by simply adding the experimental branch to your development environment after invoking the package manager with `]`:
+
+```julia
+dev git@github.com:PyDataBlog/ParallelKMeans.jl.git
+```
+
+Don't forget to checkout the experimental branch and you are good to go with bleeding edge features and breaks!
+```bash
+git checkout experimental
+```
 
 ## Features
+- Lightening fast implementation of Kmeans clustering algorithm even on a single thread in native Julia.
+- Support for multi-theading implementation of Kmeans clustering algorithm.
+- 'Kmeans++' initialization for faster and better convergence.
+- Modified version of Elkan's Triangle inequality to speed up K-Means algorithm.
+
+
+## Pending Features
+- [X] Implementation of [Hamerly implementation](https://www.researchgate.net/publication/220906984_Making_k-means_Even_Faster).
+- [ ] Full Implementation of Triangle inequality based on [Elkan - 2003 Using the Triangle Inequality to Accelerate K-Means"](https://www.aaai.org/Papers/ICML/2003/ICML03-022.pdf).
+- [ ] Implementation of [Geometric methods to accelerate k-means algorithm](http://cs.baylor.edu/~hamerly/papers/sdm2016_rysavy_hamerly.pdf).
+- [ ] Support for DataFrame inputs.
+- [ ] Refactoring and finalizaiton of API desgin.
+- [ ] GPU support.
+- [ ] Even faster Kmeans implementation based on current literature.
+- [ ] Optimization of code base.
+- [ ] Improved Documentation
+- [ ] More benchmark tests
 
 
 ## How To Use
+Taking advantage of Julia's brilliant multiple dispatch system, the package exposes users to a very easy to use API.
+
+```julia
+using ParallelKMeans
+
+# Uses all available CPU cores by default
+multi_results = kmeans(X, 3; max_iters=300)
+
+# Use only 1 core of CPU
+results = kmeans(X, 3; n_threads=1, max_iters=300)
+```
+
+The main design goal is to offer all available variations of the KMeans algorithm to end users as composable elements. By default, Lloyd's implementation is used but users can specify different variations of the KMeans clustering algorithm via this interface
+
+```julia
+some_results = kmeans([algo], input_matrix, k; kwargs)
+
+# example
+r = kmeans(Lloyd(), X, 3) # same result as the default
+```
+
+```julia
+# r contains all the learned artifacts which can be accessed as;
+r.centers # cluster centers (d x k)
+r.assignments # label assignments (n)
+r.totalcost # total cost (i.e. objective)
+r.iterations # number of elapsed iterations
+r.converged # whether the procedure converged
+```
+
+### Supported KMeans algorithm variations.
+- [Lloyd()](https://cs.nyu.edu/~roweis/csc2515-2006/readings/lloyd57.pdf)
+- [Hamerly()](https://www.researchgate.net/publication/220906984_Making_k-means_Even_Faster)
+- [Geometric()](http://cs.baylor.edu/~hamerly/papers/sdm2016_rysavy_hamerly.pdf) - (Coming soon)
+- [Elkan()](https://www.aaai.org/Papers/ICML/2003/ICML03-022.pdf) - (Coming soon)
+- [MiniBatch()](https://www.eecs.tufts.edu/~dsculley/papers/fastkmeans.pdf) - (Coming soon)
+
+
+### Practical Usage Examples
+Some of the common usage examples of this package are as follows:
+
+#### Clustering With A Desired Number Of Groups
+
+```julia
+using ParallelKMeans, RDatasets, Plots
+
+# load the data
+iris = dataset("datasets", "iris");
+
+# features to use for clustering
+features = collect(Matrix(iris[:, 1:4])');
+
+# various artificats can be accessed from the result ie assigned labels, cost value etc
+result = kmeans(features, 3);
+
+# plot with the point color mapped to the assigned cluster index
+scatter(iris.PetalLength, iris.PetalWidth, marker_z=result.assignments,
+        color=:lightrainbow, legend=false)
+
+```
+
+![Image description](iris_example.jpg)
+
+#### Elbow Method For The Selection Of optimal number of clusters
+```julia
+using ParallelKMeans
+
+# Single Thread Implementation of Lloyd's Algorithm
+b = [ParallelKMeans.kmeans(X, i, n_threads=1; tol=1e-6, max_iters=300, verbose=false).totalcost for i = 2:10]
+
+# Multi Thread Implementation of Lloyd's Algorithm by default
+c = [ParallelKMeans.kmeans(X, i; tol=1e-6, max_iters=300, verbose=false).totalcost for i = 2:10]
+
+```
+
+
+## Benchmarks
+Currently, this package is benchmarked against similar implementation in both Python and Julia. All reproducible benchmarks can be found in [ParallelKMeans/extras](https://github.com/PyDataBlog/ParallelKMeans.jl/tree/master/extras) directory. More tests in various languages are planned beyond the initial release version (`0.1.0`).
+
+*Note*: All benchmark tests are made on the same computer to help eliminate any bias.
+
+
+Currently, the benchmark speed tests are based on the search for optimal number of clusters using the [Elbow Method](https://en.wikipedia.org/wiki/Elbow_method_(clustering)) since this is a practical use case for most practioners employing the K-Means algorithm.
+
+
+### Benchmark Results
+
+![benchmark_image.png](benchmark_image.png)
+
+
+_________________________________________________________________________________________________________
+
+| 1 million (ms) | 100k (ms) | 10k (ms) | 1k (ms) | package                 | language |
+|:--------------:|:---------:|:--------:|:-------:|:-----------------------:|:--------:|
+| 600184.00      | 31959.00  | 832.25   | 18.19   | Clustering.jl           | Julia    |
+| 35733.00       | 4473.00   | 255.71   | 8.94    | Lloyd                   | Julia    |
+| 12617.00       | 1655.00   | 122.53   | 7.98    | Hamerly                 | Julia    |
+| 1430000.00     | 146000.00 | 5770.00  | 344.00  | Sklearn Kmeans          | Python   |
+| 30100.00       | 3750.00   | 613.00   | 201.00  | Sklearn MiniBatchKmeans | Python   |
+| 218200.00      | 15510.00  | 733.70   | 19.47   | Knor                    | R        |
+
+_________________________________________________________________________________________________________
+
+
+## Release History
+- 0.1.0 Initial release
+
+
+## Contributing
+Ultimately, we see this package as potentially the one stop shop for everything related to KMeans algorithm and its speed up variants. We are open to new implementations and ideas from anyone interested in this project.
 
+Detailed contribution guidelines will be added in upcoming releases.
 
+<!--- Insert Contribution Guidelines Below --->
 
 ```@index
 ```
docs/src/iris_example.jpg (binary file added, 165 KB)
