SimSIMD 🪦 → NumKong v7 bringing 2'000+ SIMD APIs #220
Merged
This was linked to issues Feb 17, 2026
Contributor
Wow!!! Maybe there is some code not yet checked in ;-)
[ 46%] Building CXX object CMakeFiles/nk_test.dir/test/test_each.cpp.o
Alder Lake spatial kernels: ~11.5 GB/s. Skylake: ~5.7-6.2 GB/s. Haswell: ~4.5-4.8 GB/s.
This brings the current state of the project to 2'000 kernels including both SIMD and serial endpoints.
The MSVC 19.44 optimizer drops the {k} mask register operand, replacing masked
stores with unmasked ones and producing out-of-bounds writes for snippets like:
```
__m256i result = _mm512_cvtepi64_epi32(clamped);
_mm256_mask_storeu_epi32(to_ptr, mask, result);
```
The produced code looks like:
```
vpmovqd YMMWORD PTR [r9], zmm3
```
Instead of:
```
vpmovqd YMMWORD PTR [r9] {k1}, zmm3
```
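To make the failure mode concrete, here is a minimal scalar sketch in portable C of what the two instruction forms above do to memory. It is a model for illustration only, not the workaround used in this PR, and the names `masked_store_i32x8`, `unmasked_store_i32x8`, and `tail_mask_u8` are hypothetical helpers that do not come from the codebase.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of an 8-lane masked store such as _mm256_mask_storeu_epi32:
 * lane i is written only when bit i of `mask` is set. */
static void masked_store_i32x8(int32_t *dst, uint8_t mask, int32_t const src[8]) {
    assert(dst != NULL && src != NULL);
    for (size_t lane = 0; lane < 8; ++lane)
        if ((mask >> lane) & 1u) dst[lane] = src[lane];
}

/* Scalar model of the miscompiled form: the mask is ignored,
 * so all 8 lanes are written regardless of how many are valid. */
static void unmasked_store_i32x8(int32_t *dst, int32_t const src[8]) {
    assert(dst != NULL && src != NULL);
    for (size_t lane = 0; lane < 8; ++lane) dst[lane] = src[lane];
}

/* Hypothetical helper: a mask with the low `count` bits set,
 * as typically used for a tail of `count` <= 8 remaining elements. */
static uint8_t tail_mask_u8(size_t count) {
    assert(count <= 8);
    return (uint8_t)((1u << count) - 1u);
}
```

With a 3-element tail, `masked_store_i32x8(dst, tail_mask_u8(3), src)` touches only `dst[0..2]`, while the unmasked form writes all 8 lanes. If the destination buffer really holds only 3 elements, the remaining 5 writes land out of bounds, which is the corruption described above.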
Congratulations on this long time effort.
Owner
Author
Thank you! Started it in '23, opened the PR in late '24, and finally pushed it through the line with a few months of concentrated "stay-home" effort from the end of '25 till March '26. And there are 2 larger projects I still hope to ship this spring unless my 🧠...
This was referenced Mar 24, 2026
What started as a straightforward optimization request from the @albumentations-team (improving element-wise operations between equi-dimensional arrays) snowballed into the largest piece of open-source work I've done in years. 200K+ lines of SIMD across 2'000+ kernels:
- `u1`, `u4`, `i4` · mini-floats like `e2m3`, `e3m2`, `e4m3`, `e5m2` · half-precision floats like `f16`, `bf16` · classics like `i8`, `u8`, `f32`, `f64`, and their complex counterparts.

More importantly for downstream consumers like USearch (and by extension, a good chunk of modern DBMS, cloud, and AI products worldwide), this release is a massive hardening effort. Some operations may run 5x slower than BLAS implementations like Intel's MKL, but they are also 50-100x more accurate where it matters, employing numerically stable algorithms to avoid catastrophic cancellation in floating-point and to saturate integers correctly.
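As a rough illustration of the two ideas named above, the sketch below shows Neumaier's compensated summation (one well-known way to limit cancellation error) and a saturating `int8_t` addition. This thread does not say which algorithms NumKong actually uses, so treat both as assumptions. The function names `sum_compensated_f64` and `add_saturating_i8` are hypothetical and not part of the library's API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Neumaier's variant of Kahan summation: a running `compensation` term
 * collects the low-order bits that each floating-point addition discards. */
static double sum_compensated_f64(double const *values, size_t count) {
    assert(values != NULL || count == 0);
    double sum = 0.0, compensation = 0.0;
    for (size_t i = 0; i < count; ++i) {
        double const x = values[i];
        double const t = sum + x;
        double const abs_sum = sum < 0 ? -sum : sum;
        double const abs_x = x < 0 ? -x : x;
        /* Recover the rounding error from whichever operand is larger. */
        if (abs_sum >= abs_x) compensation += (sum - t) + x;
        else compensation += (x - t) + sum;
        sum = t;
    }
    return sum + compensation;
}

/* Saturating addition: widen, add, then clamp to the int8_t range
 * instead of letting the result wrap around. */
static int8_t add_saturating_i8(int8_t a, int8_t b) {
    int const wide = (int)a + (int)b;
    if (wide > INT8_MAX) return INT8_MAX;
    if (wide < INT8_MIN) return INT8_MIN;
    return (int8_t)wide;
}
```

For the input `{1e16, 1.0, -1e16}`, a plain left-to-right sum in strict IEEE 754 double arithmetic loses the `1.0` entirely, while the compensated version returns `1.0`. Likewise, `add_saturating_i8(100, 100)` clamps to `127` rather than wrapping to a negative value.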
A full rewrite 8x the size of the original project naturally came with breaking changes. Most notably, the project is renamed from "SimSIMD" to "NumKong", aligning it with StringZilla. Both projects now share even more DNA, from the choice of somewhat archaic design principles to the overall project organization. Among the subtler breaking changes:
- `flush_denormals` becomes `configure_thread` and now also enables AMX and SME.
- `DistancesTensor` in Python is replaced with `Tensor`.
- `simsimd_distance_t`.

And more! A detailed write-up with a transition and integration guide is coming soon, stay tuned.