SimSIMD 🪦 → NumKong 🦍 v7 bringing 2'000+ SIMD APIs #220

Merged
ashvardanian merged 1096 commits into main from main-elementwise
Mar 17, 2026

Conversation

@ashvardanian
Owner

@ashvardanian ashvardanian commented Oct 31, 2024

What started as a straightforward optimization request from the @albumentations-team, improving element-wise operations between equi-dimensional arrays, snowballed into the largest piece of open-source work I've done in years. 200K+ lines of SIMD across 2'000+ kernels:

  • targeting every major vector ISA, grouped by platform and shape: x86 AVX2 vectors on Haswell, Alder Lake, Sierra Forest · x86 AVX-512 vectors on Skylake, Ice Lake, Genoa, Sapphire Rapids, Turin · Intel's fixed-shape tiles on Sapphire Rapids, Granite Rapids · Arm's fixed-width vectors via NEON, NEONHalf, NEONBFDot, NEONSDot · Arm's variable-width vectors via SVE, SVEHalf, SVEBf16, SVE2 · Arm's variable-shape tiles via SME, SMEF64, SMEBI32 · RISC-V's variable-width vectors via RVV, RVVHalf, RVVBf16, RVVBB · WASM's fixed-width vectors via V128Relaxed,
  • going far beyond the original already non-trivial vector-vector angular distances: GEMM-like many-to-many dot-products and similarity measures · SYRK-like distances within a batch of vectors · point-to-point Haversine and Vincenty distances · mesh-to-mesh RMSD, Kabsch, & Umeyama distances in 3D · doc-to-doc MaxSim & MinHash scoring · vector-matrix-vector Bilinear forms and Mahalanobis distances · scalar arithmetics, trigonometry, reductions, and type-casting,
  • across a superset of numeric types found in modern CPUs and GPUs: bits and nibbles like u1, u4, i4 · mini-floats like e2m3, e3m2, e4m3, e5m2 · half-precision floats like f16, bf16 · classics like i8, u8, f32, f64, and their complex counterparts,
  • implemented in C99, fully exposed to C++23, Python 3, & Rust, with initial support for Swift, JS, and Go!

More importantly for downstream consumers like USearch (and by extension, a good chunk of modern DBMS, cloud, and AI products worldwide), this release is a massive hardening effort. Some operations may run 5x slower than BLAS implementations like Intel's MKL, but they are also 50–100x more accurate where it matters, employing numerically stable algorithms to avoid catastrophic cancellation in floating-point and to saturate integers correctly.
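The two hardening techniques named above can be illustrated with a minimal scalar sketch. This is not NumKong's code: the `example_*` names are hypothetical, and Neumaier's compensated summation is used here only as one well-known way to limit cancellation error; the thread does not say which algorithms the kernels actually use.

```c
#include <stddef.h>
#include <stdint.h>

/* Plain left-to-right summation: small terms vanish next to huge ones. */
static double example_sum_naive(double const *x, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i) sum += x[i];
    return sum;
}

/* Neumaier's compensated summation (an assumption for illustration):
   the rounding error of every addition is accumulated in `lost`
   and added back once at the end. */
static double example_sum_compensated(double const *x, size_t n) {
    double sum = 0.0, lost = 0.0;
    for (size_t i = 0; i < n; ++i) {
        double term = x[i];
        double next = sum + term;
        double abs_sum = sum < 0 ? -sum : sum;
        double abs_term = term < 0 ? -term : term;
        /* Recover the low-order bits dropped from the smaller operand. */
        if (abs_sum >= abs_term) lost += (sum - next) + term;
        else lost += (term - next) + sum;
        sum = next;
    }
    return sum + lost;
}

/* Saturating 8-bit addition: widen, add, then clamp instead of wrapping. */
static int8_t example_add_saturating_i8(int8_t a, int8_t b) {
    int sum = (int)a + (int)b;
    if (sum > INT8_MAX) return INT8_MAX;
    if (sum < INT8_MIN) return INT8_MIN;
    return (int8_t)sum;
}
```

For the input `{1.0, 1e100, 1.0, -1e100}` the naive loop returns `0.0`, while the compensated one returns the exact `2.0`. Likewise, `example_add_saturating_i8(100, 100)` yields `127` rather than wrapping around to a negative value.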

NumKong+StringZilla banner

A full rewrite 8x the size of the original project naturally came with breaking changes. Most notably, the project is renamed from "SimSIMD" to "NumKong", aligning it with StringZilla. Both projects now share even more DNA, from the choice of somewhat archaic design principles to the overall project organization. Among the subtler breaking changes:

  • The "cosine" distance is now called "angular" to avoid confusion with trigonometric element-wise functions.
  • flush_denormals becomes configure_thread and now also enables AMX and SME.
  • DistancesTensor in Python is replaced with Tensor.
  • All kernels now have distinct output types instead of uniformly returning simsimd_distance_t.
  • The ABI for element-wise operations changed to accept scaling factors by pointer.
  • 32-bit systems now use 32-bit integers for indexing...
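As a rough illustration of what "scaling factors by pointer" means for an element-wise kernel, here is a minimal sketch. The function name, argument order, and the weighted-sum operation are all hypothetical; the actual NumKong signatures are not shown in this thread.

```c
#include <stddef.h>

/* Hypothetical element-wise kernel: out[i] = alpha * a[i] + beta * b[i].
   The scaling factors are passed by pointer rather than by value. */
static void example_each_wsum_f32(float const *a, float const *b, size_t n,
                                  float const *alpha, float const *beta,
                                  float *out) {
    /* Load the scalars once, outside of the hot loop. */
    float const alpha_value = *alpha, beta_value = *beta;
    for (size_t i = 0; i < n; ++i)
        out[i] = alpha_value * a[i] + beta_value * b[i];
}
```

A caller that previously passed `2.0f` directly would now keep the factor in a variable and pass its address, for example `example_each_wsum_f32(a, b, n, &alpha, &beta, out)`.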

And more! A detailed write-up with a transition and integration guide is coming soon. Stay tuned 🤗

@ashvardanian ashvardanian changed the title Element-wise BLAS-like APIs Element-wise BLASAPIs & new Tensor for Python Nov 1, 2024
@ashvardanian ashvardanian changed the title Element-wise BLASAPIs & new Tensor for Python Element-wise BLAS APIs & new Tensor for Python Nov 1, 2024
@ashvardanian ashvardanian force-pushed the main-elementwise branch 3 times, most recently from 56f1a5d to 38df49c Compare November 8, 2024 15:33
@ashvardanian ashvardanian changed the title Element-wise BLAS APIs & new Tensor for Python Element-wise BLAS APIs & new Tensor for Python: ⬆️ 450 kernels Nov 9, 2024
@ashvardanian ashvardanian force-pushed the main-elementwise branch 3 times, most recently from ecb475a to e568e6c Compare November 11, 2024 17:58
@ashvardanian ashvardanian force-pushed the main-elementwise branch 2 times, most recently from fbb990f to 9ef0430 Compare December 21, 2025 18:29
@ashvardanian ashvardanian changed the title Element-wise BLAS APIs & new Tensor for Python: ⬆️ 450 kernels v7: ⬆️ 500 Kernels for Mixed Precision Numerics on CPUs Dec 27, 2025
@ashvardanian ashvardanian changed the title v7: ⬆️ 500 Kernels for Mixed Precision Numerics on CPUs v7: ⬆️ 1,000+ Kernels for Mixed Precision Numerics on CPUs Jan 25, 2026
@ashvardanian ashvardanian changed the title v7: ⬆️ 1,000+ Kernels for Mixed Precision Numerics on CPUs v7: ⬆️ 1,000+ Kernels for Mixed Precision Numerics on x86, Arm, and RISC-V CPUs Feb 6, 2026
This was referenced Feb 17, 2026
@ashvardanian ashvardanian linked an issue Feb 17, 2026 that may be closed by this pull request
3 tasks
@ashvardanian ashvardanian changed the title v7: ⬆️ 1,000+ Kernels for Mixed Precision Numerics on x86, Arm, and RISC-V CPUs v7: ⬆️ 1,500+ Kernels for Mixed Precision Numerics on x86, Arm, and RISC-V CPUs Feb 24, 2026
@markusalbertgraf
Contributor

Wow!!!
Trying to build on FreeBSD to check as mentioned in #300.

Maybe there is some code not yet checked in ;-)

[ 46%] Building CXX object CMakeFiles/nk_test.dir/test/test_each.cpp.o
In file included from c/SimSIMD/test/test_each.cpp:9:
c/SimSIMD/include/numkong/each.hpp:48:52: error: use of undeclared identifier 'saturating_add'
48 | for (std::size_t i = 0; i < d; i++) c[i] = saturating_add(a[i], b[i]);

@ashvardanian ashvardanian changed the title SimSIMD 🪦 → NumKong 🦍 v7 bringing 1,900+ SIMD APIs SimSIMD 🪦 → NumKong 🦍 v7 bringing 2'000+ SIMD APIs Mar 17, 2026
This brings the current state of the project to 2'000
kernels including both SIMD and serial endpoints.
The MSVC 19.44 optimizer drops the {k} mask-register operand, replacing masked stores
with unmasked ones and producing out-of-bounds writes for snippets like:
```
__m256i result = _mm512_cvtepi64_epi32(clamped);
_mm256_mask_storeu_epi32(to_ptr, mask, result);
```
The produced code looks like:
```
vpmovqd YMMWORD PTR [r9], zmm3
```
Instead of:
```
vpmovqd YMMWORD PTR [r9] {k1}, zmm3
```
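To make the consequence of the dropped `{k1}` operand concrete, here is a portable scalar model of an 8-lane masked 32-bit store. It is only a sketch of the intended semantics, not the project's workaround; the `example_*` names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar model of an 8-lane masked 32-bit store:
   lane i is written only when bit i of `mask` is set. */
static void example_mask_store_i32x8(int32_t *dst, uint8_t mask,
                                     int32_t const lanes[8]) {
    for (int i = 0; i < 8; ++i)
        if (mask & (1u << i)) dst[i] = lanes[i];
}

/* Mask selecting the first `count` lanes, for `count` in [0, 8]. */
static uint8_t example_tail_mask(size_t count) {
    return (uint8_t)((1u << count) - 1u);
}
```

With three elements left in the output buffer, the mask is `0x07` and only `dst[0..2]` may be touched. An unmasked store writes all eight lanes, 32 bytes in total, so the remaining five lanes land past the end of the buffer.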
@ashvardanian ashvardanian merged commit c7cdcda into main Mar 17, 2026
82 of 83 checks passed
@lin72h

lin72h commented Mar 17, 2026

Congratulations on this long time effort.

@ashvardanian
Owner Author

Thank you! Started it in '23, opened the PR in late '24, and finally pushed it through the line with a few months of concentrated "stay-home" effort from the end of '25 till March '26. And there are 2 larger projects I still hope to ship this spring unless my 🧠💥... 😅

