SimSIMD 🪦 → NumKong 🦍 v7 bringing 2'000+ SIMD APIs by ashvardanian · Pull Request #220 · ashvardanian/NumKong

ashvardanian · 2024-10-31T23:12:22Z

What started as a straightforward optimization request from the @albumentations-team — improving element-wise operations between equi-dimensional arrays — snowballed into the largest piece of open-source work I've done in years. 200K+ lines of SIMD across 2'000+ kernels:

- targeting every major vector ISA, grouped by platform and shape:
  - x86 AVX2 vectors on Haswell, Alder Lake, Sierra Forest
  - x86 AVX-512 vectors on Skylake, Ice Lake, Genoa, Sapphire Rapids, Turin
  - Intel's fixed-shape tiles on Sapphire Rapids, Granite Rapids
  - Arm's fixed-width vectors via NEON, NEONHalf, NEONBFDot, NEONSDot
  - Arm's variable-width vectors via SVE, SVEHalf, SVEBf16, SVE2
  - Arm's variable-shape tiles via SME, SMEF64, SMEBI32
  - RISC-V's variable-width vectors via RVV, RVVHalf, RVVBf16, RVVBB
  - WASM's fixed-width vectors via V128Relaxed
- going far beyond the original already non-trivial vector-vector angular distances:
  - GEMM-like many-to-many dot-products and similarity measures
  - SYRK-like distances within a batch of vectors
  - point-to-point Haversine and Vincenty distances
  - mesh-to-mesh RMSD, Kabsch, & Umeyama distances in 3D
  - doc-to-doc MaxSim & MinHash scoring
  - vector-matrix-vector Bilinear forms and Mahalanobis distances
  - scalar arithmetics, trigonometry, reductions, and type-casting
- across a superset of numeric types found in modern CPUs and GPUs:
  - bits and nibbles like u1, u4, i4
  - mini-floats like e2m3, e3m2, e4m3, e5m2
  - half-precision floats like f16, bf16
  - classics like i8, u8, f32, f64, and their complex counterparts
- implemented in C99, fully exposed to C++23, Python 3, & Rust, with initial support for Swift, JS, and Go!

More importantly for downstream consumers like USearch — and by extension, a good chunk of modern DBMS, cloud, and AI products worldwide — this release is a massive hardening effort. Some operations may run 5x slower than BLAS implementations like Intel's MKL, but they are also 50–100x more accurate where it matters, employing numerically stable algorithms to avoid catastrophic cancellation in floating-point and to saturate integers correctly.
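To make the accuracy trade-off concrete, here is a minimal Python sketch of catastrophic cancellation and one well-known remedy, Neumaier's compensated summation. The description above does not say which stable algorithms the kernels actually use, so the technique and the function names `naive_sum` and `neumaier_sum` are illustrative assumptions, not NumKong APIs.

```python
import math


def naive_sum(values):
    """Plain left-to-right accumulation; small terms are lost next to huge ones."""
    total = 0.0
    for v in values:
        total += v
    return total


def neumaier_sum(values):
    """Neumaier's variant of Kahan summation (illustrative, not the library's code)."""
    total = 0.0
    compensation = 0.0  # running estimate of the low-order bits lost so far
    for v in values:
        t = total + v
        if abs(total) >= abs(v):
            compensation += (total - t) + v  # low-order bits of v were lost
        else:
            compensation += (v - t) + total  # low-order bits of total were lost
        total = t
    return total + compensation


values = [1.0, 1e100, 1.0, -1e100]
print(naive_sum(values))     # 0.0: both 1.0 terms vanished
print(neumaier_sum(values))  # 2.0
print(math.fsum(values))     # 2.0, the standard library's exactly rounded sum
```

The compensated loop does several times the floating-point work per element, which is the kind of cost that can make a stable kernel slower than a throughput-tuned BLAS routine.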

A full rewrite 8x the size of the original project naturally came with breaking changes. Most notably, the project is renamed from "SimSIMD" to "NumKong", aligning it with StringZilla. Both projects now share even more DNA — from the choice of somewhat archaic design principles to the overall project organization. Among the subtler breaking changes:

- The "cosine" distance is now called "angular" to avoid confusion with trigonometric element-wise functions.
- `flush_denormals` becomes `configure_thread` and now also enables AMX and SME.
- `DistancesTensor` in Python is replaced with `Tensor`.
- All kernels now have distinct output types instead of uniformly returning `simsimd_distance_t`.
- The ABI for element-wise operations changed to accept scaling factors by pointer.
- 32-bit systems now use 32-bit integers for indexing...

And more! A detailed write-up with a transition and integration guide is coming soon — stay tuned 🤗

markusalbertgraf · 2026-03-01T19:46:23Z

Wow!!!
Trying to build on FreeBSD to check as mentioned in #300.

Maybe there is some code not yet checked in ;-)

```
[ 46%] Building CXX object CMakeFiles/nk_test.dir/test/test_each.cpp.o
In file included from c/SimSIMD/test/test_each.cpp:9:
c/SimSIMD/include/numkong/each.hpp:48:52: error: use of undeclared identifier 'saturating_add'
   48 | for (std::size_t i = 0; i < d; i++) c[i] = saturating_add(a[i], b[i]);
```
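For readers unfamiliar with the operation the failing line expects, the sketch below shows what a saturating add conventionally does, next to ordinary two's-complement wraparound. Only the name `saturating_add` comes from the error message; the signature, the default 8-bit signed bounds, and the helper `wrapping_add_i8` are assumptions for illustration and may differ from the real declaration in the library.

```python
def saturating_add(a, b, lo=-128, hi=127):
    """Add two integers and clamp the result to [lo, hi] (int8 bounds by default)."""
    return max(lo, min(hi, a + b))


def wrapping_add_i8(a, b):
    """What a plain 8-bit signed add does on overflow: wrap around modulo 256."""
    return (a + b + 128) % 256 - 128


print(saturating_add(100, 100))    # 127, pinned at the int8 maximum
print(wrapping_add_i8(100, 100))   # -56, the sign flips on overflow
print(saturating_add(-100, -100))  # -128, pinned at the int8 minimum
```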

MSVC 19.44 optimizer drops {k} register values replacing masked stores with normal ones, producing out-of-bounds writes for snippets like:

```
__m256i result = _mm512_cvtepi64_epi32(clamped);
_mm256_mask_storeu_epi32(to_ptr, mask, result);
```

The produced code looks like:

```
vpmovqd YMMWORD PTR [r9], zmm3
```

Instead of:

```
vpmovqd YMMWORD PTR [r9] {k1}, zmm3
```
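Why dropping the mask turns into an out-of-bounds write can be shown with a small Python emulation of an 8-lane store. This models only the semantics of a masked versus an unmasked store; the helper names `masked_store` and `unmasked_store` and the sentinel-padded buffer are hypothetical, and nothing here reproduces the compiler behavior itself.

```python
def masked_store(dest, offset, lanes, mask):
    """Write lane i to dest[offset + i] only if bit i of mask is set."""
    for i, value in enumerate(lanes):
        if (mask >> i) & 1:
            dest[offset + i] = value


def unmasked_store(dest, offset, lanes):
    """Write every lane unconditionally, as a store without {k1} would."""
    for i, value in enumerate(lanes):
        dest[offset + i] = value


lanes = [10, 11, 12, 13, 14, 15, 16, 17]  # eight 32-bit lanes
mask = 0b00011111                         # only the first five lanes are valid

# Five output slots followed by three sentinels standing in for memory
# that belongs to someone else.
safe = [0] * 5 + [-1] * 3
masked_store(safe, 0, lanes, mask)
print(safe)       # [10, 11, 12, 13, 14, -1, -1, -1]

clobbered = [0] * 5 + [-1] * 3
unmasked_store(clobbered, 0, lanes)
print(clobbered)  # [10, 11, 12, 13, 14, 15, 16, 17], the sentinels are overwritten
```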

lin72h · 2026-03-17T23:10:56Z

Congratulations on this long time effort.

ashvardanian · 2026-03-18T00:04:05Z

Thank you! Started it in '23, opened the PR in late '24, and finally pushed it through the line with a few months of concentrated "stay-home" effort from the end of '25 till March '26. And there are 2 larger projects I still hope to ship this spring unless my 🧠💥... 😅

ashvardanian changed the title ~~Element-wise BLAS-like APIs~~ Element-wise BLASAPIs & new Tensor for Python Nov 1, 2024

ashvardanian changed the title ~~Element-wise BLASAPIs & new Tensor for Python~~ Element-wise BLAS APIs & new Tensor for Python Nov 1, 2024

ashvardanian force-pushed the main-elementwise branch 3 times, most recently from 56f1a5d to 38df49c Compare November 8, 2024 15:33

ashvardanian changed the title ~~Element-wise BLAS APIs & new Tensor for Python~~ Element-wise BLAS APIs & new Tensor for Python: ⬆️ 450 kernels Nov 9, 2024

ashvardanian force-pushed the main-elementwise branch 3 times, most recently from ecb475a to e568e6c Compare November 11, 2024 17:58

ashvardanian mentioned this pull request Nov 17, 2024

Fix: Refs #207 Fix implementation of Jensen Shannon measure #233

Merged

ashvardanian force-pushed the main-elementwise branch 2 times, most recently from fbb990f to 9ef0430 Compare December 21, 2025 18:29

ashvardanian changed the title ~~Element-wise BLAS APIs & new Tensor for Python: ⬆️ 450 kernels~~ v7: ⬆️ 500 Kernels for Mixed Precision Numerics on CPUs Dec 27, 2025

ashvardanian force-pushed the main-elementwise branch from a16ba76 to 4f33520 Compare January 5, 2026 22:40

ashvardanian mentioned this pull request Jan 8, 2026

Bug: Integer underflow in Index::size() during concurrent add/remove. unum-cloud/USearch#697

Closed

3 tasks

ashvardanian changed the title ~~v7: ⬆️ 500 Kernels for Mixed Precision Numerics on CPUs~~ v7: ⬆️ 1,000+ Kernels for Mixed Precision Numerics on CPUs Jan 25, 2026

ashvardanian changed the title ~~v7: ⬆️ 1,000+ Kernels for Mixed Precision Numerics on CPUs~~ v7: ⬆️ 1,000+ Kernels for Mixed Precision Numerics on x86, Arm, and RISC-V CPUs Feb 6, 2026

ashvardanian mentioned this pull request Feb 13, 2026

Feature: More accurate memory statistics unum-cloud/USearch#608

Closed

3 tasks

This was linked to issues Feb 17, 2026

Feature: mean of the array np.mean => simsimd.mean #248

Closed

Detect Overflows #206

Closed

Feature: How is WASM Compiled? Are you using wasm_simd128 and msimd128 ? #189

Closed

AMX support for tiled matrix multiplications #26

Closed

This was referenced Feb 17, 2026

Sparse Distances #100

Open

GoLang bindings #28

Open

ashvardanian linked an issue Feb 17, 2026 that may be closed by this pull request

Feature: Add high level OP for MaxSim calculation #272

Closed

3 tasks

ashvardanian changed the title ~~v7: ⬆️ 1,000+ Kernels for Mixed Precision Numerics on x86, Arm, and RISC-V CPUs~~ v7: ⬆️ 1,500+ Kernels for Mixed Precision Numerics on x86, Arm, and RISC-V CPUs Feb 24, 2026

ashvardanian mentioned this pull request Feb 28, 2026

Add FreeBSD to target selection. #300

Merged

ashvardanian force-pushed the main-elementwise branch from 55f6463 to 6dcddbc Compare March 3, 2026 20:14

ashvardanian mentioned this pull request Mar 4, 2026

Initial WASM support. #242

Closed

Merge: Harden C++ SDK & cast logic

47ecd23

ashvardanian force-pushed the main-elementwise branch from 4b75bf1 to 47ecd23 Compare March 16, 2026 18:42

ashvardanian added 17 commits March 16, 2026 19:23

Make: Randomize PyTests in CI

13477ba

Fix: Unused variables & designated initializers

2fe1881

Make: Adjust probing for cross-compilation

ffd1f7b

Fix: Prioritize Ice Lake over Alder Lake

e45a023

Fix: Refresh dispatch tables after changing static caps

c3535e3

Make: Shuffle PyTests in CI

b0f00af

Fix: Wire kernels into runtime dispatch

8b2236d

Add: e3m2 Alder spatial kernels

bf2ba46

Alder Lake spatial kernels: ~11.5 GB/s. Skylake: ~5.7-6.2 GB/s. Haswell: ~4.5-4.8 GB/s.

Merge: PyTest caps change & Alder kernels

b362cda

Range: 13477ba..b0f00af

Fix: Success code for non-x86/Arm thread configuration

c118f3a

Make: Prevent constant folding in probes

ed918e2

Fix: Wire missing tests/benchmarks

581aeaa

Improve: MXCSR-independent rounding in trigonometry

f8c444f

Fix: Mantissa overflow in F32 → F16 rounding in RVV

cc4e5b4

Fix: reduce_minmax strict aliasing

67fa023

Make: Lower base RISC-V generation for Python wheels

b7b013f

Fix: Wire each/, mesh/, curved/ RVV kernels

0a152c3

ashvardanian changed the title ~~SimSIMD 🪦 → NumKong 🦍 v7 bringing 1,900+ SIMD APIs~~ SimSIMD 🪦 → NumKong 🦍 v7 bringing 2'000+ SIMD APIs Mar 17, 2026

ashvardanian added 2 commits March 17, 2026 13:30

Fix: Wire remaining kernels in 5 places each

eea2ffb

This brings the current state of the project to 2'000 kernels including both SIMD and serial endpoints.

Make: Drop RISC-V wheels until Clang 23

bd49085

ashvardanian force-pushed the main-elementwise branch from 3a293e9 to bd49085 Compare March 17, 2026 14:09

ashvardanian added 2 commits March 17, 2026 15:12

Make: Consistent CMake toolchain flags

c767dc6

ashvardanian merged commit c7cdcda into main Mar 17, 2026
82 of 83 checks passed

This was referenced Mar 24, 2026

add numkong output to simsimd conda-forge/admin-requests#1957

Closed

add numkong output to simsimd conda-forge/admin-requests#1958

Merged

ashvardanian merged 1096 commits into main from main-elementwise
