
NVFP4 for Rust & C++ #350

Open

ashvardanian wants to merge 4 commits into main-dev from main-v8-block-scaling

NVFP4 for Rust & C++#350
ashvardanian wants to merge 4 commits into
main-devfrom
main-v8-block-scaling

Conversation

@ashvardanian
Owner

No description provided.

Tensors now exchange zero-copy in both directions with PyTorch, NumPy,
JAX, CuPy, TensorFlow, PyArrow, MLX, ONNX Runtime, TVM, MXNet, NNabla --
every library implementing the Array API's DLPack protocol. PEP 3118 and
__array_interface__ cannot describe bf16 / fp8 / fp6, so those dtypes
previously round-tripped through NumPy as raw uint bytes and lost their
semantic identity. DLPack 1.3 carries kDLBfloat / kDLFloat8_e4m3fn /
kDLFloat8_e5m2 / kDLFloat6_* codes -- the dtype now survives the bridge.
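As a sketch of why the dtype now survives: DLPack describes every element
type with a three-field descriptor, so bf16 is distinguishable from a raw
16-bit unsigned container. The struct layout and the `kDLBfloat` / `kDLUInt`
code values below follow `dlpack.h`; the `nk_*` helper names are
illustrative, not NumKong's actual API.

```c
#include <stdint.h>

/* Minimal re-declaration of DLPack's dtype descriptor (layout as in
 * dlpack.h): `code` names the type class, `bits` the element width,
 * `lanes` the vector width. */
typedef enum {
  kDLInt = 0,
  kDLUInt = 1,
  kDLFloat = 2,
  kDLBfloat = 4,
} DLDataTypeCodeSketch;

typedef struct {
  uint8_t code;
  uint8_t bits;
  uint16_t lanes;
} DLDataTypeSketch;

/* bf16 travels as {kDLBfloat, 16, 1}. The old PEP 3118 fallback could only
 * express the same bytes as {kDLUInt, 16, 1}-equivalent raw storage. */
static DLDataTypeSketch nk_bf16_dtype(void) {
  DLDataTypeSketch t = {kDLBfloat, 16, 1};
  return t;
}
static DLDataTypeSketch nk_u16_dtype(void) {
  DLDataTypeSketch t = {kDLUInt, 16, 1};
  return t;
}
```

The two descriptors carry identical bytes but different `code` fields, which
is exactly the semantic identity the uint round-trip used to discard.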

The importer accepts every device_type whose pointer is host-dereferenceable:
plain CPU, pinned host (CUDA / ROCm), cudaMallocManaged unified memory,
Intel oneAPI host / shared USM, and Metal on Apple Silicon. Pure device
memory (kDLCUDA, kDLROCM, kDLOpenCL, kDLVulkan, kDLWebGPU, kDLHexagon,
kDLMAIA, kDLTrn, kDLVPI, kDLExtDev) is rejected with the device code
named. The exporter stays strict -- only (kDLCPU, 0) is emitted.
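The acceptance rule above can be sketched as a single predicate over the
DLPack device-type code. The numeric codes follow `dlpack.h`; the function
name is hypothetical and the mapping is a simplification (e.g. oneAPI USM
is accepted here wholesale, whereas the importer distinguishes host/shared
from device USM).

```c
#include <stdbool.h>
#include <stdint.h>

/* Device-type codes as defined in dlpack.h. */
enum {
  kDLCPU = 1, kDLCUDA = 2, kDLCUDAHost = 3, kDLMetal = 8,
  kDLROCM = 10, kDLROCMHost = 11, kDLCUDAManaged = 13, kDLOneAPI = 14,
};

/* Illustrative predicate: accept any device whose pointers the CPU can
 * dereference, reject pure device memory. */
static bool nk_is_host_dereferenceable(int32_t device_type) {
  switch (device_type) {
    case kDLCPU:
    case kDLCUDAHost:    /* cudaMallocHost pinned memory */
    case kDLROCMHost:    /* hipHostMalloc pinned memory */
    case kDLCUDAManaged: /* cudaMallocManaged unified memory */
    case kDLOneAPI:      /* host / shared USM */
    case kDLMetal:       /* unified memory on Apple Silicon */
      return true;
    default:             /* kDLCUDA, kDLROCM, kDLVulkan, ... */
      return false;
  }
}
```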

Legacy v0 ("dltensor") and versioned v1.3 ("dltensor_versioned") capsules
are both supported via the max_version handshake. FP6 silently upgrades
to v1 because the IS_SUBBYTE_TYPE_PADDED flag -- required for NumKong's
byte-padded layout -- only exists on the versioned struct. Sub-byte
types (u1, u4, i4) exchange as byte containers. Consumed capsules are
renamed to "used_dltensor[_versioned]" per spec; the producer's deleter
runs exactly once via an internal owner that holds the GIL across
Py_XDECREF and PyMem_Free, since PyTorch's c10 finalizer invokes
deleters without the GIL on CPython 3.12.
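The exactly-once ownership pattern can be sketched without the Python
specifics (capsule renaming, GIL acquisition): an owner wraps the
producer's deleter behind a consumed flag, so a second release is a no-op.
All names here are hypothetical, not NumKong's actual internals.

```c
#include <stddef.h>

/* Sketch of a once-only owner around a DLPack producer's deleter. */
typedef struct {
  void (*deleter)(void *ctx);
  void *ctx;
  int consumed;
} nk_dlpack_owner_t; /* illustrative name */

static void nk_owner_release(nk_dlpack_owner_t *owner) {
  if (owner->consumed) return; /* capsule already renamed "used_dltensor" */
  owner->consumed = 1;
  if (owner->deleter) owner->deleter(owner->ctx);
  /* In the real binding this is where the GIL is taken before
   * Py_XDECREF / PyMem_Free, since c10 may call in without it. */
}

/* Demo: count how often the deleter fires across a double release. */
static void nk_demo_deleter(void *ctx) { ++*(int *)ctx; }

static int nk_demo_release_twice(void) {
  int count = 0;
  nk_dlpack_owner_t owner = {nk_demo_deleter, &count, 0};
  nk_owner_release(&owner);
  nk_owner_release(&owner); /* second call must be a no-op */
  return count;
}
```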

The exchanged ABI is declared inline (six structs, twelve device codes,
version.major checked at runtime) rather than vendoring dmlc/dlpack.h --
mirroring NumPy's approach in numpy/_core/src/multiarray/dlpack.c.
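An inline re-declaration of that ABI looks roughly like the following. The
field order and widths mirror `dlpack.h` (which is what makes vendoring
unnecessary); the `*Sketch` names and the version-check helper are
illustrative.

```c
#include <stdint.h>
#include <stddef.h>

/* Inline mirrors of the DLPack ABI structs -- field order must match. */
typedef struct { int32_t device_type; int32_t device_id; } DLDeviceSketch;
typedef struct { uint8_t code; uint8_t bits; uint16_t lanes; } DLDataTypeSk;
typedef struct {
  void *data;
  DLDeviceSketch device;
  int32_t ndim;
  DLDataTypeSk dtype;
  int64_t *shape;
  int64_t *strides;
  uint64_t byte_offset;
} DLTensorSketch;
typedef struct { uint32_t major; uint32_t minor; } DLPackVersionSketch;

/* Runtime guard: a consumer rejects any major version it doesn't know. */
static int nk_version_ok(DLPackVersionSketch v) { return v.major == 1; }

static int nk_demo_version_check(uint32_t major) {
  DLPackVersionSketch v = {major, 0};
  return nk_version_ok(v);
}
```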

Per-framework export + import matrices cover every native dtype,
including bf16 + fp8_e4m3fn + fp8_e5m2 on PyTorch and bf16 on JAX /
TensorFlow / MLX. Three CuPy tests enforce the device-acceptance
contract on real hardware (H100): cudaMalloc rejects, cudaMallocManaged
accepts, cudaMallocHost accepts. With torch 2.11 / numpy 2.4 / jax 0.10
/ tensorflow-cpu 2.21 / pyarrow 23 / cupy-cuda12x 13.6 / onnxruntime
1.24 on Python 3.12: 127 passed, 1 skipped (ONNX RT inference build
lacks to_dlpack per microsoft/onnxruntime#23110).

Apply `#pragma GCC optimize("no-tree-vectorize","no-tree-slp-vectorize",
"no-ipa-cp-clone","no-inline")` (Clang: `noinline` attribute push) around
the serial kernel instantiation regions in dots, mesh, spatials, cast, each,
spatial, geospatial, reduce, maxsim, dot, and sparse `*/serial.h` files.
Without this, -O3 + LTO cross-TU codegen cloned serial kernels into
AVX-512 bodies under the dispatch callers, bloating the binary and
violating the nk_*_serial-as-scalar-oracle contract that the tests and
the numerical-stability docs rely on.
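The fenced region looks roughly like this under GCC (Clang needs the
attribute-push variant instead, so the pragmas are guarded out here). The
kernel is a stand-in, not NumKong's actual symbol; the point is that code
between push_options and pop_options keeps scalar codegen even when the
TU or LTO link is built with -O3 and wide vector ISAs enabled.

```c
#include <stddef.h>

#if defined(__GNUC__) && !defined(__clang__)
#pragma GCC push_options
#pragma GCC optimize("no-tree-vectorize", "no-tree-slp-vectorize", "no-ipa-cp-clone", "no-inline")
#endif

/* Illustrative serial kernel: stays a plain scalar loop, preserving the
 * "serial kernel as scalar oracle" contract. */
static float nk_dot_f32_serial_sketch(float const *a, float const *b,
                                      size_t n) {
  float sum = 0.0f;
  for (size_t i = 0; i != n; ++i) sum += a[i] * b[i];
  return sum;
}

#if defined(__GNUC__) && !defined(__clang__)
#pragma GCC pop_options
#endif

/* Demo: dot([1,2,3], [4,5,6]) = 4 + 10 + 18 = 32. */
static float nk_demo_dot(void) {
  float a[] = {1.0f, 2.0f, 3.0f};
  float b[] = {4.0f, 5.0f, 6.0f};
  return nk_dot_f32_serial_sketch(a, b, 3);
}
```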

Also collapse the `_aligned_` / `_1x8_aligned_` fast-path helpers inside
`nk_define_cross_packed_` / `nk_define_cross_compensated_packed_` and the
`_galloping_search_` / `_linear_scan_` helpers inside
`nk_define_sparse_intersect_` from NK_PUBLIC to NK_INTERNAL so they inline
into their dispatchers -- removing a structural source of path divergence
rather than just overriding compiler policy.
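For context, a galloping search of the kind the sparse-intersect helper
performs can be sketched as below. This is a generic reconstruction, not
NumKong's implementation; `static inline` stands in for the NK_INTERNAL
effect of letting the dispatcher absorb the helper instead of calling an
exported symbol.

```c
#include <stddef.h>
#include <stdint.h>

/* First index in sorted keys[0..n) with keys[i] >= needle: gallop to
 * bracket the answer in O(log i), then binary-search the bracket. */
static inline size_t nk_galloping_lower_bound(uint32_t const *keys,
                                              size_t n, uint32_t needle) {
  size_t hi = 1;
  while (hi < n && keys[hi] < needle) hi *= 2; /* gallop: 1, 2, 4, ... */
  size_t lo = hi / 2;                          /* keys[lo] < needle or lo==0 */
  if (hi > n) hi = n;
  while (lo < hi) {                            /* binary search in [lo, hi) */
    size_t mid = lo + (hi - lo) / 2;
    if (keys[mid] < needle) lo = mid + 1; else hi = mid;
  }
  return lo;
}

/* Demo over a fixed sorted list of odd keys. */
static size_t nk_demo_search(uint32_t needle) {
  static uint32_t const keys[] = {1, 3, 5, 7, 9, 11, 13, 15};
  return nk_galloping_lower_bound(keys, 8, needle);
}
```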

Portable libnumkong.so: 4.51 MB → 3.56 MB (−930 KB, −21%). Side effect:
graniteamx kernels that LTO was silently DCE-ing now survive to link
(+38 KB), which was a latent perf regression on Granite Rapids.
