
NVFP4 for Rust & C++ #350

Open

ashvardanian wants to merge 4 commits into main-dev from main-v8-block-scaling

NVFP4 for Rust & C++#350
ashvardanian wants to merge 4 commits into
main-devfrom
main-v8-block-scaling

Conversation

@ashvardanian
Owner

No description provided.

Tensors now exchange zero-copy in both directions with PyTorch, NumPy,
JAX, CuPy, TensorFlow, PyArrow, MLX, ONNX Runtime, TVM, MXNet, NNabla --
every library implementing the Array API's DLPack protocol. PEP 3118 and
__array_interface__ cannot describe bf16 / fp8 / fp6, so those dtypes
previously round-tripped through NumPy as raw uint bytes and lost their
semantic identity. DLPack 1.3 carries kDLBfloat / kDLFloat8_e4m3fn /
kDLFloat8_e5m2 / kDLFloat6_* codes -- the dtype now survives the bridge.
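As a sketch of why the dtype now survives: DLPack describes every element
type with a three-field descriptor, so bf16 is distinguishable from a raw
16-bit unsigned container. The struct layout and the `kDLBfloat` / `kDLUInt`
code values below follow `dlpack.h`; the `nk_*` helper names are
illustrative, not NumKong's actual API.

```c
#include <stdint.h>

/* Minimal re-declaration of DLPack's dtype descriptor (layout as in
 * dlpack.h): `code` names the type class, `bits` the element width,
 * `lanes` the vector width. */
typedef enum {
  kDLInt = 0,
  kDLUInt = 1,
  kDLFloat = 2,
  kDLBfloat = 4,
} DLDataTypeCodeSketch;

typedef struct {
  uint8_t code;
  uint8_t bits;
  uint16_t lanes;
} DLDataTypeSketch;

/* bf16 travels as {kDLBfloat, 16, 1}. The old PEP 3118 fallback could only
 * express the same bytes as {kDLUInt, 16, 1}-equivalent raw storage. */
static DLDataTypeSketch nk_bf16_dtype(void) {
  DLDataTypeSketch t = {kDLBfloat, 16, 1};
  return t;
}
static DLDataTypeSketch nk_u16_dtype(void) {
  DLDataTypeSketch t = {kDLUInt, 16, 1};
  return t;
}
```

The two descriptors carry identical bytes but different `code` fields, which
is exactly the semantic identity the uint round-trip used to discard.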

The importer accepts every device_type whose pointer is host-dereferenceable:
plain CPU, pinned host (CUDA / ROCm), cudaMallocManaged unified memory,
Intel oneAPI host / shared USM, and Metal on Apple Silicon. Pure device
memory (kDLCUDA, kDLROCM, kDLOpenCL, kDLVulkan, kDLWebGPU, kDLHexagon,
kDLMAIA, kDLTrn, kDLVPI, kDLExtDev) is rejected with the device code
named. The exporter stays strict -- only (kDLCPU, 0) is emitted.
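The acceptance rule above can be sketched as a single predicate over the
DLPack device-type code. The numeric codes follow `dlpack.h`; the function
name is hypothetical and the mapping is a simplification (e.g. oneAPI USM
is accepted here wholesale, whereas the importer distinguishes host/shared
from device USM).

```c
#include <stdbool.h>
#include <stdint.h>

/* Device-type codes as defined in dlpack.h. */
enum {
  kDLCPU = 1, kDLCUDA = 2, kDLCUDAHost = 3, kDLMetal = 8,
  kDLROCM = 10, kDLROCMHost = 11, kDLCUDAManaged = 13, kDLOneAPI = 14,
};

/* Illustrative predicate: accept any device whose pointers the CPU can
 * dereference, reject pure device memory. */
static bool nk_is_host_dereferenceable(int32_t device_type) {
  switch (device_type) {
    case kDLCPU:
    case kDLCUDAHost:    /* cudaMallocHost pinned memory */
    case kDLROCMHost:    /* hipHostMalloc pinned memory */
    case kDLCUDAManaged: /* cudaMallocManaged unified memory */
    case kDLOneAPI:      /* host / shared USM */
    case kDLMetal:       /* unified memory on Apple Silicon */
      return true;
    default:             /* kDLCUDA, kDLROCM, kDLVulkan, ... */
      return false;
  }
}
```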

Legacy v0 ("dltensor") and versioned v1.3 ("dltensor_versioned") capsules
are both supported via the max_version handshake. FP6 silently upgrades
to v1 because the IS_SUBBYTE_TYPE_PADDED flag -- required for NumKong's
byte-padded layout -- only exists on the versioned struct. Sub-byte
types (u1, u4, i4) exchange as byte containers. Consumed capsules are
renamed to "used_dltensor[_versioned]" per spec; the producer's deleter
runs exactly once via an internal owner that holds the GIL across
Py_XDECREF and PyMem_Free, since PyTorch's c10 finalizer invokes
deleters without the GIL on CPython 3.12.
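The exactly-once ownership pattern can be sketched without the Python
specifics (capsule renaming, GIL acquisition): an owner wraps the
producer's deleter behind a consumed flag, so a second release is a no-op.
All names here are hypothetical, not NumKong's actual internals.

```c
#include <stddef.h>

/* Sketch of a once-only owner around a DLPack producer's deleter. */
typedef struct {
  void (*deleter)(void *ctx);
  void *ctx;
  int consumed;
} nk_dlpack_owner_t; /* illustrative name */

static void nk_owner_release(nk_dlpack_owner_t *owner) {
  if (owner->consumed) return; /* capsule already renamed "used_dltensor" */
  owner->consumed = 1;
  if (owner->deleter) owner->deleter(owner->ctx);
  /* In the real binding this is where the GIL is taken before
   * Py_XDECREF / PyMem_Free, since c10 may call in without it. */
}

/* Demo: count how often the deleter fires across a double release. */
static void nk_demo_deleter(void *ctx) { ++*(int *)ctx; }

static int nk_demo_release_twice(void) {
  int count = 0;
  nk_dlpack_owner_t owner = {nk_demo_deleter, &count, 0};
  nk_owner_release(&owner);
  nk_owner_release(&owner); /* second call must be a no-op */
  return count;
}
```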

The exchanged ABI is declared inline (six structs, twelve device codes,
version.major checked at runtime) rather than vendoring dmlc/dlpack.h --
mirroring NumPy's approach in numpy/_core/src/multiarray/dlpack.c.
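An inline re-declaration of that ABI looks roughly like the following. The
field order and widths mirror `dlpack.h` (which is what makes vendoring
unnecessary); the `*Sketch` names and the version-check helper are
illustrative.

```c
#include <stdint.h>
#include <stddef.h>

/* Inline mirrors of the DLPack ABI structs -- field order must match. */
typedef struct { int32_t device_type; int32_t device_id; } DLDeviceSketch;
typedef struct { uint8_t code; uint8_t bits; uint16_t lanes; } DLDataTypeSk;
typedef struct {
  void *data;
  DLDeviceSketch device;
  int32_t ndim;
  DLDataTypeSk dtype;
  int64_t *shape;
  int64_t *strides;
  uint64_t byte_offset;
} DLTensorSketch;
typedef struct { uint32_t major; uint32_t minor; } DLPackVersionSketch;

/* Runtime guard: a consumer rejects any major version it doesn't know. */
static int nk_version_ok(DLPackVersionSketch v) { return v.major == 1; }

static int nk_demo_version_check(uint32_t major) {
  DLPackVersionSketch v = {major, 0};
  return nk_version_ok(v);
}
```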

Per-framework export + import matrices cover every native dtype,
including bf16 + fp8_e4m3fn + fp8_e5m2 on PyTorch and bf16 on JAX /
TensorFlow / MLX. Three CuPy tests enforce the device-acceptance
contract on real hardware (H100): cudaMalloc rejects, cudaMallocManaged
accepts, cudaMallocHost accepts. With torch 2.11 / numpy 2.4 / jax 0.10
/ tensorflow-cpu 2.21 / pyarrow 23 / cupy-cuda12x 13.6 / onnxruntime
1.24 on Python 3.12: 127 passed, 1 skipped (ONNX RT inference build
lacks to_dlpack per microsoft/onnxruntime#23110).

Apply `#pragma GCC optimize("no-tree-vectorize","no-tree-slp-vectorize",
"no-ipa-cp-clone","no-inline")` (Clang: `noinline` attribute push) around
the serial kernel instantiation regions in dots, mesh, spatials, cast, each,
spatial, geospatial, reduce, maxsim, dot, and sparse `*/serial.h` files.
Without this, -O3 + LTO cross-TU codegen cloned serial kernels into
AVX-512 bodies under the dispatch callers, bloating the binary and
violating the nk_*_serial-as-scalar-oracle contract that the tests and
the numerical-stability docs rely on.
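The fenced region looks roughly like this under GCC (Clang needs the
attribute-push variant instead, so the pragmas are guarded out here). The
kernel is a stand-in, not NumKong's actual symbol; the point is that code
between push_options and pop_options keeps scalar codegen even when the
TU or LTO link is built with -O3 and wide vector ISAs enabled.

```c
#include <stddef.h>

#if defined(__GNUC__) && !defined(__clang__)
#pragma GCC push_options
#pragma GCC optimize("no-tree-vectorize", "no-tree-slp-vectorize", "no-ipa-cp-clone", "no-inline")
#endif

/* Illustrative serial kernel: stays a plain scalar loop, preserving the
 * "serial kernel as scalar oracle" contract. */
static float nk_dot_f32_serial_sketch(float const *a, float const *b,
                                      size_t n) {
  float sum = 0.0f;
  for (size_t i = 0; i != n; ++i) sum += a[i] * b[i];
  return sum;
}

#if defined(__GNUC__) && !defined(__clang__)
#pragma GCC pop_options
#endif

/* Demo: dot([1,2,3], [4,5,6]) = 4 + 10 + 18 = 32. */
static float nk_demo_dot(void) {
  float a[] = {1.0f, 2.0f, 3.0f};
  float b[] = {4.0f, 5.0f, 6.0f};
  return nk_dot_f32_serial_sketch(a, b, 3);
}
```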

Also collapse the `_aligned_` / `_1x8_aligned_` fast-path helpers inside
`nk_define_cross_packed_` / `nk_define_cross_compensated_packed_` and the
`_galloping_search_` / `_linear_scan_` helpers inside
`nk_define_sparse_intersect_` from NK_PUBLIC to NK_INTERNAL so they inline
into their dispatchers -- removing a structural source of path divergence
rather than just overriding compiler policy.
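For context, a galloping search of the kind the sparse-intersect helper
performs can be sketched as below. This is a generic reconstruction, not
NumKong's implementation; `static inline` stands in for the NK_INTERNAL
effect of letting the dispatcher absorb the helper instead of calling an
exported symbol.

```c
#include <stddef.h>
#include <stdint.h>

/* First index in sorted keys[0..n) with keys[i] >= needle: gallop to
 * bracket the answer in O(log i), then binary-search the bracket. */
static inline size_t nk_galloping_lower_bound(uint32_t const *keys,
                                              size_t n, uint32_t needle) {
  size_t hi = 1;
  while (hi < n && keys[hi] < needle) hi *= 2; /* gallop: 1, 2, 4, ... */
  size_t lo = hi / 2;                          /* keys[lo] < needle or lo==0 */
  if (hi > n) hi = n;
  while (lo < hi) {                            /* binary search in [lo, hi) */
    size_t mid = lo + (hi - lo) / 2;
    if (keys[mid] < needle) lo = mid + 1; else hi = mid;
  }
  return lo;
}

/* Demo over a fixed sorted list of odd keys. */
static size_t nk_demo_search(uint32_t needle) {
  static uint32_t const keys[] = {1, 3, 5, 7, 9, 11, 13, 15};
  return nk_galloping_lower_bound(keys, 8, needle);
}
```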

Portable libnumkong.so: 4.51 MB → 3.56 MB (−930 KB, −21%). Side effect:
graniteamx kernels that LTO was silently DCE-ing now survive to link
(+38 KB), which was a latent perf regression on Granite Rapids.
