NVFP4 for Rust & C++ #350
Tensors now exchange zero-copy in both directions with PyTorch, NumPy,
JAX, CuPy, TensorFlow, PyArrow, MLX, ONNX Runtime, TVM, MXNet, NNabla --
every library implementing the DLPack protocol from the Python Array API standard.
PEP 3118 and `__array_interface__` cannot describe bf16 / fp8 / fp6, so
those dtypes round-tripped through NumPy as raw uint bytes and lost
their semantic identity. DLPack 1.3 carries kDLBfloat / kDLFloat8_e4m3fn
/ kDLFloat8_e5m2 / kDLFloat6_* codes -- the dtype now survives the bridge.
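For orientation, the dtype slice of that ABI looks roughly like this -- a sketch, with names from the DLPack spec and numeric values as published in the upstream v1.x dlpack.h:

```c
#include <stdint.h>

/* Subset of DLDataTypeCode covering the dtypes named above;
 * values per the upstream DLPack v1.x header. */
typedef enum {
    kDLFloat = 2U,
    kDLBfloat = 4U,          /* bf16: inexpressible in PEP 3118 */
    kDLFloat8_e4m3fn = 10U,
    kDLFloat8_e5m2 = 12U,
    kDLFloat6_e2m3fn = 15U,
    kDLFloat6_e3m2fn = 16U,
} DLDataTypeCode;

typedef struct {
    uint8_t code;   /* one of the DLDataTypeCode values */
    uint8_t bits;   /* 16 for bf16, 8 for fp8, 6 for fp6 */
    uint16_t lanes; /* 1 for plain scalars */
} DLDataType;

/* bf16 now travels as {kDLBfloat, 16, 1} instead of raw uint8 bytes. */
```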
The importer accepts every device_type whose pointer is host-dereferenceable:
plain CPU, pinned host (CUDA / ROCm), cudaMallocManaged unified memory,
Intel oneAPI host / shared USM, and Metal on Apple Silicon. Pure device
memory (kDLCUDA, kDLROCM, kDLOpenCL, kDLVulkan, kDLWebGPU, kDLHexagon,
kDLMAIA, kDLTrn, kDLVPI, kDLExtDev) is rejected with the device code
named. The exporter stays strict -- only (kDLCPU, 0) is emitted.
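The import gate reduces to a switch over `device_type`; a minimal sketch, with an illustrative helper name and only the enum subset it needs (values per the DLPack header):

```c
#include <stdint.h>

typedef enum { /* subset of DLDeviceType */
    kDLCPU = 1, kDLCUDAHost = 3, kDLMetal = 8,
    kDLROCMHost = 11, kDLCUDAManaged = 13, kDLOneAPI = 14,
} DLDeviceType;

/* Illustrative helper: only pointers the host can dereference pass. */
static int nk_device_is_host_dereferenceable_(int32_t device_type) {
    switch (device_type) {
    case kDLCPU:         /* plain host memory */
    case kDLCUDAHost:    /* cudaMallocHost / hipHostMalloc pinned memory */
    case kDLCUDAManaged: /* cudaMallocManaged unified memory */
    case kDLROCMHost:    /* ROCm pinned host memory */
    case kDLOneAPI:      /* Intel USM host / shared allocations */
    case kDLMetal:       /* unified memory on Apple Silicon */
        return 1;
    default:             /* kDLCUDA, kDLROCM, kDLVulkan, kDLWebGPU, ... */
        return 0;        /* rejected, naming the numeric device code */
    }
}
```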
Legacy v0 ("dltensor") and versioned v1.3 ("dltensor_versioned") capsules
are both supported via the max_version handshake. FP6 silently upgrades
to v1 because the IS_SUBBYTE_TYPE_PADDED flag -- required for NumKong's
byte-padded layout -- only exists on the versioned struct. Sub-byte
types (u1, u4, i4) exchange as byte containers. Consumed capsules are
renamed to "used_dltensor[_versioned]" per spec; the producer's deleter
runs exactly once via an internal owner that holds the GIL across
Py_XDECREF and PyMem_Free, since PyTorch's c10 finalizer invokes
deleters without the GIL on CPython 3.12.
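Roughly, the consumption path looks like this -- a sketch with illustrative names; `DLManagedTensorVersioned` is abbreviated in the ABI sketch below:

```c
#include <Python.h>

typedef struct DLManagedTensorVersioned DLManagedTensorVersioned;

/* Illustrative owner for an imported tensor. Its finalizer may be
 * invoked without the GIL (PyTorch's c10 does so on CPython 3.12),
 * so it re-acquires the GIL before touching Python state. */
typedef struct {
    DLManagedTensorVersioned *managed;           /* producer's tensor */
    void (*deleter)(DLManagedTensorVersioned *); /* copied at import */
    PyObject *producer;                          /* keeps exporter alive */
} nk_dlpack_owner_t;

static void nk_dlpack_owner_free_(nk_dlpack_owner_t *owner) {
    PyGILState_STATE gil = PyGILState_Ensure();
    if (owner->deleter) owner->deleter(owner->managed); /* exactly once */
    Py_XDECREF(owner->producer);
    PyMem_Free(owner);
    PyGILState_Release(gil);
}

/* After stealing the payload, the capsule is marked consumed per spec: */
static void nk_capsule_mark_used_(PyObject *capsule, int versioned) {
    PyCapsule_SetName(capsule, versioned ? "used_dltensor_versioned"
                                         : "used_dltensor");
}
```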
The exchanged ABI is declared inline (six structs, twelve device codes,
version.major checked at runtime) rather than vendoring dmlc/dlpack.h,
mirroring NumPy's approach in numpy/_core/src/multiarray/dlpack.c.
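Abbreviated, the six structs look like this (layout per the DLPack spec):

```c
#include <stdint.h>

typedef struct { uint32_t major, minor; } DLPackVersion;
typedef struct { int32_t device_type; int32_t device_id; } DLDevice;
typedef struct { uint8_t code; uint8_t bits; uint16_t lanes; } DLDataType;

typedef struct {
    void *data;
    DLDevice device;
    int32_t ndim;
    DLDataType dtype;
    int64_t *shape;
    int64_t *strides;       /* NULL means compact row-major */
    uint64_t byte_offset;
} DLTensor;

typedef struct DLManagedTensor { /* legacy v0 "dltensor" payload */
    DLTensor dl_tensor;
    void *manager_ctx;
    void (*deleter)(struct DLManagedTensor *);
} DLManagedTensor;

typedef struct DLManagedTensorVersioned { /* v1 "dltensor_versioned" payload */
    DLPackVersion version;  /* version.major checked at runtime */
    void *manager_ctx;
    void (*deleter)(struct DLManagedTensorVersioned *);
    uint64_t flags;         /* carries IS_SUBBYTE_TYPE_PADDED for FP6 */
    DLTensor dl_tensor;
} DLManagedTensorVersioned;
```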
Per-framework export + import matrices cover every native dtype,
including bf16 + fp8_e4m3fn + fp8_e5m2 on PyTorch and bf16 on JAX /
TensorFlow / MLX. Three CuPy tests enforce the device-acceptance
contract on real hardware (H100): cudaMalloc rejects, cudaMallocManaged
accepts, cudaMallocHost accepts. With torch 2.11 / numpy 2.4 / jax 0.10
/ tensorflow-cpu 2.21 / pyarrow 23 / cupy-cuda12x 13.6 / onnxruntime
1.24 on Python 3.12: 127 passed, 1 skipped (ONNX RT inference build
lacks to_dlpack per microsoft/onnxruntime#23110).
Apply `#pragma GCC optimize("no-tree-vectorize","no-tree-slp-vectorize",
"no-ipa-cp-clone","no-inline")` (Clang: `noinline` attribute push) around
the serial kernel instantiation regions in dots, mesh, spatials, cast, each,
spatial, geospatial, reduce, maxsim, dot, and sparse `*/serial.h` files.
Without this, -O3 + LTO cross-TU codegen cloned serial kernels into
AVX-512 bodies under their dispatch callers, bloating the binary and
violating the nk_*_serial-as-scalar-oracle contract that tests and the
numerical-stability docs rely on.
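A sketch of how such a region is fenced (the guarded function is illustrative, not an actual NumKong kernel):

```c
#include <stddef.h>

#if defined(__GNUC__) && !defined(__clang__)
#pragma GCC push_options
#pragma GCC optimize("no-tree-vectorize", "no-tree-slp-vectorize", "no-ipa-cp-clone", "no-inline")
#elif defined(__clang__)
#pragma clang attribute push(__attribute__((noinline)), apply_to = function)
#endif

/* Serial kernels defined here stay scalar even under -O3 + LTO,
 * preserving their role as the numerical oracle. */
static float nk_dot_f32_serial_(float const *a, float const *b, size_t n) {
    float sum = 0.0f;
    for (size_t i = 0; i != n; ++i) sum += a[i] * b[i];
    return sum;
}

#if defined(__GNUC__) && !defined(__clang__)
#pragma GCC pop_options
#elif defined(__clang__)
#pragma clang attribute pop
#endif
```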
Also collapse the `_aligned_` / `_1x8_aligned_` fast-path helpers inside
`nk_define_cross_packed_` / `nk_define_cross_compensated_packed_` and the
`_galloping_search_` / `_linear_scan_` helpers inside
`nk_define_sparse_intersect_` from NK_PUBLIC to NK_INTERNAL so they inline
into their dispatchers -- this removes the structural path divergence
itself, not just the compiler's cloning policy.
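The macro semantics assumed here (an illustration; the exact NumKong definitions may differ):

```c
/* Assumed shape of the two linkage macros -- not the actual definitions. */
#if defined(__GNUC__) || defined(__clang__)
#define NK_PUBLIC __attribute__((visibility("default"))) /* exported symbol */
#else
#define NK_PUBLIC
#endif
#define NK_INTERNAL static inline /* TU-local, free to fold into callers */

/* Before: NK_PUBLIC helpers forced a call through an exported symbol. */
/* After:  NK_INTERNAL helpers inline into their sole dispatcher,      */
/*         leaving a single structural code path.                      */
```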
Portable libnumkong.so: 4.51 MB → 3.56 MB (−930 KB, −21%). Side effect:
graniteamx kernels that LTO was silently dead-code-eliminating now
survive to link (+38 KB), fixing a latent perf regression on Granite Rapids.