fix(gpu-libs): bundle hipBLASLt TensileLibrary data so ROCm backends stop falling back (#10660) #10672

Merged
mudler merged 1 commit into master from worktree-fix+hipblaslt-tensile-data-10660
Jul 4, 2026

@localai-bot

Collaborator

Description

Fixes #10660.

The ROCm packager bundled rocBLAS kernel data (rocblas/library/*.dat) into each backend's lib/ dir and run.sh pointed ROCBLAS_TENSILE_LIBPATH at it, but the parallel hipBLASLt data dir (hipblaslt/library/TensileLibrary_lazy_gfx*.dat) was never packaged and no HIPBLASLT_TENSILE_LIBPATH was set.

Because backends ship their own libhipblaslt.so under lib/ (prioritized via LD_LIBRARY_PATH), that lib resolved its per-arch kernel data relative to itself, found nothing, and silently fell back to slow generic kernels, logging:

rocblaslt error: Cannot read "TensileLibrary_lazy_gfx1201.dat": No such file or directory
rocblaslt error: Could not load "TensileLibrary_lazy_gfx1201.dat"

The reporter's attached log confirms the asymmetry: ROCBLAS_TENSILE_LIBPATH=/backends/rocm-llama-cpp/lib/rocblas/library is set and working, while hipBLASLt has no bundled data and no env var. This affected every gfx arch, not just gfx1201.
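The asymmetry can be checked on an installed backend by looking for kernel data under both library dirs. A minimal sketch, assuming the `lib/<name>/library` layout described above; `check_tensile_data` is a hypothetical helper and is not part of this PR:

```shell
#!/usr/bin/env bash
# Hypothetical diagnostic (not part of the PR): report whether a backend's
# lib/ dir carries kernel data for rocBLAS and for hipBLASLt.
check_tensile_data() {
    local lib_dir="$1" name dir count status=0
    for name in rocblas hipblaslt; do
        dir="$lib_dir/$name/library"
        # Count .dat files; a missing dir simply yields a count of 0.
        count=$(find "$dir" -maxdepth 1 -name '*.dat' 2>/dev/null | wc -l)
        count=$((count))
        if [ "$count" -gt 0 ]; then
            echo "$name: $count .dat file(s) in $dir"
        else
            echo "$name: no kernel data in $dir"
            status=1
        fi
    done
    return "$status"
}
```

On an affected install, `check_tensile_data /backends/rocm-llama-cpp/lib` would be expected to report data for rocblas, none for hipblaslt, and return non-zero.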

Fix

Mirrors the existing rocBLAS handling:

  • scripts/build/package-gpu-libs.sh — extract the rocblas data-dir copy into a reusable copy_rocm_data_dir helper and call it for both rocblas and hipblaslt. (Also keeps the deliberate single-line local x=$(shopt -p nullglob) idiom to avoid tripping set -e when nullglob is unset.)
  • backend/cpp/llama-cpp/run.sh + backend/cpp/turboquant/run.sh — export HIPBLASLT_TENSILE_LIBPATH when the bundled hipblaslt/library dir exists (the only two backends with the rocBLAS pattern).
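The two changes can be sketched roughly as follows. The helper name `copy_rocm_data_dir` and the env var names come from the description above; the signature, the body, and the `export_tensile_paths` wrapper are assumptions for illustration and may differ from the merged code:

```shell
#!/usr/bin/env bash
# Sketch only: signature and body are assumptions, not the merged code.
copy_rocm_data_dir() {
    local name="$1" base_dir="$2" target_lib_dir="$3"
    local src="$base_dir/$name/library"
    [ -d "$src" ] || return 0          # nothing to bundle for this library

    # `shopt -p nullglob` exits non-zero when nullglob is unset. Keeping it
    # on the same line as `local` makes the statement's status that of
    # `local` (0), so `set -e` is not tripped.
    local saved_nullglob=$(shopt -p nullglob)
    shopt -s nullglob
    local files=("$src"/*)
    eval "$saved_nullglob"             # restore the caller's setting

    [ "${#files[@]}" -gt 0 ] || return 0
    mkdir -p "$target_lib_dir/$name/library"
    cp -a "${files[@]}" "$target_lib_dir/$name/library/"
}

# run.sh side: export each env var only when its bundled dir exists.
# export_tensile_paths is a hypothetical wrapper used for illustration.
export_tensile_paths() {
    local lib_dir="$1"
    if [ -d "$lib_dir/rocblas/library" ]; then
        export ROCBLAS_TENSILE_LIBPATH="$lib_dir/rocblas/library"
    fi
    if [ -d "$lib_dir/hipblaslt/library" ]; then
        export HIPBLASLT_TENSILE_LIBPATH="$lib_dir/hipblaslt/library"
    fi
}
```

Guarding each export on the directory's existence means a backend built without hipBLASLt data keeps the library's default lookup instead of pointing at a path that does not exist.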

Tests

New regression test scripts/build/package-gpu-libs-rocm-data_test.sh runs package_rocm_libs against a fabricated ROCm tree (via a new ROCM_BASE_DIRS override) and asserts both rocblas/ and hipblaslt/ data dirs are bundled. Developed TDD: it fails on the pre-fix code (rocblas bundled, hipblaslt not) and passes after. The existing package-gpu-libs_test.sh (#10537) still passes.
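The shape of the regression test is roughly the following. This is a self-contained sketch, not the test file itself: the `package_rocm_libs` below is a stand-in so the example runs on its own, and the way it reads `ROCM_BASE_DIRS` and `TARGET_LIB_DIR` is an assumption.

```shell
#!/usr/bin/env bash
# Stand-in for the real packager so this sketch runs on its own. It only
# copies the two data dirs; the real package_rocm_libs does more.
package_rocm_libs() {
    local base name
    for base in $ROCM_BASE_DIRS; do
        for name in rocblas hipblaslt; do
            if [ -d "$base/lib/$name/library" ]; then
                mkdir -p "$TARGET_LIB_DIR/$name"
                cp -R "$base/lib/$name/library" "$TARGET_LIB_DIR/$name/"
            fi
        done
    done
}

test_rocm_data_dirs_bundled() {
    local tmp name failures=0
    tmp=$(mktemp -d)
    # Fabricate a minimal ROCm tree with one .dat file per library.
    for name in rocblas hipblaslt; do
        mkdir -p "$tmp/rocm/lib/$name/library"
        : > "$tmp/rocm/lib/$name/library/TensileLibrary_lazy_gfx1201.dat"
    done
    # Point the packager at the fake tree instead of a real ROCm install.
    local ROCM_BASE_DIRS="$tmp/rocm" TARGET_LIB_DIR="$tmp/out"
    package_rocm_libs
    # Both data dirs must be present in the bundle.
    for name in rocblas hipblaslt; do
        if [ -f "$tmp/out/$name/library/TensileLibrary_lazy_gfx1201.dat" ]; then
            echo "PASS: $name data dir bundled"
        else
            echo "FAIL: $name data dir missing"
            failures=$((failures + 1))
        fi
    done
    rm -rf "$tmp"
    return "$failures"
}
```

Dropping `hipblaslt` from the stand-in's inner loop reproduces the pre-fix failure mode: the rocblas assertion passes and the hipblaslt one fails.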

Caveat

This bundles whatever gfx*.dat the build image's ROCm actually provides. gfx1201/RDNA4 tensile data landed in ROCm 6.4 — if the shipped ROCm predates it, that specific arch would still need a ROCm bump. The packaging gap itself is fixed for every supported arch. Verified at the unit level; not driven end-to-end on gfx1201 hardware.
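To see which archs a given build image actually provides, the bundled dir can be listed. A small sketch, assuming the `TensileLibrary_lazy_gfx*.dat` naming shown in the error log; both function names are hypothetical:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the gfx targets that have a
# TensileLibrary_lazy_<arch>.dat file in the given library dir.
list_tensile_archs() {
    local dir="$1" f arch
    for f in "$dir"/TensileLibrary_lazy_gfx*.dat; do
        [ -e "$f" ] || continue        # glob matched nothing
        arch=${f##*/TensileLibrary_lazy_}
        echo "${arch%.dat}"
    done
}

# Succeeds only if data for the given arch is present in the dir.
has_tensile_arch() {
    list_tensile_archs "$1" | grep -qx "$2"
}
```

For example, `has_tensile_arch "$lib_dir/hipblaslt/library" gfx1201` returning non-zero after this fix would point at the ROCm version in the build image, not at the packaging.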

Notes

Assisted-by: Claude:claude-opus-4-8 [Claude Code]

…stop falling back (#10660)

The ROCm packager copied rocBLAS kernel data (rocblas/library/*.dat) into the
bundled lib/ dir and run.sh pointed ROCBLAS_TENSILE_LIBPATH at it, but the
parallel hipBLASLt data dir (hipblaslt/library/TensileLibrary_lazy_gfx*.dat)
was never packaged and no HIPBLASLT_TENSILE_LIBPATH was set. The bundled
libhipblaslt.so therefore resolved its per-arch kernel data relative to itself,
found nothing, and silently fell back to slow generic kernels, logging:

    rocblaslt error: Cannot read "TensileLibrary_lazy_gfx1201.dat": No such file or directory
    rocblaslt error: Could not load "TensileLibrary_lazy_gfx1201.dat"

Fix, mirroring the existing rocBLAS handling:
- package-gpu-libs.sh: extract the rocblas data-dir copy into a reusable
  copy_rocm_data_dir helper and call it for both rocblas and hipblaslt.
- llama-cpp/turboquant run.sh: export HIPBLASLT_TENSILE_LIBPATH when the
  bundled hipblaslt/library dir exists.

The helper takes an optional ROCM_BASE_DIRS override so the copy is unit
testable without a real ROCm install; add a regression test that runs
package_rocm_libs against a fabricated ROCm tree and asserts both data dirs
are bundled.

Note: this bundles whatever gfx*.dat the build image's ROCm provides. If a
given arch's tensile data is absent from the shipped ROCm, that arch still
needs a ROCm bump; the packaging gap itself is fixed for every supported arch.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
@mudler mudler added the bug Something isn't working label Jul 3, 2026
@mudler mudler merged commit 348f3c8 into master Jul 4, 2026
80 checks passed
@mudler mudler deleted the worktree-fix+hipblaslt-tensile-data-10660 branch July 4, 2026 06:14
l0caldadmin added a commit to l0caldadmin/LocalAI that referenced this pull request Jul 4, 2026
* fix(gpu-libs): bundle hipBLASLt TensileLibrary data so ROCm backends stop falling back (mudler#10660) (mudler#10672)

Assisted-by: Claude:claude-opus-4-8 [Claude Code]

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>

* chore: ⬆️ Update ggml-org/llama.cpp to `d4cff114c0084f1fbc9b4c62717eca8fb2ae494a` (mudler#10671)

:arrow_up: Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>

* chore: :arrow_up: Update CrispStrobe/CrispASR to `f35185b876fc482fcb2053a81a2697936ed5fcc0` (mudler#10670)

:arrow_up: Update CrispStrobe/CrispASR

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>

* fix(backends): enable ROCm/HIP GPU offload for ggml audio backends (mudler#10666) (mudler#10667)

qwen3-tts-cpp, omnivoice-cpp, acestep-cpp and vibevoice-cpp shipped
rocm-* variants that silently ran on CPU ([Load] backend: CPU). Two
coupled defects:

- The Makefiles passed -DGGML_HIPBLAS=ON, but the vendored ggml only
  understands -DGGML_HIP=ON (GGML_HIPBLAS was removed upstream), so the
  ggml-hip backend target was never created and no GPU code was built.
- The CMake foreach that links the ggml GPU backends into the module
  listed blas/cuda/metal/vulkan but not hip, so even a built ggml-hip
  would not have been linked and its static backend registration would
  never run.

CUDA users were unaffected because cublas passes the correct GGML_CUDA=ON
and the foreach already links cuda. Mirror the proven llama-cpp hipblas
block (ROCm clang CC/CXX + AMDGPU_TARGETS) and add hip to each foreach.
Upstream picks the best device via ggml_backend_init_best(), so no
runtime flag is needed once HIP is compiled and linked.


Assisted-by: Claude:claude-opus-4-8[1m] [Claude Code]

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: LocalAI [bot] <139863280+localai-bot@users.noreply.github.com>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
Labels

bug Something isn't working

Development

Successfully merging this pull request may close these issues.

ROCM Llama CPP backend missing file
