fix(gpu-libs): bundle hipBLASLt TensileLibrary data so ROCm backends stop falling back (#10660) #10672
Merged
…stop falling back (#10660)

The ROCm packager copied rocBLAS kernel data (rocblas/library/*.dat) into the bundled lib/ dir and run.sh pointed ROCBLAS_TENSILE_LIBPATH at it, but the parallel hipBLASLt data dir (hipblaslt/library/TensileLibrary_lazy_gfx*.dat) was never packaged and no HIPBLASLT_TENSILE_LIBPATH was set. The bundled libhipblaslt.so therefore resolved its per-arch kernel data relative to itself, found nothing, and silently fell back to slow generic kernels, logging:

  rocblaslt error: Cannot read "TensileLibrary_lazy_gfx1201.dat": No such file or directory
  rocblaslt error: Could not load "TensileLibrary_lazy_gfx1201.dat"

Fix, mirroring the existing rocBLAS handling:
- package-gpu-libs.sh: extract the rocblas data-dir copy into a reusable copy_rocm_data_dir helper and call it for both rocblas and hipblaslt.
- llama-cpp/turboquant run.sh: export HIPBLASLT_TENSILE_LIBPATH when the bundled hipblaslt/library dir exists.

The helper takes an optional ROCM_BASE_DIRS override so the copy is unit testable without a real ROCm install; add a regression test that runs package_rocm_libs against a fabricated ROCm tree and asserts both data dirs are bundled.

Note: this bundles whatever gfx*.dat the build image's ROCm provides. If a given arch's tensile data is absent from the shipped ROCm, that arch still needs a ROCm bump; the packaging gap itself is fixed for every supported arch.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
mudler approved these changes Jul 3, 2026
l0caldadmin added a commit to l0caldadmin/LocalAI that referenced this pull request Jul 4, 2026
* fix(gpu-libs): bundle hipBLASLt TensileLibrary data so ROCm backends stop falling back (mudler#10660) (mudler#10672)

  The ROCm packager copied rocBLAS kernel data (rocblas/library/*.dat) into the bundled lib/ dir and run.sh pointed ROCBLAS_TENSILE_LIBPATH at it, but the parallel hipBLASLt data dir (hipblaslt/library/TensileLibrary_lazy_gfx*.dat) was never packaged and no HIPBLASLT_TENSILE_LIBPATH was set. The bundled libhipblaslt.so therefore resolved its per-arch kernel data relative to itself, found nothing, and silently fell back to slow generic kernels, logging:

    rocblaslt error: Cannot read "TensileLibrary_lazy_gfx1201.dat": No such file or directory
    rocblaslt error: Could not load "TensileLibrary_lazy_gfx1201.dat"

  Fix, mirroring the existing rocBLAS handling:
  - package-gpu-libs.sh: extract the rocblas data-dir copy into a reusable copy_rocm_data_dir helper and call it for both rocblas and hipblaslt.
  - llama-cpp/turboquant run.sh: export HIPBLASLT_TENSILE_LIBPATH when the bundled hipblaslt/library dir exists.

  The helper takes an optional ROCM_BASE_DIRS override so the copy is unit testable without a real ROCm install; add a regression test that runs package_rocm_libs against a fabricated ROCm tree and asserts both data dirs are bundled.

  Note: this bundles whatever gfx*.dat the build image's ROCm provides. If a given arch's tensile data is absent from the shipped ROCm, that arch still needs a ROCm bump; the packaging gap itself is fixed for every supported arch.

  Assisted-by: Claude:claude-opus-4-8 [Claude Code]
  Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
  Co-authored-by: Ettore Di Giacinto <mudler@localai.io>

* chore: ⬆️ Update ggml-org/llama.cpp to `d4cff114c0084f1fbc9b4c62717eca8fb2ae494a` (mudler#10671)

  :arrow_up: Update ggml-org/llama.cpp

  Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
  Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>

* chore: :arrow_up: Update CrispStrobe/CrispASR to `f35185b876fc482fcb2053a81a2697936ed5fcc0` (mudler#10670)

  :arrow_up: Update CrispStrobe/CrispASR

  Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
  Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>

* fix(backends): enable ROCm/HIP GPU offload for ggml audio backends (mudler#10666) (mudler#10667)

  qwen3-tts-cpp, omnivoice-cpp, acestep-cpp and vibevoice-cpp shipped rocm-* variants that silently ran on CPU ([Load] backend: CPU). Two coupled defects:
  - The Makefiles passed -DGGML_HIPBLAS=ON, but the vendored ggml only understands -DGGML_HIP=ON (GGML_HIPBLAS was removed upstream), so the ggml-hip backend target was never created and no GPU code was built.
  - The CMake foreach that links the ggml GPU backends into the module listed blas/cuda/metal/vulkan but not hip, so even a built ggml-hip would not have been linked and its static backend registration would never run.

  CUDA users were unaffected because cublas passes the correct GGML_CUDA=ON and the foreach already links cuda.

  Mirror the proven llama-cpp hipblas block (ROCm clang CC/CXX + AMDGPU_TARGETS) and add hip to each foreach. Upstream picks the best device via ggml_backend_init_best(), so no runtime flag is needed once HIP is compiled and linked.

  Assisted-by: Claude:claude-opus-4-8[1m] [Claude Code]
  Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
  Co-authored-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: LocalAI [bot] <139863280+localai-bot@users.noreply.github.com>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
Description
Fixes #10660.
The ROCm packager bundled rocBLAS kernel data (`rocblas/library/*.dat`) into each backend's `lib/` dir and `run.sh` pointed `ROCBLAS_TENSILE_LIBPATH` at it, but the parallel hipBLASLt data dir (`hipblaslt/library/TensileLibrary_lazy_gfx*.dat`) was never packaged and no `HIPBLASLT_TENSILE_LIBPATH` was set.

Because backends ship their own `libhipblaslt.so` under `lib/` (prioritized via `LD_LIBRARY_PATH`), that lib resolved its per-arch kernel data relative to itself, found nothing, and silently fell back to slow generic kernels, logging `rocblaslt error: Cannot read "TensileLibrary_lazy_gfx1201.dat": No such file or directory` followed by `rocblaslt error: Could not load "TensileLibrary_lazy_gfx1201.dat"`.

The reporter's attached log confirms the asymmetry: `ROCBLAS_TENSILE_LIBPATH=/backends/rocm-llama-cpp/lib/rocblas/library` is set and working, while hipBLASLt has no bundled data and no env var. This affected every gfx arch, not just gfx1201.

Fix
Mirrors the existing rocBLAS handling:
- `scripts/build/package-gpu-libs.sh`: extract the rocblas data-dir copy into a reusable `copy_rocm_data_dir` helper and call it for both `rocblas` and `hipblaslt`. (Also keeps the deliberate single-line `local x=$(shopt -p nullglob)` idiom to avoid tripping `set -e` when nullglob is unset.)
- `backend/cpp/llama-cpp/run.sh` + `backend/cpp/turboquant/run.sh`: export `HIPBLASLT_TENSILE_LIBPATH` when the bundled `hipblaslt/library` dir exists (the only two backends with the rocBLAS pattern).

Tests
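As a rough illustration of the helper and the `ROCM_BASE_DIRS` override described under Fix, the sketch below copies per-library `.dat` files out of a fabricated ROCm tree. It is not the code from `package-gpu-libs.sh`: the function body, the `<base>/lib/<name>/library` source layout, the `/opt/rocm` default, and the space-separated format of `ROCM_BASE_DIRS` are all assumptions made for the example.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical sketch: copy <base>/lib/<name>/library/*.dat to <dest>/<name>/library.
# ROCM_BASE_DIRS is assumed to be a space-separated list of candidate ROCm roots.
copy_rocm_data_dir() {
    local name="$1" dest="$2" base src f
    # `shopt -p nullglob` exits non-zero when nullglob is unset. Declaring and
    # assigning in one `local` statement masks that status, so `set -e` does not fire.
    local restore=$(shopt -p nullglob)
    shopt -s nullglob
    for base in ${ROCM_BASE_DIRS:-/opt/rocm}; do
        src="$base/lib/$name/library"
        [ -d "$src" ] || continue
        mkdir -p "$dest/$name/library"
        for f in "$src"/*.dat; do
            cp "$f" "$dest/$name/library/"
        done
        break   # the first base dir that has the data wins
    done
    eval "$restore"   # put nullglob back the way the caller had it
}

# Fabricated ROCm tree: no real ROCm install is needed.
tree=$(mktemp -d)
out=$(mktemp -d)
trap 'rm -rf "$tree" "$out"' EXIT
mkdir -p "$tree/lib/rocblas/library" "$tree/lib/hipblaslt/library"
touch "$tree/lib/rocblas/library/TensileLibrary_lazy_gfx1201.dat"
touch "$tree/lib/hipblaslt/library/TensileLibrary_lazy_gfx1201.dat"

ROCM_BASE_DIRS="$tree"
for name in rocblas hipblaslt; do
    copy_rocm_data_dir "$name" "$out"
done

# List what was bundled.
find "$out" -name '*.dat' | sort
```

Dropping `hipblaslt` from the loop reproduces the pre-fix state in this sketch: only the rocBLAS data ends up in the output directory.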
New regression test `scripts/build/package-gpu-libs-rocm-data_test.sh` runs `package_rocm_libs` against a fabricated ROCm tree (via a new `ROCM_BASE_DIRS` override) and asserts both `rocblas/` and `hipblaslt/` data dirs are bundled. Developed TDD: it fails on the pre-fix code (rocblas bundled, hipblaslt not) and passes after. The existing `package-gpu-libs_test.sh` (#10537) still passes.

Caveat
This bundles whatever `gfx*.dat` the build image's ROCm actually provides. gfx1201/RDNA4 tensile data landed in ROCm 6.4; if the shipped ROCm predates it, that specific arch would still need a ROCm bump. The packaging gap itself is fixed for every supported arch. Verified at the unit level; not driven end-to-end on gfx1201 hardware.

Notes
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
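For readers who want to picture the `run.sh` side of the fix and the caveat about missing per-arch data, here is a minimal sketch. The function names `set_tensile_paths` and `has_hipblaslt_arch` are hypothetical and the directory layout is assumed from the paths quoted in the description; the actual `run.sh` scripts may be structured differently.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the run.sh guard: export each Tensile path only when
# the bundled data directory actually exists under the backend's lib/ dir.
set_tensile_paths() {
    lib_dir="$1"
    if [ -d "$lib_dir/rocblas/library" ]; then
        export ROCBLAS_TENSILE_LIBPATH="$lib_dir/rocblas/library"
    fi
    if [ -d "$lib_dir/hipblaslt/library" ]; then
        export HIPBLASLT_TENSILE_LIBPATH="$lib_dir/hipblaslt/library"
    fi
}

# Hypothetical helper for the caveat: is per-arch hipBLASLt data bundled
# for a given gfx target (for example gfx1201)?
has_hipblaslt_arch() {
    [ -f "$1/hipblaslt/library/TensileLibrary_lazy_$2.dat" ]
}

# Demo against a fabricated lib/ dir that bundles hipBLASLt data for gfx1100 only.
demo=$(mktemp -d)
mkdir -p "$demo/hipblaslt/library"
touch "$demo/hipblaslt/library/TensileLibrary_lazy_gfx1100.dat"

unset ROCBLAS_TENSILE_LIBPATH HIPBLASLT_TENSILE_LIBPATH
set_tensile_paths "$demo"
echo "HIPBLASLT_TENSILE_LIBPATH=${HIPBLASLT_TENSILE_LIBPATH:-<unset>}"

if has_hipblaslt_arch "$demo" gfx1201; then
    echo "gfx1201 data bundled"
else
    echo "gfx1201 data missing: the shipped ROCm would need a bump"
fi
```

Guarding each export on the directory's existence keeps backends without bundled data unchanged, which matches the conditional export described for the two affected backends.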