This repository was archived by the owner on Jan 28, 2026. It is now read-only.

SYCL multi-GPU inference fails with UR_RESULT_ERROR_OUT_OF_DEVICE_MEMORY on Intel Arc Pro (single GPU works) #13335

@aaricantto

Description


Summary

Multi-GPU inference using the SYCL / oneAPI backend fails with an out-of-device-memory error during a SYCL memcpy().wait() call, even though sufficient VRAM is available on each GPU. The same model and configuration work reliably on a single Intel GPU.

This appears to be a multi-GPU SYCL / Level Zero pipeline or cross-device copy issue, not a real VRAM exhaustion problem.

Environment

  • OS: Windows 11
  • Ollama build: ollama-ipex-llm 2.3.0b20250725 (Windows portable ZIP)
  • Ollama version: 0.9.3
  • Backend: SYCL / oneAPI (ggml-sycl)
  • oneAPI / Level Zero: oneAPI 2024.2 (bundled with build)
  • Model: nemotron-mini (GGUF, Q4_K, ~2.5 GiB)
  • Context size: 4096 (also reproduced with 2048)
  • Batch size: 512 (also reproduced with smaller values)
  • Parallel sequences: 1
  • KV cache: f16

GPUs

  • 2× Intel Arc Pro B60 (24 GiB VRAM each)
  • NVIDIA GPU also present in system, but Intel SYCL backend is explicitly selected

Reproduction Steps

Works (single GPU)

set ONEAPI_DEVICE_SELECTOR=level_zero:0
start-ollama.bat

ollama run nemotron-mini:latest

  • Model loads and runs correctly
  • Stable inference, high tokens/sec

Fails (multi-GPU)

set ONEAPI_DEVICE_SELECTOR=level_zero:0;level_zero:1
start-ollama.bat

ollama run nemotron-mini:latest

  • Model loads successfully
  • Fails on the first inference request
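As a sanity check before the failing run (my addition, not part of the original report), oneAPI's `sycl-ls` tool can confirm that both Arc GPUs are actually enumerated under the Level Zero backend that the selector names:

```shell
:: Assumes the oneAPI environment is initialized (setvars.bat on Windows).
:: sycl-ls lists every device visible to SYCL; both B60s should appear
:: with [level_zero:gpu] entries at indices 0 and 1.
sycl-ls

:: Then restrict the selector to the two Intel GPUs and launch as before:
set ONEAPI_DEVICE_SELECTOR=level_zero:0;level_zero:1
start-ollama.bat
```

If `sycl-ls` shows only one Level Zero GPU, the failure would point at device enumeration rather than the cross-device copy path.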

Observed Error

Native API failed. Native API returns: 39 (UR_RESULT_ERROR_OUT_OF_DEVICE_MEMORY)

Exception caught at:
ggml-sycl.cpp:4602
func: operator()

SYCL error:
CHECK_TRY_ERROR(
  (stream)->memcpy(data, (const char *)tensor->data + offset, size).wait()
)

in function:
ggml_backend_sycl_get_tensor_async

common.hpp:115: SYCL error

ERROR source=server.go:827 msg="post predict"
ERROR source=server.go:484 msg="llama runner terminated"
exit status 0xc0000409

Key Observations

This is not a real VRAM exhaustion issue

  • Each GPU has ~22 GiB free at runtime
  • The model uses <3 GiB for weights plus ~512 MiB for the KV cache
  • The failure happens after a successful model load, during inference

The error is triggered inside a SYCL memcpy + wait, suggesting:
  • cross-device tensor movement,
  • pipeline parallelism, or
  • Level Zero memory management issues

With two GPUs visible, logs show:
  • pipeline parallelism enabled
  • multiple graph splits

With one GPU:
  • no pipeline parallelism
  • stable execution
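One mitigation worth trying (my suggestion, not from the original report): the upstream llama.cpp SYCL documentation recommends setting `ZES_ENABLE_SYSMAN=1` so the Level Zero backend can query per-device free memory correctly in multi-GPU setups. Whether the ollama-ipex-llm build honors this is unverified:

```shell
:: Hypothetical workaround, untested on this build.
:: ZES_ENABLE_SYSMAN=1 enables Level Zero Sysman, which ggml-sycl uses to
:: report free device memory (per the llama.cpp SYCL backend docs).
set ZES_ENABLE_SYSMAN=1
set ONEAPI_DEVICE_SELECTOR=level_zero:0;level_zero:1
start-ollama.bat
```

If the failure persists with Sysman enabled, that would strengthen the case that this is a cross-device copy or pipeline-parallelism bug rather than a memory-accounting one.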

Any help would be appreciated.
