This repository was archived by the owner on Jan 28, 2026. It is now read-only.

SYCL multi-GPU inference fails with UR_RESULT_ERROR_OUT_OF_DEVICE_MEMORY on Intel Arc Pro (single GPU works) #13335

@aaricantto

Description


Summary

Multi-GPU inference using the SYCL / oneAPI backend fails with an out-of-device-memory error during a SYCL memcpy().wait() call, even though sufficient VRAM is available on each GPU. The same model and configuration work reliably on a single Intel GPU.

This appears to be a multi-GPU SYCL / Level Zero pipeline or cross-device copy issue, not a real VRAM exhaustion problem.

Environment

  • OS: Windows 11
  • Ollama build: ollama-ipex-llm 2.3.0b20250725 (Windows portable ZIP)
  • Ollama version: 0.9.3
  • Backend: SYCL / oneAPI (ggml-sycl)
  • oneAPI / Level Zero: oneAPI 2024.2 (bundled with build)
  • Model: nemotron-mini (GGUF, Q4_K, ~2.5 GiB)
  • Context size: 4096 (also reproduced with 2048)
  • Batch size: 512 (also reproduced with smaller values)
  • Parallel sequences: 1
  • KV cache: f16

GPUs

  • 2× Intel Arc Pro B60 (24 GiB VRAM each)
  • NVIDIA GPU also present in system, but Intel SYCL backend is explicitly selected

Reproduction Steps

Works (single GPU)

set ONEAPI_DEVICE_SELECTOR=level_zero:0
start-ollama.bat

ollama run nemotron-mini:latest

  • Model loads and runs correctly
  • Stable inference, high tokens/sec

Fails (multi-GPU)

set ONEAPI_DEVICE_SELECTOR=level_zero:0;level_zero:1
start-ollama.bat

ollama run nemotron-mini:latest

  • Model loads successfully
  • Fails on the first inference request
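As a sanity check before the failing run (my addition, not part of the original report), oneAPI's `sycl-ls` tool can confirm that both Arc GPUs are actually enumerated under the Level Zero backend that the selector names:

```shell
:: Assumes the oneAPI environment is initialized (setvars.bat on Windows).
:: sycl-ls lists every device visible to SYCL; both B60s should appear
:: with [level_zero:gpu] entries at indices 0 and 1.
sycl-ls

:: Then restrict the selector to the two Intel GPUs and launch as before:
set ONEAPI_DEVICE_SELECTOR=level_zero:0;level_zero:1
start-ollama.bat
```

If `sycl-ls` shows only one Level Zero GPU, the failure would point at device enumeration rather than the cross-device copy path.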

Observed Error

Native API failed. Native API returns: 39 (UR_RESULT_ERROR_OUT_OF_DEVICE_MEMORY)

Exception caught at:
ggml-sycl.cpp:4602
func: operator()

SYCL error:
CHECK_TRY_ERROR(
  (stream)->memcpy(data, (const char *)tensor->data + offset, size).wait()
)

in function:
ggml_backend_sycl_get_tensor_async

common.hpp:115: SYCL error

ERROR source=server.go:827 msg="post predict"
ERROR source=server.go:484 msg="llama runner terminated"
exit status 0xc0000409

Key Observations

This is not a real VRAM exhaustion issue

  • Each GPU has ~22 GiB free at runtime
  • The model uses <3 GiB for weights plus ~512 MiB for the KV cache
  • The failure happens after a successful model load, during inference

The error is triggered inside a SYCL memcpy + wait, suggesting:
  • cross-device tensor movement,
  • pipeline parallelism, or
  • Level Zero memory management issues

With two GPUs visible, logs show:
  • pipeline parallelism enabled
  • multiple graph splits

With one GPU:
  • no pipeline parallelism
  • stable execution
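One mitigation worth trying (my suggestion, not from the original report): the upstream llama.cpp SYCL documentation recommends setting `ZES_ENABLE_SYSMAN=1` so the Level Zero backend can query per-device free memory correctly in multi-GPU setups. Whether the ollama-ipex-llm build honors this is unverified:

```shell
:: Hypothetical workaround, untested on this build.
:: ZES_ENABLE_SYSMAN=1 enables Level Zero Sysman, which ggml-sycl uses to
:: report free device memory (per the llama.cpp SYCL backend docs).
set ZES_ENABLE_SYSMAN=1
set ONEAPI_DEVICE_SELECTOR=level_zero:0;level_zero:1
start-ollama.bat
```

If the failure persists with Sysman enabled, that would strengthen the case that this is a cross-device copy or pipeline-parallelism bug rather than a memory-accounting one.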

Any help would be appreciated.
