Description
Environment
- Hardware: Intel Arc 140V (Lunar Lake iGPU, unified LPDDR5X memory)
- OS: Windows 11
- torch version: 2.9.0+xpu
- Platform: lightx2v_platform with AI_DEVICE = "xpu"
Problem
When running block/phase offload on Intel XPU, WeightAsyncStreamManager uses the
same stream priority configuration as CUDA, which causes three distinct failures:
1. HIGH-priority stream crashes on compute kernels
The original code assigns compute_stream = Stream(priority=-1) (HIGH priority).
On Intel Arc XPU, HIGH-priority streams are only safe for copy_() operations.
Using them for heavy compute kernels (oneDNN matmul, attention) causes a hard crash:
RuntimeError: [oneDNN] ... stream priority conflict / illegal use of high-priority stream
2. priority=+1 (LOW) is not supported
torch.xpu.Stream(priority=1) raises an error on Arc 140V — only -1 (HIGH) and
0 (DEFAULT) are accepted.
3. Cross-stream memory visibility requires device-wide sync
After cuda_load_stream.synchronize() on XPU, tensors written in that stream are
not guaranteed to be visible to the compute stream. A per-stream sync is insufficient;
torch.xpu.synchronize() (device-wide) is required to ensure correct H2D prefetch
visibility before compute.
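A possible workaround for the visibility problem in item 3 is sketched below. It assumes the load stream exposes `synchronize()` and that `torch.xpu.synchronize()` is an acceptable device-wide barrier. The helper name `wait_for_prefetch` and its parameters are hypothetical and not part of LightX2V.

```python
from typing import Callable, Optional


def wait_for_prefetch(load_stream, device_type: str,
                      device_sync: Optional[Callable[[], None]] = None) -> str:
    """Block until copies issued on load_stream are safe to read from compute.

    Returns which kind of sync was performed: "stream" or "device".
    Hypothetical helper; names are illustrative only.
    """
    # Always drain the copy stream first.
    load_stream.synchronize()
    if device_type != "xpu":
        # On CUDA, the per-stream sync is reported to be sufficient.
        return "stream"
    if device_sync is None:
        # Imported lazily so the helper can be exercised without torch installed.
        import torch
        device_sync = torch.xpu.synchronize
    # On XPU, add a device-wide barrier before compute reads the tensors.
    device_sync()
    return "device"
```

The device-wide barrier serializes prefetch and compute more than a per-stream sync would, so it likely reduces overlap; it trades some throughput for correctness.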
Root Cause
Intel XPU stream semantics differ from CUDA:
| | CUDA | Intel XPU (Arc 140V) |
|---|---|---|
| HIGH stream (priority=-1) | safe for all ops | copy-only, crashes on compute |
| LOW stream (priority=+1) | supported | not supported |
| Cross-stream visibility | per-stream sync sufficient | device-wide sync required |
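The differences in the table above could be captured as a per-backend stream plan, as in the minimal sketch below. `StreamPlan` and `plan_streams` are hypothetical names. The CUDA load-stream priority of 0 is an assumption, since the report only states the compute stream's priority. Using HIGH (-1) for the XPU copy stream is also only one possible choice.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StreamPlan:
    compute_priority: int   # priority for the compute stream
    load_priority: int      # priority for the H2D copy stream
    device_wide_sync: bool  # whether a device-wide sync is needed after prefetch


def plan_streams(device_type: str) -> StreamPlan:
    """Pick stream priorities per backend (illustrative, not the project's code)."""
    if device_type == "xpu":
        # Compute stays on DEFAULT (0); HIGH (-1) is used only for copies.
        # LOW (+1) is never requested because it is reported as unsupported.
        return StreamPlan(compute_priority=0, load_priority=-1,
                          device_wide_sync=True)
    # CUDA: compute on HIGH (-1) as in the original code;
    # load_priority=0 is an assumption made for this sketch.
    return StreamPlan(compute_priority=-1, load_priority=0,
                      device_wide_sync=False)
```

Keeping the choice in one function means the stream manager itself would not need backend-specific branches beyond reading the plan.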