[XPU] WeightAsyncStreamManager: stream priority and sync strategy need XPU-specific branch #961

@Tyr0727

Environment

  • Hardware: Intel Arc 140V (Lunar Lake iGPU, unified LPDDR5X memory)
  • OS: Windows 11
  • torch version: 2.9.0+xpu
  • Platform: lightx2v_platform with AI_DEVICE = "xpu"

Problem

When running block/phase offload on Intel XPU, WeightAsyncStreamManager uses the
same stream priority configuration as CUDA, which causes three distinct failures:

1. HIGH-priority stream crashes on compute kernels

The original code assigns compute_stream = Stream(priority=-1) (HIGH priority).
On Intel Arc XPU, HIGH-priority streams are only safe for copy_() operations.
Using them for heavy compute kernels (oneDNN matmul, attention) causes a hard crash:
RuntimeError: [oneDNN] ... stream priority conflict / illegal use of high-priority stream

2. priority=+1 (LOW) is not supported

torch.xpu.Stream(priority=1) raises an error on Arc 140V — only -1 (HIGH) and
0 (DEFAULT) are accepted.

3. Cross-stream memory visibility requires device-wide sync

After cuda_load_stream.synchronize() on XPU, tensors written in that stream are
not guaranteed to be visible to the compute stream. A per-stream sync is insufficient;
torch.xpu.synchronize() (device-wide) is required to ensure correct H2D prefetch
visibility before compute.
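
The sync choice described here could be isolated in one helper. This is a sketch under assumptions: `wait_for_prefetch` is a hypothetical name, the stream is any object with a `synchronize()` method, and the device-wide sync is passed in as a callable so the helper does not import torch itself.

```python
from typing import Callable, Protocol


class SupportsSynchronize(Protocol):
    def synchronize(self) -> None: ...


def wait_for_prefetch(
    load_stream: SupportsSynchronize,
    device_sync: Callable[[], None],
    per_stream_sync_sufficient: bool,
) -> str:
    """Block until prefetched weights are visible to the compute stream.

    Returns which strategy was used, which is handy for logging.
    """
    if per_stream_sync_sufficient:
        # CUDA-style path: waiting on the load stream alone is enough.
        load_stream.synchronize()
        return "stream"
    # XPU path as reported in this issue: a device-wide sync is needed.
    device_sync()
    return "device"
```

On XPU the call might look like `wait_for_prefetch(cuda_load_stream, torch.xpu.synchronize, per_stream_sync_sufficient=False)`. A device-wide sync also stalls the compute stream, so it trades some overlap for correctness.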

Root Cause

Intel XPU stream semantics differ from CUDA:

| | CUDA | Intel XPU (Arc 140V) |
|---|---|---|
| HIGH stream (priority=-1) | safe for all ops | copy-only, crashes on compute |
| LOW stream (priority=+1) | supported | not supported |
| Cross-stream visibility | per-stream sync sufficient | device-wide sync required |
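
The comparison above could drive an XPU-specific branch as plain data. The sketch below is illustrative only: `StreamSemantics`, `SEMANTICS`, and `resolve_priority` are hypothetical names, and falling back to DEFAULT (0) whenever a requested priority is unsafe or unsupported is an assumed policy, not the project's actual fix.

```python
from dataclasses import dataclass

HIGH, DEFAULT, LOW = -1, 0, 1


@dataclass(frozen=True)
class StreamSemantics:
    high_safe_for_compute: bool
    low_supported: bool
    per_stream_sync_sufficient: bool


# Values taken from the comparison in this issue.
SEMANTICS = {
    "cuda": StreamSemantics(True, True, True),
    "xpu": StreamSemantics(False, False, False),
}


def resolve_priority(device: str, requested: int, for_compute: bool) -> int:
    """Map a requested stream priority to one the device can safely use."""
    sem = SEMANTICS[device]
    if requested > DEFAULT and not sem.low_supported:
        return DEFAULT  # LOW is rejected on this device
    if requested < DEFAULT and for_compute and not sem.high_safe_for_compute:
        return DEFAULT  # HIGH is copy-only on this device
    return requested


def needs_device_wide_sync(device: str) -> bool:
    """True when a per-stream sync does not guarantee cross-stream visibility."""
    return not SEMANTICS[device].per_stream_sync_sufficient
```

With this mapping, `resolve_priority("xpu", HIGH, for_compute=True)` falls back to DEFAULT, while a copy-only load stream on XPU can keep HIGH.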
