[XPU] WeightAsyncStreamManager: stream priority and sync strategy need XPU-specific branch #961

## Environment
  - Hardware: Intel Arc 140V (Lunar Lake iGPU, unified LPDDR5X memory)
  - OS: Windows 11
  - torch version: 2.9.0+xpu
  - Platform: `lightx2v_platform` with `AI_DEVICE = "xpu"`

  ## Problem
  When running block/phase offload on Intel XPU, `WeightAsyncStreamManager` uses the
  same stream priority configuration as CUDA, which causes three distinct problems:

  ### 1. HIGH-priority stream crashes on compute kernels
  The original code assigns `compute_stream = Stream(priority=-1)` (HIGH priority).
  On Intel Arc XPU, HIGH-priority streams are only safe for `copy_()` operations.
  Using them for heavy compute kernels (oneDNN matmul, attention) causes a hard crash:
  `RuntimeError: [oneDNN] ... stream priority conflict / illegal use of high-priority stream`
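
One possible way to avoid this is to pick the priority per device and per stream role, so that compute never lands on a HIGH-priority XPU stream. The sketch below is an illustration only: the function name `stream_priority_for` and the role strings `"compute"` and `"load"` are hypothetical and not part of `lightx2v`. The CUDA branch only mirrors the `priority=-1` compute setting quoted above and assumes other roles use the default priority.

```python
HIGH, DEFAULT = -1, 0  # priority values as used in this report


def stream_priority_for(device: str, role: str) -> int:
    """Return a stream priority for a hypothetical (device, role) pair."""
    if device == "xpu":
        # On Arc XPU, HIGH is reported safe only for copy_() traffic,
        # so compute kernels stay on a DEFAULT-priority stream.
        return HIGH if role == "load" else DEFAULT
    # Other devices (assumed CUDA): keep compute on a HIGH-priority stream.
    return HIGH if role == "compute" else DEFAULT
```

The returned value would then be passed as `priority=` when each stream is constructed.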

  ### 2. `priority=+1` (LOW) is not supported
  `torch.xpu.Stream(priority=1)` raises an error on Arc 140V; only `-1` (HIGH) and
  `0` (DEFAULT) are accepted.
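
A minimal sketch of a guard for this case, assuming the accepted values are exactly `-1` and `0` as observed above: clamp any requested priority into the supported range before constructing the stream. The helper name `clamp_priority` and its `supported` parameter are hypothetical.

```python
def clamp_priority(requested: int, supported=(-1, 0)) -> int:
    """Map a requested stream priority onto the nearest supported value."""
    lowest, highest = min(supported), max(supported)
    # A LOW request (+1) falls back to DEFAULT (0) instead of raising.
    return max(lowest, min(highest, requested))
```

With this guard, a call site that asks for `priority=1` would create a DEFAULT-priority stream on XPU, while `-1` and `0` pass through unchanged.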

  ### 3. Cross-stream memory visibility requires device-wide sync
  After `cuda_load_stream.synchronize()` on XPU, tensors written in that stream are
  not guaranteed to be visible to the compute stream. A per-stream sync is insufficient;
  `torch.xpu.synchronize()` (device-wide) is required to ensure correct H2D prefetch
  visibility before compute.
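
The sync choice described above could be isolated in one helper, sketched here under the assumption that the manager knows its device string. The name `sync_after_prefetch` is hypothetical; `device_module` is expected to be `torch.xpu` or `torch.cuda` and is passed in so the helper does not import torch itself.

```python
def sync_after_prefetch(load_stream, device: str, device_module) -> None:
    """Wait for prefetched weights to become visible before compute."""
    if device == "xpu":
        # Per this report, a per-stream sync is not sufficient on XPU,
        # so use the device-wide barrier (torch.xpu.synchronize()).
        device_module.synchronize()
    else:
        # CUDA path: a per-stream sync is sufficient.
        load_stream.synchronize()
```

A device-wide sync also waits on the compute stream, so it may reduce copy/compute overlap on XPU; that cost is not measured here.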

  ## Root Cause
  Intel XPU stream semantics differ from CUDA:

  | Behavior | CUDA | Intel XPU (Arc 140V) |
  |---|---|---|
  | HIGH stream (`priority=-1`) | safe for all ops | **copy-only**, crashes on compute |
  | LOW stream (`priority=+1`) | supported | **not supported** |
  | Cross-stream visibility | per-stream sync sufficient | device-wide sync required |

