Skip to content

Add standalone C++/HIP examples investigating hipMemAllocationTypeUncached for P2P atomics#419

Draft
Copilot wants to merge 7 commits intomainfrom
copilot/fix-vmem-allocator-atomics
Draft

Add standalone C++/HIP examples investigating hipMemAllocationTypeUncached for P2P atomics#419
Copilot wants to merge 7 commits intomainfrom
copilot/fix-vmem-allocator-atomics

Conversation

Copy link
Contributor

Copilot AI commented Mar 4, 2026

hipMemAllocationTypeUncached (0x40000000) is accepted by hipMemCreate but CLR still hardcodes the coarse-grained GPU pool, so P2P atomics crash identically to hipMemAllocationTypePinned. These examples make that reproducible and debuggable on any machine.

iris/hip.py

  • Add hipMemAllocationTypeUncached = 0x40000000 constant
  • Add alloc_type parameter to mem_create() (default: hipMemAllocationTypePinned, backward-compatible)
  • Thread alloc_type through _HIPVMemP2P and _run_p2p_atomics_hip

tests/unittests/test_hsa_vmem_reproducible.py

  • test_hip_vmem_uncached_alloc_type: probes hipMemCreate with uncached type; skips if driver rejects it, otherwise tests local allocation + single-rank atomic only. Cross-rank access not tested — empirically it still causes SIGSEGV (same as pinned), confirming CLR ignores the type.
  • Replace magic tolerances 0.01 / 0.5 with named constants _ATOMIC_EXACT_TOL / _ATOMIC_COUNT_TOL

csrc/standalone/ (new)

Two minimal, self-contained C++ programs — no MPI, no torchrun. Each spawns two ranks via fork + execl("/proc/self/exe") (self-exec avoids HSA internal thread corruption from plain fork), coordinating over a socketpair with SCM_RIGHTS FD passing.

p2p_atomics_hsa.cpp — Path 3 (HSA fine-grained, correct)

// Allocate from the fine-grained pool directly — CLR never reaches this path
hsa_amd_vmem_handle_create(fine_grained_pool, size, MEMORY_TYPE_NONE, 0, &handle);
hsa_amd_vmem_export_shareable_handle(&dmabuf_fd, handle, 0);
// ... SCM_RIGHTS exchange ...
hsa_amd_vmem_import_shareable_handle(peer_fd, &peer_handle);
// P2P sys-scope atomics → 0/N failures

p2p_atomics_hip.cpp — Path 2 (HIP VMem, always coarse-grained)

./p2p_atomics_hip                # safe: P2P non-atomic read only
./p2p_atomics_hip --uncached     # same setup with hipMemAllocationTypeUncached
./p2p_atomics_hip --atomics      # WARNING: GPU page fault (SIGSEGV) — demonstrates the bug

CLR routes both --pinned and --uncached through the coarse-grained pool (CoarseGrain=1 in KFD). P2P atomics crash identically for both.

Implementation notes:

  • hipMemset/hipMemcpy(D2H) silently fail on HSA VMem VAs (not registered in HIP's pointer tables); init and readback use k_zero / k_copy kernels with a hipMalloc bounce buffer
  • k_copy calls __threadfence_system() before the read for cross-GPU write visibility
  • Agent-scope atomic results are informational only; sys-scope drives PASS/FAIL
Original prompt

This section details on the original issue you should resolve

<issue_title>VMem allocator and atomics</issue_title>
<issue_description>From previous tests, only the runner machine had issues running some atomics (all reduce for example) when using the vmem allocator. Could be a bad initialization flag. I am not able to reproduce the bug on other GPUs. Please use the vmem allocator, run all unittests and figure out a fix. For reference, consider shallow cloning https://github.com/ROCm/rocm-systems to understand how the low level apis. Your outcomes should be 2 things: 1: minimal repro that doesn't include any iris code that show the bug, 2: ideally a bug fix patch (I suspect the code had some wrong flag when allocating memory or something similar)</issue_description>

Comments on the Issue (you are @copilot in this section)


💡 You can make Copilot smarter by setting up custom instructions, customizing its development environment and configuring Model Context Protocol (MCP) servers. Learn more Copilot coding agent tips in the docs.

Copilot AI and others added 2 commits March 4, 2026 08:53
…with malloc_fine_grained

The root cause of intermittent P2P atomic failures with VMemAllocator was that
hipMemCreate creates coarse-grained GPU memory, which does not support correct
cross-GPU atomic operations for scope=cta or scope=gpu.

Changes:
- iris/allocators/vmem_allocator.py: Complete rewrite to use malloc_fine_grained
  (hipExtMallocWithFlags with hipDeviceMallocFinegrained) for physical memory.
  Add establish_peer_access() using import_dmabuf_handle (hipImportExternalMemory)
  which preserves fine-grained memory type for cross-GPU atomic correctness.
  Remove get_allocation_segments() (no longer needed for single allocation model).
  Change import_external_tensor() to copy semantics (same as TorchAllocator).
- iris/symmetric_heap.py: Update as_symmetric() docstring to reflect copy semantics
  for both torch and vmem allocators.
- tests/unittests/test_atomic_add_triton.py: Add test_atomic_add_vmem() covering
  all scope/sem combinations for VMemAllocator.
- tests/unittests/test_vmem_allocator.py: Update test_vmem_import_external_tensor
  to reflect copy semantics.
- tests/unittests/test_pytorch_import_mechanism.py: Update test_as_symmetric_basic
  to reflect copy semantics (no shared memory with external tensor).

Co-authored-by: mawad-amd <112003944+mawad-amd@users.noreply.github.com>
…r_heap_base

Co-authored-by: mawad-amd <112003944+mawad-amd@users.noreply.github.com>
Copilot AI changed the title [WIP] Fix vmem allocator and atomics issue in runner machine Fix VMemAllocator P2P atomic failures by switching from coarse-grained to fine-grained memory Mar 4, 2026
Co-authored-by: mawad-amd <112003944+mawad-amd@users.noreply.github.com>
Copilot AI changed the title Fix VMemAllocator P2P atomic failures by switching from coarse-grained to fine-grained memory Add HSA VMem (Path 3): direct fine-grained VMem APIs, minimal P2P repro, and HsaVMemAllocator Mar 4, 2026
Co-authored-by: mawad-amd <112003944+mawad-amd@users.noreply.github.com>
Copilot AI changed the title Add HSA VMem (Path 3): direct fine-grained VMem APIs, minimal P2P repro, and HsaVMemAllocator Add HSA VMem (Path 3) as fine-grained alternative to broken HIP VMem for P2P atomics Mar 4, 2026
… probe test

Co-authored-by: mawad-amd <112003944+mawad-amd@users.noreply.github.com>
Copilot AI changed the title Add HSA VMem (Path 3) as fine-grained alternative to broken HIP VMem for P2P atomics Confirm HSA VMem (Path 3) fixes P2P atomic coherency; probe hipMemAllocationTypeUncached Mar 4, 2026
…s HIP coarse-grained)

Co-authored-by: mawad-amd <112003944+mawad-amd@users.noreply.github.com>
Copilot AI changed the title Confirm HSA VMem (Path 3) fixes P2P atomic coherency; probe hipMemAllocationTypeUncached Add standalone C++/HIP examples investigating hipMemAllocationTypeUncached for P2P atomics Mar 4, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

VMem allocator and atomics

2 participants