Add standalone C++/HIP examples investigating hipMemAllocationTypeUncached for P2P atomics #419
Draft
…with malloc_fine_grained

The root cause of intermittent P2P atomic failures with VMemAllocator was that hipMemCreate creates coarse-grained GPU memory, which does not support correct cross-GPU atomic operations for scope=cta or scope=gpu.

Changes:
- iris/allocators/vmem_allocator.py: Complete rewrite to use malloc_fine_grained (hipExtMallocWithFlags with hipDeviceMallocFinegrained) for physical memory. Add establish_peer_access() using import_dmabuf_handle (hipImportExternalMemory), which preserves the fine-grained memory type for cross-GPU atomic correctness. Remove get_allocation_segments() (no longer needed for the single-allocation model). Change import_external_tensor() to copy semantics (same as TorchAllocator).
- iris/symmetric_heap.py: Update as_symmetric() docstring to reflect copy semantics for both torch and vmem allocators.
- tests/unittests/test_atomic_add_triton.py: Add test_atomic_add_vmem() covering all scope/sem combinations for VMemAllocator.
- tests/unittests/test_vmem_allocator.py: Update test_vmem_import_external_tensor to reflect copy semantics.
- tests/unittests/test_pytorch_import_mechanism.py: Update test_as_symmetric_basic to reflect copy semantics (no shared memory with external tensor).

Co-authored-by: mawad-amd <112003944+mawad-amd@users.noreply.github.com>
…r_heap_base

Co-authored-by: mawad-amd <112003944+mawad-amd@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Fix vmem allocator and atomics issue in runner machine" to "Fix VMemAllocator P2P atomic failures by switching from coarse-grained to fine-grained memory" (Mar 4, 2026)
Co-authored-by: mawad-amd <112003944+mawad-amd@users.noreply.github.com>
Copilot AI changed the title from "Fix VMemAllocator P2P atomic failures by switching from coarse-grained to fine-grained memory" to "Add HSA VMem (Path 3): direct fine-grained VMem APIs, minimal P2P repro, and HsaVMemAllocator" (Mar 4, 2026)
Co-authored-by: mawad-amd <112003944+mawad-amd@users.noreply.github.com>
Copilot AI changed the title from "Add HSA VMem (Path 3): direct fine-grained VMem APIs, minimal P2P repro, and HsaVMemAllocator" to "Add HSA VMem (Path 3) as fine-grained alternative to broken HIP VMem for P2P atomics" (Mar 4, 2026)
… probe test

Co-authored-by: mawad-amd <112003944+mawad-amd@users.noreply.github.com>
Copilot AI changed the title from "Add HSA VMem (Path 3) as fine-grained alternative to broken HIP VMem for P2P atomics" to "Confirm HSA VMem (Path 3) fixes P2P atomic coherency; probe hipMemAllocationTypeUncached" (Mar 4, 2026)
…s HIP coarse-grained)

Co-authored-by: mawad-amd <112003944+mawad-amd@users.noreply.github.com>
Copilot AI changed the title from "Confirm HSA VMem (Path 3) fixes P2P atomic coherency; probe hipMemAllocationTypeUncached" to "Add standalone C++/HIP examples investigating hipMemAllocationTypeUncached for P2P atomics" (Mar 4, 2026)
hipMemAllocationTypeUncached (0x40000000) is accepted by hipMemCreate, but CLR still hardcodes the coarse-grained GPU pool, so P2P atomics crash identically to hipMemAllocationTypePinned. These examples make that reproducible and debuggable on any machine.

iris/hip.py
- hipMemAllocationTypeUncached = 0x40000000 constant
- alloc_type parameter to mem_create() (default: hipMemAllocationTypePinned, backward-compatible)
- alloc_type passed through _HIPVMemP2P and _run_p2p_atomics_hip

tests/unittests/test_hsa_vmem_reproducible.py
- test_hip_vmem_uncached_alloc_type: probes hipMemCreate with the uncached type; skips if the driver rejects it, otherwise tests local allocation and a single-rank atomic only. Cross-rank access is not tested: empirically it still causes SIGSEGV (same as pinned), confirming that CLR ignores the type.
- 0.01/0.5 replaced with named constants _ATOMIC_EXACT_TOL/_ATOMIC_COUNT_TOL

csrc/standalone/ (new)
Two minimal, self-contained C++ programs: no MPI, no torchrun. Each spawns two ranks via fork + execl("/proc/self/exe") (self-exec avoids HSA internal thread corruption from plain fork), coordinating over a socketpair with SCM_RIGHTS FD passing.
- p2p_atomics_hsa.cpp: Path 3 (HSA fine-grained, correct)
- p2p_atomics_hip.cpp: Path 2 (HIP VMem, always coarse-grained)

CLR routes both --pinned and --uncached through the coarse-grained pool (CoarseGrain=1 in KFD). P2P atomics crash identically for both.

Implementation notes:
- hipMemset/hipMemcpy (D2H) silently fail on HSA VMem VAs (not registered in HIP's pointer tables); init and readback use k_zero/k_copy kernels with a hipMalloc bounce buffer
- k_copy calls __threadfence_system() before the read for cross-GPU write visibility
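For readers unfamiliar with the socketpair + SCM_RIGHTS coordination mentioned above, here is a minimal sketch of passing a file descriptor between two processes. It is not the code from the standalone programs: the helper names (send_fd, recv_fd, pass_fd_between_ranks) are hypothetical, a pipe end stands in for whatever descriptor the real programs exchange (presumably an exported GPU memory handle), and it uses plain fork for brevity. Plain fork is acceptable here only because no HSA runtime is loaded; the programs described above re-exec via /proc/self/exe instead.

```cpp
#include <sys/socket.h>
#include <sys/uio.h>
#include <sys/wait.h>
#include <unistd.h>
#include <cassert>
#include <cstring>
#include <string>

// Hypothetical helper: send one file descriptor over a UNIX-domain socket.
static bool send_fd(int sock, int fd) {
    char byte = 'F';  // one byte of ordinary data carries the control message
    iovec iov{&byte, 1};
    alignas(cmsghdr) char ctrl[CMSG_SPACE(sizeof(int))] = {};
    msghdr msg{};
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl;
    msg.msg_controllen = sizeof(ctrl);
    cmsghdr* c = CMSG_FIRSTHDR(&msg);
    c->cmsg_level = SOL_SOCKET;
    c->cmsg_type = SCM_RIGHTS;  // ask the kernel to duplicate fd into the receiver
    c->cmsg_len = CMSG_LEN(sizeof(int));
    std::memcpy(CMSG_DATA(c), &fd, sizeof(int));
    return sendmsg(sock, &msg, 0) == 1;
}

// Hypothetical helper: receive one file descriptor; returns -1 on failure.
static int recv_fd(int sock) {
    char byte = 0;
    iovec iov{&byte, 1};
    alignas(cmsghdr) char ctrl[CMSG_SPACE(sizeof(int))] = {};
    msghdr msg{};
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl;
    msg.msg_controllen = sizeof(ctrl);
    if (recvmsg(sock, &msg, 0) != 1) return -1;
    cmsghdr* c = CMSG_FIRSTHDR(&msg);
    if (!c || c->cmsg_level != SOL_SOCKET || c->cmsg_type != SCM_RIGHTS) return -1;
    int fd = -1;
    std::memcpy(&fd, CMSG_DATA(c), sizeof(int));
    return fd;
}

// Two "ranks" share a socketpair. Rank 0 creates a pipe after the fork (so
// rank 1 cannot simply inherit it) and passes the write end via SCM_RIGHTS.
// Rank 1 writes a message through the received descriptor.
// Returns the text rank 0 read back, or "" on any failure.
std::string pass_fd_between_ranks() {
    int sv[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0) return "";
    pid_t pid = fork();
    if (pid < 0) return "";
    if (pid == 0) {  // rank 1
        close(sv[0]);
        int fd = recv_fd(sv[1]);
        const char text[] = "hello from rank 1";
        const ssize_t len = sizeof(text) - 1;
        bool ok = fd >= 0 && write(fd, text, len) == len;
        _exit(ok ? 0 : 1);
    }
    close(sv[1]);  // rank 0
    std::string out;
    int p[2];
    if (pipe(p) == 0) {
        bool sent = send_fd(sv[0], p[1]);
        close(p[1]);  // drop our write end so read() eventually sees EOF
        char buf[64];
        ssize_t n;
        while (sent && (n = read(p[0], buf, sizeof(buf))) > 0) out.append(buf, n);
        close(p[0]);
    }
    close(sv[0]);  // unblocks rank 1 if nothing was sent
    int status = 0;
    waitpid(pid, &status, 0);
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? out : "";
}
```

Calling pass_fd_between_ranks() from main should return "hello from rank 1" on Linux. The descriptor number seen by the receiver generally differs from the sender's; only the underlying open file is shared, which is why the value has to travel as ancillary data rather than as a plain integer.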