Skip to content

Conversation

nirandaperera
Copy link
Contributor

Description

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

Signed-off-by: niranda perera <[email protected]>
Signed-off-by: niranda perera <[email protected]>
rmm::detail::format_bytes(size) + ")",
rmm::out_of_memory);
auto const block = this->underlying().get_block(size, stream_event);
auto const block = get_block(size, stream_event);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❓ question: ‏ Why drop the CRTP indirection here? This doesn't seem related to this PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@harrism that's right. But when I was reading the code, what I gathered was, get_block is not implemented by the derived class. It's not mentioned here as well. https://github.com/nirandaperera/rmm/blob/adding_nvtx_pool/cpp/include/rmm/mr/device/detail/stream_ordered_memory_resource.hpp#L70-L76
So, IINM, we can simply call the method, without the indirection.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I see. Good catch.

#endif

#ifdef RMM_NVTX
void* heap_key;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this adds some overhead on every suballocation. And the insertion into the nvtx_heaps map is a small overhead on upstream allocations.

Can you please benchmark this cost with the random allocations benchmark with NVTX on and off and report it in the PR? Is NVTX enabled by default? Depending on these costs, we may want it off by default.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@harrism Yes, there is an overhead here. In particular

  1. Inserting and querying from the nvtx_heaps_ unordered map.
  2. calling lower_bound on unstream_blocks_ set (which is logarithmic)

I think we can alleviate 2, if we add a void* upstream_ member to the block class, rather than the bool head. Then IINM, is_head() will be upstream_ == ptr_. But then, we are adding additional 3-bytes to the block class.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think its a worthwhile change?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like to see benchmarks, if you don't mind. :)

@harrism harrism added non-breaking Non-breaking change feature request New feature or request labels Jun 11, 2025
@nirandaperera nirandaperera requested a review from a team as a code owner June 12, 2025 00:02
@github-actions github-actions bot added the CMake label Jun 12, 2025
@nirandaperera nirandaperera marked this pull request as draft June 12, 2025 00:03
Copy link

copy-pr-bot bot commented Jun 12, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@wence-
Copy link
Contributor

wence- commented Jul 30, 2025

@nirandaperera Have you had a chance to run the benchmarks Mark was looking for to see any perf differences?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CMake feature request New feature or request non-breaking Non-breaking change

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

3 participants