Conversation
Codecov Report
All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

@@           Coverage Diff           @@
##           master      #42   +/-   ##
=======================================
  Coverage   99.48%   99.48%
=======================================
  Files           9        9
  Lines         195      195
=======================================
  Hits          194      194
  Misses          1        1

View full report in Codecov by Sentry.
const auto to_local_node_data = to_local_node.data_ptr<scalar_t>();
auto deg_data = deg.data_ptr<scalar_t>();

// Compute induced subgraph degree, parallelize with 32 threads per node:
I'm actually not sure it is necessary to parallelize with 32 threads per node. Most of the time we are dealing with sparse data, and many threads will never enter the for loop.
If you are looking for extreme performance, you can bundle to_local_node_data and col_data into one iterator structure and use this function. I haven't seen anything outperform it in the past:
https://nvlabs.github.io/cub/structcub_1_1_device_segmented_reduce.html#a4854a13561cb66d46aa617aab16b8825
Do you have an example of bundling to_local_node_data and col_data into one iterator structure? This looks really interesting.
I am okay with dropping the warp-level parallelism for now, but we would lose the contiguous access to col_data and probably under-utilize the number of threads available on modern GPUs.
On a second look, this doesn't seem possible: col_data refers to edges, while to_local_node_data refers to nodes, and we actually want to do the computation across the number of nodes in the induced subgraph.
// We maintain a O(N) vector to map global node indices to local ones.
// TODO Can we do this without O(N) storage requirement?
const auto to_local_node = nodes.new_full({rowptr.size(0) - 1}, -1);
Does N mean the number of nodes in the graph?
What if we filtered on each node in nodes_data instead, since it should be much smaller than rowptr_data?
Otherwise, we may consider caching this tensor to avoid the memory allocation each time.
Good points! We use this vector as the mapping from global node indices to new local ones. In C++ we use a map for this, but we can't do the same in CUDA. I don't know of a more elegant solution.
Caching is an option as well, but it requires a (non-intuitive and backend-specific) change to the input arguments. I added it as a TODO for now.
There are GPU hash tables/sets, which may require some atomic operations when you build them, but lookup is fast.
I found that caching is not a good option, since you have to reset the array every time.
Since you can sample on the GPU, the graph is not that big; a node array is not that bad and keeps the code less complicated.
yaoyaowd left a comment:
Let's see if we can improve it later.