Skip to content

Refactor kernel_density to use less memory#7833

Merged
rapids-bot[bot] merged 11 commits into
rapidsai:release/26.06from
Intron7:refactor-kernel-density
May 26, 2026
Merged

Refactor kernel_density to use less memory#7833
rapids-bot[bot] merged 11 commits into
rapidsai:release/26.06from
Intron7:refactor-kernel-density

Conversation

@Intron7
Copy link
Copy Markdown
Contributor

@Intron7 Intron7 commented Feb 26, 2026

Hey this is my first time working on the c++ / cython layer so....

I recently came across Welford's algorithm and I thought something similar should work for kernel density to not need to compute the full pairwise distance matrix. So this does now an online log-sum-exp with max tracking. This way we can run arbitrarily big embeddings without any memory issues and batching.

@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented Feb 26, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai

This comment was marked as low quality.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
python/cuml/cuml/neighbors/kernel_density.py (1)

252-259: Consider using next(iter()) for single-value extraction.

Per static analysis (RUF015), prefer next(iter(self.metric_params.values())) over creating an intermediate list for a single element.

Suggested improvement
         if self.metric_params:
             if len(self.metric_params) != 1:
                 raise ValueError(
                     "Cuml only supports metrics with a single arg."
                 )
-            metric_arg = float(list(self.metric_params.values())[0])
+            metric_arg = float(next(iter(self.metric_params.values())))
         else:
             metric_arg = 2.0
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python/cuml/cuml/neighbors/kernel_density.py` around lines 252 - 259, The
code in kernel_density.py currently converts metric_params.values() to a list to
extract a single value for metric_arg; replace that intermediate list with an
iterator-based fetch using next(iter(self.metric_params.values())) and cast it
to float (i.e., metric_arg = float(next(iter(self.metric_params.values()))))
while preserving the existing single-value length check and default branch;
update the block around the metric_params handling in the KernelDensity
implementation where metric_arg is assigned.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cpp/src/kde/kde.cu`:
- Around line 438-442: The CUDA kernel launch of kde_fused_kernel<T, M, K> is
missing a post-launch error check; include the RAFT CUDA utilities header
(raft/util/cuda_utils.cuh) and add a RAFT_CUDA_TRY(...) check immediately after
the kernel launch inside the same scope (e.g., after
kde_fused_kernel<<<...>>>(...)) to catch asynchronous launch errors; ensure the
RAFT_CUDA_TRY invocation uses the appropriate CUDA error query
(cudaGetLastError()/cudaPeekAtLastError() as provided by RAFT) and keep the
change local to the kernel launch block.

---

Nitpick comments:
In `@python/cuml/cuml/neighbors/kernel_density.py`:
- Around line 252-259: The code in kernel_density.py currently converts
metric_params.values() to a list to extract a single value for metric_arg;
replace that intermediate list with an iterator-based fetch using
next(iter(self.metric_params.values())) and cast it to float (i.e., metric_arg =
float(next(iter(self.metric_params.values())))) while preserving the existing
single-value length check and default branch; update the block around the
metric_params handling in the KernelDensity implementation where metric_arg is
assigned.

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ed4de0a and 7ae3918.

📒 Files selected for processing (7)
  • cpp/CMakeLists.txt
  • cpp/include/cuml/neighbors/kde.hpp
  • cpp/src/kde/kde.cu
  • python/cuml/cuml/neighbors/CMakeLists.txt
  • python/cuml/cuml/neighbors/kde.pyx
  • python/cuml/cuml/neighbors/kernel_density.py
  • python/cuml/tests/test_kernel_density.py

Comment thread cpp/src/kde/kde.cu Outdated
@jcrist
Copy link
Copy Markdown
Member

jcrist commented Feb 27, 2026

Thanks for the PR! On a first brief skim the idea looks sound. I'm a bit wary of the code duplication between RAFT/cuvs/cuml here for distances, but it's honestly not so much code so worst case merging as is may be fine. Others more versed on the C++ side of things may have some suggestions though.

I probably won't have time to look more into this until Monday. One quick request I'd have if you have some time is to push up some more motivation for your use case here. How much of a memory savings is this providing for workloads you're running, and are there other benefits (perf, ...) worth noting? Any numbers you can provide to help motivate the change and use case would be very helpful here.

@jcrist
Copy link
Copy Markdown
Member

jcrist commented Feb 27, 2026

/ok to test 7ae3918

@jcrist jcrist requested review from jcrist and removed request for robertmaynard February 27, 2026 05:01
@Intron7
Copy link
Copy Markdown
Contributor Author

Intron7 commented Feb 27, 2026

I have done some small benchmarks. For small datasets the performance is roughly the same the new implementation is 1.1x faster for (10000x10000). However for a bigger embedding (200000,200000) where I need to chunk to not blow up memory this is 11 times faster. The memory use is the most impactful part. It goes from (n x m) to (n + m) since we never compute this massive pairwise distance matrix. I was trying to use the raft distances. Some of them worked others didn't because they assume a different threadlayout. So I created custom distance functions.

Copy link
Copy Markdown
Contributor

@viclafargue viclafargue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Intron7! This would be very helpful to scale kernel_density to larger problem sizes. I could review the CUDA code. It looks like there is a loop over all the train vectors which would not scale well. However, this new solution would save a lot of memory. I suggested some optimizations. Have you benchmarked the old vs new solution on a case with a small n_query and large n_train? I wonder if this is really a drop-in replacement for what we had.

Comment thread cpp/src/kde/kde.cu Outdated
Comment thread cpp/src/kde/kde.cu Outdated
Comment thread cpp/src/kde/kde.cu Outdated
Comment thread cpp/src/kde/kde.cu Outdated
Comment thread cpp/src/kde/kde.cu Outdated
@Intron7
Copy link
Copy Markdown
Contributor Author

Intron7 commented Mar 4, 2026

For my current limited testing it's faster than the current implementation between 1.1 x faster to 11x faster. Also the speed being the same doesn't really matter if the other implementation breaks because a pairwise distance matrix blowing up the memory. I can definitely work on prefetching the data into shared memory. But right now it looks like the kernel is compute and not memory bound.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cpp/src/kde/kde.cu`:
- Around line 550-551: The code calls cudaDeviceGetAttribute(&sm_count,
cudaDevAttrMultiProcessorCount, 0) with a hard-coded device ID; change it to
query the current device first (e.g., call cudaGetDevice to obtain the active
device) and pass that device variable into cudaDeviceGetAttribute so sm_count is
obtained for the active GPU. Locate the cudaDeviceGetAttribute usage around
sm_count and replace the literal 0 with the retrieved current device (or obtain
the device from the provided raft::handle_t if available) to make the operation
device-agnostic.
- Around line 583-610: The code allocates partial_max and partial_sum with
cudaMallocAsync and manually frees them, which leaks if RAFT_CUDA_TRY throws;
replace raw T* allocations with RAII rmm::device_uvector<T> (construct with
buf_elems and stream) and pass .data() to kde_tiled_kernel and
kde_reduce_kernel, remove the explicit cudaFreeAsync calls, and ensure
includes/namespace for rmm are added so allocations are automatically freed on
exception or scope exit.
- Line 396: Avoid taking log(0) by skipping the log when a sample weight is
zero: in kde.cu where log_k is incremented using weights, add a guard that
checks weights is non-null and that weights[j_base + c] is greater than T(0)
before calling log, e.g., only add log(weights[j_base + c]) when the weight > 0;
update the same check around any other places that assume positive weights
and/or alternatively enforce weights > 0 in kernel_density.py validation if you
prefer failing earlier.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 439759da-d557-4f91-95c2-c3c10d90adcb

📥 Commits

Reviewing files that changed from the base of the PR and between 7ae3918 and 0de59e9.

📒 Files selected for processing (3)
  • cpp/include/cuml/neighbors/kde.hpp
  • cpp/src/kde/kde.cu
  • python/cuml/cuml/neighbors/kernel_density.py

Comment thread cpp/src/kde/kde.cu Outdated
Comment thread cpp/src/kde/kde.cu Outdated
Comment thread cpp/src/kde/kde.cu Outdated
Copy link
Copy Markdown
Contributor

@viclafargue viclafargue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tiled processing is a great addition for overall performance. Please add checks in the C++ API for d > 0, n_train > 0, n_query > 0, bandwidth > 0 with RAFT_EXPECTS.

Also, could you add some Pytest tests to check the different metrics and tiling layout for correctness against the reference KDE?

Comment thread cpp/src/kde/kde.cu Outdated
Comment thread cpp/src/kde/kde.cu Outdated
coderabbitai[bot]

This comment was marked as outdated.

Comment thread python/cuml/cuml/neighbors/kde.pyx Outdated
Comment thread python/cuml/cuml/neighbors/kde.pyx Outdated
@cjnolet
Copy link
Copy Markdown
Member

cjnolet commented Mar 13, 2026

Hey @Intron7 we have a kernel gram API in cuVS that handles pairwise distance / grammian computations for the other kernel methods like SVR/SVM. Rather than scattering these implementations across cuml and cuVs, we should really be aiming to consolidate them into a shared API of sorts, even if they end up dispatching to different impls at first. Just want to make sure we are representing algorithms with as much composability and reuse as possible.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
cpp/src/kde/kde.cu (1)

84-91: ⚠️ Potential issue | 🟡 Minor

Minkowski distance: division by near-zero p remains unguarded.

If metric_arg (p) is zero or very close to zero, T(1) / p in finalize will produce infinity or extreme values. While this is an edge case (callers typically use p ≥ 1), consider either:

  1. Adding input validation in score_samples to require p > 0 when metric is Minkowski, or
  2. Documenting the constraint in the API.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/kde/kde.cu` around lines 84 - 91, The finalize implementation of
DistOp for ML::distance::DistanceType::LpUnexpanded uses T(1)/p which can divide
by zero or near-zero p; update validation to ensure metric_arg (p) > 0 before
computing the power and surface the error to callers (e.g., in the score_samples
caller path) or clamp/handle tiny p values: add an explicit check for p <= 0 (or
p < epsilon) and return/report an error or fallback behavior, and document the
constraint for LpUnexpanded; reference DistOp<T,
ML::distance::DistanceType::LpUnexpanded>::finalize, accumulate, and the
score_samples codepath that provides metric_arg.
🧹 Nitpick comments (1)
python/cuml/tests/test_kernel_density.py (1)

346-353: Replace ambiguous multiplication sign in docstring.

Static analysis (RUF002) flags the × character as ambiguous. Consider using x or spelling out "by" for clarity.

Suggested fix
-def test_all_kernels_all_metrics(metric, kernel):
-    """Every metric × kernel combination produces output matching the reference.
+def test_all_kernels_all_metrics(metric, kernel):
+    """Every metric x kernel combination produces output matching the reference.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python/cuml/tests/test_kernel_density.py` around lines 346 - 353, The
docstring in test_all_kernels_all_metrics uses the ambiguous multiplication sign
"×"; replace it with a clear ASCII alternative such as "x" or the word "by" so
static analysis (RUF002) no longer flags it — update the docstring text inside
the test_all_kernels_all_metrics function accordingly to read e.g. "Every metric
x kernel combination…" or "Every metric by kernel combination…".
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@cpp/src/kde/kde.cu`:
- Around line 84-91: The finalize implementation of DistOp for
ML::distance::DistanceType::LpUnexpanded uses T(1)/p which can divide by zero or
near-zero p; update validation to ensure metric_arg (p) > 0 before computing the
power and surface the error to callers (e.g., in the score_samples caller path)
or clamp/handle tiny p values: add an explicit check for p <= 0 (or p < epsilon)
and return/report an error or fallback behavior, and document the constraint for
LpUnexpanded; reference DistOp<T,
ML::distance::DistanceType::LpUnexpanded>::finalize, accumulate, and the
score_samples codepath that provides metric_arg.

---

Nitpick comments:
In `@python/cuml/tests/test_kernel_density.py`:
- Around line 346-353: The docstring in test_all_kernels_all_metrics uses the
ambiguous multiplication sign "×"; replace it with a clear ASCII alternative
such as "x" or the word "by" so static analysis (RUF002) no longer flags it —
update the docstring text inside the test_all_kernels_all_metrics function
accordingly to read e.g. "Every metric x kernel combination…" or "Every metric
by kernel combination…".

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 69410cce-7cea-4904-be89-0c89f04c6bde

📥 Commits

Reviewing files that changed from the base of the PR and between e1a66b3 and 8827ad5.

📒 Files selected for processing (3)
  • cpp/src/kde/kde.cu
  • python/cuml/cuml/neighbors/kernel_density.py
  • python/cuml/tests/test_kernel_density.py

@Intron7
Copy link
Copy Markdown
Contributor Author

Intron7 commented Mar 13, 2026

rapidsai/cuvs#1915 is needed now for this PR since i moved the kernel to cuvs @cjnolet

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@python/cuml/tests/test_kernel_density.py`:
- Around line 382-389: In the docstring for test_all_kernels_all_metrics replace
the Unicode multiplication sign "×" with a plain ASCII "x" to avoid ambiguity
and ensure consistent encoding/reading across tools; update the string in the
function test_all_kernels_all_metrics accordingly so it reads "metric x kernel"
(or similar) instead of using the Unicode multiplication character.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: b311b680-ea71-4df1-9695-d61ee7bbc297

📥 Commits

Reviewing files that changed from the base of the PR and between 8827ad5 and 2951567.

📒 Files selected for processing (4)
  • cpp/include/cuml/neighbors/kde.hpp
  • cpp/src/kde/kde.cu
  • python/cuml/cuml/neighbors/kde.pyx
  • python/cuml/tests/test_kernel_density.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • cpp/src/kde/kde.cu
  • cpp/include/cuml/neighbors/kde.hpp

Comment thread python/cuml/tests/test_kernel_density.py
@jcrist jcrist added improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels May 20, 2026
@jcrist

This comment was marked as outdated.

Comment thread python/cuml/cuml/neighbors/kde.pyx Outdated
Comment thread python/cuml/cuml/neighbors/kde.pyx Outdated
Comment thread python/cuml/cuml/neighbors/kde.pyx Outdated
Comment thread python/cuml/cuml/neighbors/kde.pyx Outdated
Comment thread python/cuml/cuml/neighbors/kde.pyx Outdated
Comment thread python/cuml/cuml/neighbors/kernel_density.py Outdated
Copy link
Copy Markdown
Member

@jcrist jcrist left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a few fixups needed.

Comment thread python/cuml/cuml/neighbors/kernel_density.pyx Outdated
Comment thread python/cuml/cuml/neighbors/kernel_density.pyx Outdated
Comment thread python/cuml/cuml/neighbors/kernel_density.pyx Outdated
@jcrist
Copy link
Copy Markdown
Member

jcrist commented May 21, 2026

/ok to test 1814b4e

@Intron7
Copy link
Copy Markdown
Contributor Author

Intron7 commented May 21, 2026

@jcrist i don't think the dask failure is related.

@csadorf
Copy link
Copy Markdown
Contributor

csadorf commented May 26, 2026

/ok to test 94f769f

@jcrist jcrist force-pushed the refactor-kernel-density branch from 94f769f to e93e492 Compare May 26, 2026 18:04
@jcrist
Copy link
Copy Markdown
Member

jcrist commented May 26, 2026

/ok to test e93e492

@jcrist jcrist dismissed cjnolet’s stale review May 26, 2026 18:05

Requests resolved already in upstream cuvs

Copy link
Copy Markdown
Member

@jcrist jcrist left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Intron7! I've pushed a few fixups addressing my concerns to this branch. Provided tests pass, IMO this is good to go! Glad to have this in, thanks again!

@jcrist
Copy link
Copy Markdown
Member

jcrist commented May 26, 2026

/merge

@csadorf csadorf removed the request for review from jinsolp May 26, 2026 21:44
Copy link
Copy Markdown
Member

@jameslamb jameslamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving for packaging-codeowners, the CMake changes are small and non-controversial. I did not closely review anything else, it seems well-covered by other reviewers.

@csadorf
Copy link
Copy Markdown
Contributor

csadorf commented May 26, 2026

/merge

@rapids-bot rapids-bot Bot merged commit fea609a into rapidsai:release/26.06 May 26, 2026
102 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CMake CUDA/C++ Cython / Python Cython or Python issue improvement Improvement / enhancement to an existing function non-breaking Non-breaking change

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants