Refactor kernel_density to use less memory by Intron7 · Pull Request #7833 · rapidsai/cuml

Intron7 · 2026-02-26T17:48:20Z

Hey, this is my first time working on the C++ / Cython layer, so...

I recently came across Welford's algorithm and thought something similar should work for kernel density, so that we don't need to compute the full pairwise distance matrix. This PR now does an online log-sum-exp with max tracking. That way we can run arbitrarily big embeddings without memory issues and without batching.
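For readers unfamiliar with the technique, here is a minimal sketch of an online log-sum-exp with max tracking, written in plain Python rather than CUDA. It is not the PR's kernel; the function name `online_logsumexp` is hypothetical, and it only illustrates how a running maximum and a rescaled running sum avoid storing all the terms at once.

```python
import math


def online_logsumexp(values):
    """Streaming log(sum(exp(v))) using a running max and a rescaled sum."""
    running_max = -math.inf
    running_sum = 0.0  # sum of exp(v - running_max) over the values seen so far
    for v in values:
        if v == -math.inf:
            continue  # exp(-inf) is 0, so this term contributes nothing
        if v > running_max:
            # Rescale the accumulated sum to the new, larger reference point.
            running_sum = running_sum * math.exp(running_max - v) + 1.0
            running_max = v
        else:
            running_sum += math.exp(v - running_max)
    if running_sum == 0.0:
        return -math.inf  # empty input or only -inf terms
    return running_max + math.log(running_sum)
```

Each value is consumed once and only two scalars are kept, which is why a per-query accumulator of this kind would not need the full row of pairwise distances.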

copy-pr-bot · 2026-02-26T17:48:25Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

python/cuml/cuml/neighbors/kernel_density.py (1)

252-259: Consider using next(iter()) for single-value extraction.

Per static analysis (RUF015), prefer next(iter(self.metric_params.values())) over creating an intermediate list for a single element.

Suggested improvement

         if self.metric_params:
             if len(self.metric_params) != 1:
                 raise ValueError(
                     "Cuml only supports metrics with a single arg."
                 )
-            metric_arg = float(list(self.metric_params.values())[0])
+            metric_arg = float(next(iter(self.metric_params.values())))
         else:
             metric_arg = 2.0

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@python/cuml/cuml/neighbors/kernel_density.py` around lines 252 - 259, The
code in kernel_density.py currently converts metric_params.values() to a list to
extract a single value for metric_arg; replace that intermediate list with an
iterator-based fetch using next(iter(self.metric_params.values())) and cast it
to float (i.e., metric_arg = float(next(iter(self.metric_params.values()))))
while preserving the existing single-value length check and default branch;
update the block around the metric_params handling in the KernelDensity
implementation where metric_arg is assigned.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cpp/src/kde/kde.cu`:
- Around line 438-442: The CUDA kernel launch of kde_fused_kernel<T, M, K> is
missing a post-launch error check; include the RAFT CUDA utilities header
(raft/util/cuda_utils.cuh) and add a RAFT_CUDA_TRY(...) check immediately after
the kernel launch inside the same scope (e.g., after
kde_fused_kernel<<<...>>>(...)) to catch asynchronous launch errors; ensure the
RAFT_CUDA_TRY invocation uses the appropriate CUDA error query
(cudaGetLastError()/cudaPeekAtLastError() as provided by RAFT) and keep the
change local to the kernel launch block.

---

Nitpick comments:
In `@python/cuml/cuml/neighbors/kernel_density.py`:
- Around line 252-259: The code in kernel_density.py currently converts
metric_params.values() to a list to extract a single value for metric_arg;
replace that intermediate list with an iterator-based fetch using
next(iter(self.metric_params.values())) and cast it to float (i.e., metric_arg =
float(next(iter(self.metric_params.values())))) while preserving the existing
single-value length check and default branch; update the block around the
metric_params handling in the KernelDensity implementation where metric_arg is
assigned.

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ed4de0a and 7ae3918.

📒 Files selected for processing (7)

cpp/CMakeLists.txt
cpp/include/cuml/neighbors/kde.hpp
cpp/src/kde/kde.cu
python/cuml/cuml/neighbors/CMakeLists.txt
python/cuml/cuml/neighbors/kde.pyx
python/cuml/cuml/neighbors/kernel_density.py
python/cuml/tests/test_kernel_density.py

jcrist · 2026-02-27T04:59:51Z

Thanks for the PR! On a first brief skim the idea looks sound. I'm a bit wary of the code duplication between RAFT/cuvs/cuml here for distances, but it's honestly not so much code so worst case merging as is may be fine. Others more versed on the C++ side of things may have some suggestions though.

I probably won't have time to look more into this until Monday. One quick request I'd have if you have some time is to push up some more motivation for your use case here. How much of a memory savings is this providing for workloads you're running, and are there other benefits (perf, ...) worth noting? Any numbers you can provide to help motivate the change and use case would be very helpful here.

jcrist · 2026-02-27T05:00:39Z

/ok to test 7ae3918

Intron7 · 2026-02-27T07:28:40Z

I have done some small benchmarks. For small datasets the performance is roughly the same: the new implementation is 1.1x faster for (10000x10000). However, for a bigger embedding (200000,200000), where I need to chunk to not blow up memory, this is 11 times faster. The memory use is the most impactful part. It goes from (n x m) to (n + m), since we never compute this massive pairwise distance matrix. I was trying to use the RAFT distances. Some of them worked, others didn't because they assume a different thread layout. So I created custom distance functions.
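To put the (n x m) versus (n + m) claim in numbers, here is a rough back-of-the-envelope sketch. It assumes float32 values, reads the (200000,200000) case as 200000 query points against 200000 training points, and counts only the intermediate buffers, not the input data itself. All of these are assumptions, and the helper name is hypothetical.

```python
def kde_workspace_bytes(n_query, n_train, itemsize=4):
    """Estimate intermediate memory for the two approaches, in bytes.

    itemsize=4 assumes float32; input arrays are not counted.
    """
    pairwise = n_query * n_train * itemsize      # full distance matrix
    streaming = (n_query + n_train) * itemsize   # per-point buffers only
    return pairwise, streaming


pairwise, streaming = kde_workspace_bytes(200_000, 200_000)
print(f"pairwise matrix: {pairwise / 1e9:.0f} GB, streaming: {streaming / 1e6:.1f} MB")
```

Under those assumptions the pairwise matrix would need about 160 GB, while the streaming buffers stay in the megabyte range.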

viclafargue

Thanks @Intron7! This would be very helpful to scale kernel_density to larger problem sizes. I could review the CUDA code. It looks like there is a loop over all the train vectors which would not scale well. However, this new solution would save a lot of memory. I suggested some optimizations. Have you benchmarked the old vs new solution on a case with a small n_query and large n_train? I wonder if this is really a drop-in replacement for what we had.

Intron7 · 2026-03-04T22:54:40Z

In my limited testing so far it's between 1.1x and 11x faster than the current implementation. Also, the speed being the same wouldn't really matter if the other implementation breaks because the pairwise distance matrix blows up the memory. I can definitely work on prefetching the data into shared memory. But right now it looks like the kernel is compute bound, not memory bound.

coderabbitai

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cpp/src/kde/kde.cu`:
- Around line 550-551: The code calls cudaDeviceGetAttribute(&sm_count,
cudaDevAttrMultiProcessorCount, 0) with a hard-coded device ID; change it to
query the current device first (e.g., call cudaGetDevice to obtain the active
device) and pass that device variable into cudaDeviceGetAttribute so sm_count is
obtained for the active GPU. Locate the cudaDeviceGetAttribute usage around
sm_count and replace the literal 0 with the retrieved current device (or obtain
the device from the provided raft::handle_t if available) to make the operation
device-agnostic.
- Around line 583-610: The code allocates partial_max and partial_sum with
cudaMallocAsync and manually frees them, which leaks if RAFT_CUDA_TRY throws;
replace raw T* allocations with RAII rmm::device_uvector<T> (construct with
buf_elems and stream) and pass .data() to kde_tiled_kernel and
kde_reduce_kernel, remove the explicit cudaFreeAsync calls, and ensure
includes/namespace for rmm are added so allocations are automatically freed on
exception or scope exit.
- Line 396: Avoid taking log(0) by skipping the log when a sample weight is
zero: in kde.cu where log_k is incremented using weights, add a guard that
checks weights is non-null and that weights[j_base + c] is greater than T(0)
before calling log, e.g., only add log(weights[j_base + c]) when the weight > 0;
update the same check around any other places that assume positive weights
and/or alternatively enforce weights > 0 in kernel_density.py validation if you
prefer failing earlier.
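As an illustration of the zero-weight point, here is a minimal Python sketch of one possible policy: a sample with zero weight is dropped from the log-space sum, so log(0) is never evaluated. This is an assumption about the intended semantics, not the code in kde.cu, and the name `weighted_log_terms` is hypothetical.

```python
import math


def weighted_log_terms(log_kernels, weights=None):
    """Return log(w_j * K_j) for every sample that can contribute to the sum."""
    if weights is None:
        return list(log_kernels)
    if len(weights) != len(log_kernels):
        raise ValueError("weights and log_kernels must have the same length")
    terms = []
    for log_k, w in zip(log_kernels, weights):
        if w < 0.0:
            raise ValueError("sample weights must be non-negative")
        if w > 0.0:
            terms.append(log_k + math.log(w))
        # w == 0.0: the sample contributes nothing, so log(0) is never taken
    return terms
```

Dropping the term is equivalent to a contribution of exp(-inf) = 0. Rejecting non-positive weights during Python-side validation, as the comment also suggests, would be the stricter alternative.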

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 439759da-d557-4f91-95c2-c3c10d90adcb

📥 Commits

Reviewing files that changed from the base of the PR and between 7ae3918 and 0de59e9.

📒 Files selected for processing (3)

cpp/include/cuml/neighbors/kde.hpp
cpp/src/kde/kde.cu
python/cuml/cuml/neighbors/kernel_density.py

viclafargue

Tiled processing is a great addition for overall performance. Please add checks in the C++ API for d > 0, n_train > 0, n_query > 0, bandwidth > 0 with RAFT_EXPECTS.

Also, could you add some Pytest tests to check the different metrics and tiling layout for correctness against the reference KDE?
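A brute-force reference of the kind such a test could compare against might look like the sketch below. It covers only the Gaussian kernel with the Euclidean metric, mirrors the requested positivity checks in plain Python instead of RAFT_EXPECTS, and uses a hypothetical function name; it is not the reference used by the PR's test suite.

```python
import math


def reference_gaussian_kde_log_density(train, query, bandwidth):
    """Log-density of each query point under a Gaussian KDE, computed naively."""
    if len(train) == 0 or len(query) == 0:
        raise ValueError("n_train and n_query must be > 0")
    dim = len(train[0])
    if dim == 0:
        raise ValueError("d must be > 0")
    if not bandwidth > 0:
        raise ValueError("bandwidth must be > 0")

    # log of 1 / (n_train * (2 * pi * h^2)^(d / 2))
    log_norm = -math.log(len(train)) - 0.5 * dim * math.log(
        2.0 * math.pi * bandwidth**2
    )
    out = []
    for q in query:
        log_terms = [
            -sum((a - b) ** 2 for a, b in zip(q, t)) / (2.0 * bandwidth**2)
            for t in train
        ]
        m = max(log_terms)  # stabilise the sum of exponentials
        out.append(m + math.log(sum(math.exp(v - m) for v in log_terms)) + log_norm)
    return out
```

A test could then evaluate the GPU result on small random inputs and assert that it matches this reference within a floating-point tolerance.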

cjnolet · 2026-03-13T12:53:15Z

Hey @Intron7, we have a kernel gram API in cuVS that handles pairwise distance / Gramian computations for the other kernel methods like SVR/SVM. Rather than scattering these implementations across cuML and cuVS, we should really be aiming to consolidate them into a shared API of sorts, even if they end up dispatching to different impls at first. Just want to make sure we are representing algorithms with as much composability and reuse as possible.

coderabbitai

♻️ Duplicate comments (1)

cpp/src/kde/kde.cu (1)
84-91: ⚠️ Potential issue | 🟡 Minor

Minkowski distance: division by near-zero p remains unguarded.

If metric_arg (p) is zero or very close to zero, T(1) / p in finalize will produce infinity or extreme values. While this is an edge case (callers typically use p ≥ 1), consider either:

Adding input validation in score_samples to require p > 0 when metric is Minkowski, or

Documenting the constraint in the API.
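The first option can be sketched in a few lines of Python. This is an illustration of validating p before the 1/p exponent is used, not the DistOp code in kde.cu, and the function name is hypothetical.

```python
def minkowski_distance(x, y, p):
    """Minkowski distance between two equal-length vectors, requiring p > 0."""
    if not p > 0:  # also rejects NaN, since NaN > 0 is False
        raise ValueError("Minkowski p must be > 0")
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    acc = sum(abs(a - b) ** p for a, b in zip(x, y))  # accumulate step
    return acc ** (1.0 / p)                           # finalize step


print(minkowski_distance([0.0, 0.0], [3.0, 4.0], 2.0))
```

Failing early with a clear message keeps the division by p out of the hot path and makes the constraint visible to callers.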
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/kde/kde.cu` around lines 84 - 91, The finalize implementation of
DistOp for ML::distance::DistanceType::LpUnexpanded uses T(1)/p which can divide
by zero or near-zero p; update validation to ensure metric_arg (p) > 0 before
computing the power and surface the error to callers (e.g., in the score_samples
caller path) or clamp/handle tiny p values: add an explicit check for p <= 0 (or
p < epsilon) and return/report an error or fallback behavior, and document the
constraint for LpUnexpanded; reference DistOp<T,
ML::distance::DistanceType::LpUnexpanded>::finalize, accumulate, and the
score_samples codepath that provides metric_arg.

🧹 Nitpick comments (1)

python/cuml/tests/test_kernel_density.py (1)

346-353: Replace ambiguous multiplication sign in docstring.

Static analysis (RUF002) flags the × character as ambiguous. Consider using x or spelling out "by" for clarity.

Suggested fix

-def test_all_kernels_all_metrics(metric, kernel):
-    """Every metric × kernel combination produces output matching the reference.
+def test_all_kernels_all_metrics(metric, kernel):
+    """Every metric x kernel combination produces output matching the reference.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@python/cuml/tests/test_kernel_density.py` around lines 346 - 353, The
docstring in test_all_kernels_all_metrics uses the ambiguous multiplication sign
"×"; replace it with a clear ASCII alternative such as "x" or the word "by" so
static analysis (RUF002) no longer flags it — update the docstring text inside
the test_all_kernels_all_metrics function accordingly to read e.g. "Every metric
x kernel combination…" or "Every metric by kernel combination…".

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@cpp/src/kde/kde.cu`:
- Around line 84-91: The finalize implementation of DistOp for
ML::distance::DistanceType::LpUnexpanded uses T(1)/p which can divide by zero or
near-zero p; update validation to ensure metric_arg (p) > 0 before computing the
power and surface the error to callers (e.g., in the score_samples caller path)
or clamp/handle tiny p values: add an explicit check for p <= 0 (or p < epsilon)
and return/report an error or fallback behavior, and document the constraint for
LpUnexpanded; reference DistOp<T,
ML::distance::DistanceType::LpUnexpanded>::finalize, accumulate, and the
score_samples codepath that provides metric_arg.

---

Nitpick comments:
In `@python/cuml/tests/test_kernel_density.py`:
- Around line 346-353: The docstring in test_all_kernels_all_metrics uses the
ambiguous multiplication sign "×"; replace it with a clear ASCII alternative
such as "x" or the word "by" so static analysis (RUF002) no longer flags it —
update the docstring text inside the test_all_kernels_all_metrics function
accordingly to read e.g. "Every metric x kernel combination…" or "Every metric
by kernel combination…".

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 69410cce-7cea-4904-be89-0c89f04c6bde

📥 Commits

Reviewing files that changed from the base of the PR and between e1a66b3 and 8827ad5.

📒 Files selected for processing (3)

cpp/src/kde/kde.cu
python/cuml/cuml/neighbors/kernel_density.py
python/cuml/tests/test_kernel_density.py

Intron7 · 2026-03-13T17:17:40Z

rapidsai/cuvs#1915 is now needed for this PR, since I moved the kernel to cuVS @cjnolet

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@python/cuml/tests/test_kernel_density.py`:
- Around line 382-389: In the docstring for test_all_kernels_all_metrics replace
the Unicode multiplication sign "×" with a plain ASCII "x" to avoid ambiguity
and ensure consistent encoding/reading across tools; update the string in the
function test_all_kernels_all_metrics accordingly so it reads "metric x kernel"
(or similar) instead of using the Unicode multiplication character.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: b311b680-ea71-4df1-9695-d61ee7bbc297

📥 Commits

Reviewing files that changed from the base of the PR and between 8827ad5 and 2951567.

📒 Files selected for processing (4)

cpp/include/cuml/neighbors/kde.hpp
cpp/src/kde/kde.cu
python/cuml/cuml/neighbors/kde.pyx
python/cuml/tests/test_kernel_density.py

🚧 Files skipped from review as they are similar to previous changes (2)

cpp/src/kde/kde.cu
cpp/include/cuml/neighbors/kde.hpp

jcrist

Just a few fixups needed.

jcrist · 2026-05-21T16:53:42Z

/ok to test 1814b4e

Intron7 · 2026-05-21T19:28:35Z

@jcrist I don't think the Dask failure is related.

csadorf · 2026-05-26T15:13:01Z

/ok to test 94f769f

Co-authored-by: Victor Lafargue <viclafargue@nvidia.com>

jcrist · 2026-05-26T18:04:45Z

/ok to test e93e492

Requests resolved already in upstream cuvs

jcrist

Thanks @Intron7! I've pushed a few fixups addressing my concerns to this branch. Provided tests pass, IMO this is good to go! Glad to have this in, thanks again!

jcrist · 2026-05-26T19:06:36Z

/merge

jameslamb

Approving for packaging-codeowners, the CMake changes are small and non-controversial. I did not closely review anything else, it seems well-covered by other reviewers.

csadorf · 2026-05-26T22:01:28Z

/merge

Intron7 requested review from a team as code owners February 26, 2026 17:48

Intron7 requested review from betatim, jinsolp, robertmaynard and viclafargue February 26, 2026 17:48

github-actions Bot added the Cython / Python, CMake, and CUDA/C++ labels Feb 26, 2026

github-actions Bot assigned Intron7 Feb 26, 2026

coderabbitai Bot reviewed Feb 26, 2026

Comment thread cpp/src/kde/kde.cu Outdated

jcrist requested review from jcrist and removed request for robertmaynard February 27, 2026 05:01

viclafargue reviewed Mar 4, 2026

Comment thread cpp/src/kde/kde.cu Outdated

Comment thread cpp/src/kde/kde.cu Outdated

Comment thread cpp/src/kde/kde.cu Outdated

Comment thread cpp/src/kde/kde.cu Outdated

Comment thread cpp/src/kde/kde.cu Outdated

coderabbitai Bot reviewed Mar 5, 2026

Comment thread cpp/src/kde/kde.cu Outdated

Comment thread cpp/src/kde/kde.cu Outdated

Comment thread cpp/src/kde/kde.cu Outdated

viclafargue reviewed Mar 9, 2026

Comment thread cpp/src/kde/kde.cu Outdated

Comment thread cpp/src/kde/kde.cu Outdated

cjnolet previously requested changes Mar 13, 2026

Comment thread python/cuml/cuml/neighbors/kde.pyx Outdated

Comment thread python/cuml/cuml/neighbors/kde.pyx Outdated

coderabbitai Bot reviewed Mar 13, 2026

Intron7 mentioned this pull request Mar 13, 2026

Add KDE kernel rapidsai/cuvs#1915

Merged

coderabbitai Bot reviewed Mar 13, 2026

Comment thread python/cuml/tests/test_kernel_density.py

jcrist added the improvement and non-breaking labels May 20, 2026

jcrist reviewed May 20, 2026

jcrist requested changes May 21, 2026

Comment thread python/cuml/cuml/neighbors/kernel_density.pyx Outdated

Comment thread python/cuml/cuml/neighbors/kernel_density.pyx Outdated

Comment thread python/cuml/cuml/neighbors/kernel_density.pyx Outdated

Intron7 and others added 11 commits May 26, 2026 12:48

add refactor

d52a211

update kernel

4a6006e

Update cpp/src/kde/kde.cu

0d1a397

Co-authored-by: Victor Lafargue <viclafargue@nvidia.com>

update test and adress coderabbit

103c1b2

move kernel to cuvs

5243ecf

update for new cuvs API

96490a8

add docstring and int64_t

f4f1784

add russellrao

5d6e337

fix build and linking

1e673e2

Address KDE review feedback

abe1a4b

Fixups

e93e492

jcrist force-pushed the refactor-kernel-density branch from 94f769f to e93e492 Compare May 26, 2026 18:04

jcrist approved these changes May 26, 2026

csadorf removed the request for review from jinsolp May 26, 2026 21:44

jameslamb approved these changes May 26, 2026

csadorf approved these changes May 26, 2026

rapids-bot Bot merged commit fea609a into rapidsai:release/26.06 May 26, 2026
102 checks passed

coderabbitai Bot mentioned this pull request May 27, 2026

Fix KDE score_samples symbol exports #8173

Merged
