Add GPU support for find_in_set by viadea · Pull Request #14889 · NVIDIA/spark-rapids

viadea · 2026-05-26T21:59:54Z

Description

This adds GPU support for Spark find_in_set, which previously fell back because the accelerator did not register a GPU expression for org.apache.spark.sql.catalyst.expressions.FindInSet.

The implementation registers FindInSet in GpuOverrides and adds GpuFindInSet. For a columnar set argument, the implementation splits the comma-delimited set string, looks up the left-hand string in the resulting list, and returns Spark's 1-based index result as an integer.

For a scalar/literal set argument, the implementation avoids expanding and splitting the same set once per input row. Instead, it builds a small token-to-first-position dictionary for the set and maps the word column through that dictionary on the GPU. Scalar/scalar inputs are evaluated once and expanded to the batch size.
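The literal-set idea above can be modeled on the host in a few lines. This is only a sketch of the technique, not the GpuFindInSet code: the helper names `build_first_position_dict` and `lookup_words` are hypothetical, the set literal is assumed to be non-null, and the real implementation performs the mapping on the GPU rather than in a Python loop.

```python
from typing import Dict, List, Optional


def build_first_position_dict(set_literal: str) -> Dict[str, int]:
    """Split the literal once and remember the first 1-based position of each token."""
    positions: Dict[str, int] = {}
    for position, token in enumerate(set_literal.split(","), start=1):
        # setdefault keeps the earliest position when a token is duplicated.
        positions.setdefault(token, position)
    return positions


def lookup_words(words: List[Optional[str]],
                 positions: Dict[str, int]) -> List[Optional[int]]:
    """Map every word through the dictionary; nulls stay null, misses become 0."""
    # A word containing a comma can never be a key, so it naturally maps to 0.
    return [None if w is None else positions.get(w, 0) for w in words]


positions = build_first_position_dict("a,b,a,,c")
print(lookup_words(["a", "c", None, "z", ""], positions))  # [1, 5, None, 0, 4]
```

The dictionary is built once per expression rather than once per row, which is the point of the optimization: per-row work reduces to a single lookup.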

For a scalar word and columnar set argument, the implementation now has two GPU paths. When a Spark RAPIDS JNI build exposes StringUtils.findInSet, it calls that native single-pass GPU implementation for the dynamic RHS column. Until that JNI dependency is available, the code falls back to the low-cardinality repeated RHS path added here: it computes distinct RHS set strings per batch, evaluates the scalar word position once per distinct set on the host, and gathers the row results back on the GPU. A single-distinct RHS batch fills the output from one scalar position. High-cardinality RHS batches fall back to the generic split/list implementation to avoid copying large set columns to the host.
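The repeated-RHS fallback described above can be illustrated with a host-side model. This is a sketch under assumptions: the function names and the `max_distinct` threshold value are hypothetical (the PR text does not state the cutoff), the word is assumed non-null, and the real code does the distinct and gather steps on the GPU.

```python
from typing import List, Optional


def position_in(word: str, set_str: str) -> int:
    """1-based position of word in a comma-delimited set, or 0 if absent."""
    if "," in word:
        return 0
    tokens = set_str.split(",")
    return tokens.index(word) + 1 if word in tokens else 0


def find_scalar_word(word: str, sets: List[Optional[str]],
                     max_distinct: int = 1024) -> List[Optional[int]]:
    # max_distinct is an illustrative threshold, not a value taken from the PR.
    distinct = list(dict.fromkeys(s for s in sets if s is not None))
    if len(distinct) > max_distinct:
        # High cardinality: evaluate row by row (stands in for the generic split/list path).
        return [None if s is None else position_in(word, s) for s in sets]
    if len(distinct) == 1:
        # Single distinct set: compute one position and fill the output with it.
        pos = position_in(word, distinct[0])
        return [None if s is None else pos for s in sets]
    # Low cardinality: evaluate once per distinct set, then gather per row.
    per_set = {s: position_in(word, s) for s in distinct}
    return [None if s is None else per_set[s] for s in sets]


print(find_scalar_word("32", ["1,32", "32", None, "1,32", "5"]))  # [2, 1, None, 2, 0]
```

All three branches return the same answers; they differ only in how much work is repeated, which is why the choice can be made per batch.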

The native JNI follow-up is NVIDIA/spark-rapids-jni#4636. It keeps the Spark-specific find_in_set semantics in Spark RAPIDS JNI: null RHS rows remain null, missing values return 0, duplicate tokens return the first position, empty tokens are preserved, and words containing commas return 0. This is the preferred best-performance path for dynamic RHS columns. A cuDF PR is not filed yet because the current API is Spark-specific; the implementation can be promoted to libcudf if JNI review asks for a reusable cuDF string primitive.

The implementation handles scalar and columnar input combinations and preserves null-intolerant behavior, empty-token behavior, duplicate-token first-match behavior, and Spark's rule that words containing commas return 0.
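For reference, the semantics listed above can be written as a small pure-Python model. It is a sketch of the rules as stated in this description, not the GPU or JNI code, and the function name is chosen here only for illustration.

```python
from typing import Optional


def find_in_set(word: Optional[str], set_str: Optional[str]) -> Optional[int]:
    """Model of the find_in_set rules described in this PR."""
    # Null-intolerant: a null on either side yields null.
    if word is None or set_str is None:
        return None
    # A word that contains a comma returns 0.
    if "," in word:
        return 0
    # str.split(",") preserves empty tokens, e.g. "a,,b" -> ["a", "", "b"].
    for position, token in enumerate(set_str.split(","), start=1):
        if token == word:
            return position  # first match wins for duplicate tokens
    return 0  # missing value


print(find_in_set("b", "a,b,c"))   # 2
print(find_in_set("", "a,,b"))     # 2
print(find_in_set("a,b", "a,b"))   # 0
```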

This is a user-facing support change, so the generated supported-operator documentation and advanced configuration metadata are updated to include find_in_set.

Testing and performance

Test coverage adds test_find_in_set in string_test.py, covering literal and column inputs, missing values, empty strings, words containing commas, nulls, duplicate-token first-match behavior, and multibyte characters.

Validation:

git diff --check
Spark 3.5.3 / Scala 2.12 dist jar built from this branch with build/buildall --profile=353 --module=dist --parallel=1
Spark-side compile after direct-native dispatch change: JAVA_HOME=/opt/homebrew/opt/openjdk@17 mvn -pl sql-plugin -am -Dbuildver=353 -DskipTests -Dmaven.scaladoc.skip compile
Jenkins Spark RAPIDS dev jar build #243 from commit f0e723efb0a7d65060996f7f5a266cef2088284f, with the RAPIDS JNI jar from NVIDIA/spark-rapids-jni#4636 (Add native find_in_set utility) at commit 42847fdb0720163c71453ffcd12c65b288fea488
Dataproc GPU correctness sanity on a single-node T4 cluster with Dataproc 2.2 / Spark 3.5.3 / Scala 2.12, covering nulls, empty tokens, duplicate tokens, and words containing commas
Dataproc CPU/GPU benchmark on the same single-node T4 cluster shape
Spark RAPIDS JNI draft PR validation for NVIDIA/spark-rapids-jni#4636 (Add native find_in_set utility): signoff, license header, shell check, and pre-commit passed

Dataproc benchmark resources:

Single-node Dataproc 2.2-debian12 cluster, no workers
Master machine type: n1-standard-16 with 16 vCPU, 60 GB memory, and a 200 GB pd-standard boot disk
GPU mode used 1 NVIDIA T4 attached to the same node
CPU mode used the same n1-standard-16 node with spark.rapids.sql.enabled=false; it was not a multi-node CPU cluster
Spark physical plans for these synthetic range benchmarks used 64 input partitions

Literal RHS benchmark shape:

Expression: find_in_set(cast(pmod(id, 64) as string), '0,1,...,63')
Rows: 10,000,000
Result aggregate: sum(pos), count(1)
Correctness: GPU and CPU both produced row_count=10000000 and sum_pos=325000000
GPU executed plan included GpuProject [find_in_set(...)], GpuHashAggregate, and GpuRange
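The reported sum_pos can be checked by hand. The sketch below assumes the set literal is exactly the tokens 0 through 63 in order, so the token for value k sits at position k + 1; because 10,000,000 is a multiple of 64, every residue appears equally often regardless of where the id range starts.

```python
ROWS = 10_000_000
SET_SIZE = 64  # tokens '0'..'63', assumed to be in ascending order

full_cycles, remainder = divmod(ROWS, SET_SIZE)
# The row count divides evenly, so each of the 64 positions is hit equally often.
assert remainder == 0

# One full cycle of residues 0..63 contributes positions 1..64.
sum_one_cycle = SET_SIZE * (SET_SIZE + 1) // 2
expected_sum_pos = full_cycles * sum_one_cycle

print(full_cycles, sum_one_cycle, expected_sum_pos)  # 156250 2080 325000000
```

The result matches the sum_pos=325000000 produced by both the GPU and CPU runs.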

Mode	Iteration 1	Iteration 2	Iteration 3	Warm avg
GPU enabled	31.547s	0.578s	0.481s	0.530s
RAPIDS SQL disabled	28.651s	0.943s	0.840s	0.892s

The literal-RHS optimized path shows about a 1.7x speedup over CPU for warm iterations (0.892s vs 0.530s, averaged over iterations 2 and 3) in this 1 T4 vs 16 vCPU operator-level benchmark.

Dynamic RHS column benchmark shape:

Expression: find_in_set('32', token_set)
Rows: 20,000,000
Partitions: 64
Warmup iterations: 2
Measured iterations: 5
Result aggregate: sum(pos)
RHS column cached before the measured loop to keep token_set as a real column and avoid Catalyst folding into a literal result
GPU executed plan included GpuProject [find_in_set(32, token_set)]
GPU mode used spark.plugins=com.nvidia.spark.SQLPlugin, 1 executor, 15 executor cores, 32 GB executor memory, and 1 T4
CPU mode used RAPIDS disabled, 2 executors x 8 cores, and 18 GB executor memory per executor on the same n1-standard-16 node

Case	GPU warm avg	CPU warm avg	GPU vs CPU (CPU time / GPU time)
1 distinct RHS set string	0.792s	0.588s	0.74x
2 distinct RHS set strings	0.655s	0.450s	0.69x
5,000 distinct RHS strings	0.663s	0.439s	0.66x
60,000 distinct RHS strings	0.818s	0.420s	0.51x

Measured GPU iteration seconds:

Case	Iteration 1	Iteration 2	Iteration 3	Iteration 4	Iteration 5
1 distinct RHS set string	0.944s	0.777s	0.706s	0.742s	0.790s
2 distinct RHS set strings	0.745s	0.629s	0.611s	0.582s	0.708s
5,000 distinct RHS strings	0.654s	0.669s	0.722s	0.625s	0.642s
60,000 distinct RHS strings	0.852s	0.809s	0.807s	0.802s	0.822s

Measured CPU iteration seconds:

Case	Iteration 1	Iteration 2	Iteration 3	Iteration 4	Iteration 5
1 distinct RHS set string	0.665s	0.518s	0.580s	0.581s	0.597s
2 distinct RHS set strings	0.476s	0.441s	0.431s	0.453s	0.448s
5,000 distinct RHS strings	0.433s	0.453s	0.455s	0.422s	0.434s
60,000 distinct RHS strings	0.414s	0.422s	0.366s	0.473s	0.424s
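The per-iteration tables above can be cross-checked against the summary table with a short script. It assumes the warm average is the plain mean of the five measured iterations and that the "GPU vs CPU" column is CPU time divided by GPU time, so values below 1.0 mean the GPU run was slower; both assumptions are consistent with the published numbers.

```python
from statistics import mean

# Measured iteration seconds copied from the two tables above.
gpu = {
    "1 distinct": [0.944, 0.777, 0.706, 0.742, 0.790],
    "2 distinct": [0.745, 0.629, 0.611, 0.582, 0.708],
    "5,000 distinct": [0.654, 0.669, 0.722, 0.625, 0.642],
    "60,000 distinct": [0.852, 0.809, 0.807, 0.802, 0.822],
}
cpu = {
    "1 distinct": [0.665, 0.518, 0.580, 0.581, 0.597],
    "2 distinct": [0.476, 0.441, 0.431, 0.453, 0.448],
    "5,000 distinct": [0.433, 0.453, 0.455, 0.422, 0.434],
    "60,000 distinct": [0.414, 0.422, 0.366, 0.473, 0.424],
}

summary = {}
for case in gpu:
    gpu_avg = mean(gpu[case])
    cpu_avg = mean(cpu[case])
    # Ratio < 1.0 means the GPU run took longer than the CPU run.
    summary[case] = (gpu_avg, cpu_avg, cpu_avg / gpu_avg)

for case, (gpu_avg, cpu_avg, ratio) in summary.items():
    print(f"{case}: GPU {gpu_avg:.3f}s, CPU {cpu_avg:.3f}s, GPU vs CPU {ratio:.2f}x")
```

The recomputed averages agree with the summary table to within about 0.001s, which is consistent with rounding of the displayed per-iteration values.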

The dynamic RHS native path is much faster than the earlier generic split/list fallback (which is not measured in the tables above), but it is still slower than the 16-vCPU CPU baseline on this cached operator-level benchmark, at 0.51x to 0.74x of CPU speed depending on RHS cardinality. The direct native path also improves the low-cardinality/repeated RHS case by avoiding per-batch distinct/dictionary setup. The remaining gap suggests further native-side work is needed, likely a lower-overhead libcudf string primitive or a kernel path specialized for scalar word search in comma-delimited strings.

Checklists

Documentation

Updated for new or modified user-facing features or behaviors
No user-facing change

Testing

Added or modified tests to cover new code paths
Covered by existing tests
(Please provide the names of the existing tests in the PR description.)
Not required

Performance

Tests ran and results are added in the PR description
Issue filed with a link in the PR description
Not required

Signed-off-by: Hao Zhu <hazhu@nvidia.com>

Add GPU support for find_in_set

4bac393

Signed-off-by: Hao Zhu <hazhu@nvidia.com>

viadea self-assigned this May 26, 2026

viadea added 3 commits May 26, 2026 15:07

Update find_in_set support docs

37fc2d2

Signed-off-by: Hao Zhu <hazhu@nvidia.com>

Update per-shim find_in_set generated files

2029158

Signed-off-by: Hao Zhu <hazhu@nvidia.com>

Reflow supported ops docs for find_in_set

ea6945c

Signed-off-by: Hao Zhu <hazhu@nvidia.com>

sameerz added the feature request label May 27, 2026

viadea added 5 commits May 26, 2026 18:05

Optimize find_in_set literal set lookup

f79fc02

Signed-off-by: Hao Zhu <hazhu@nvidia.com>

Optimize find_in_set repeated set column lookup

59cd109

Signed-off-by: Hao Zhu <hazhu@nvidia.com>

Use native find_in_set when available

db961f6

Signed-off-by: Hao Zhu <hazhu@nvidia.com>

Use repeated find_in_set native path

c913e17

Signed-off-by: Hao Zhu <hazhu@nvidia.com>

Prefer direct native find_in_set path

f0e723e

Signed-off-by: Hao Zhu <hazhu@nvidia.com>

Add GPU support for find_in_set #14889
viadea wants to merge 9 commits into NVIDIA:main from viadea:codex/support-find-in-set

3 participants
