Add GPU support for find_in_set#14889

Draft: viadea wants to merge 9 commits into NVIDIA:main from viadea:codex/support-find-in-set

@viadea viadea commented May 26, 2026

Fixes #8627.

Description

This adds GPU support for Spark find_in_set, which previously fell back because the accelerator did not register a GPU expression for org.apache.spark.sql.catalyst.expressions.FindInSet.

The implementation registers FindInSet in GpuOverrides and adds GpuFindInSet. For a columnar set argument, the implementation splits the comma-delimited set string, looks up the left-hand string in the resulting list, and returns Spark's 1-based index result as an integer.
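
The split-and-look-up behavior described above can be sketched on the host as a small reference function. This is an illustrative model of the semantics only, not the GpuFindInSet code; the name `find_in_set_reference` is hypothetical.

```python
from typing import Optional


def find_in_set_reference(word: Optional[str], str_set: Optional[str]) -> Optional[int]:
    """Host-side model of Spark's find_in_set semantics (illustrative only)."""
    # Null-intolerant: a null on either side yields a null result.
    if word is None or str_set is None:
        return None
    # A word containing a comma can never equal a single token.
    if "," in word:
        return 0
    # str.split keeps empty tokens, so "a,,b" becomes ["a", "", "b"].
    for position, token in enumerate(str_set.split(","), start=1):
        if token == word:
            return position  # first match wins when tokens repeat
    return 0  # missing values return 0
```

For example, `find_in_set_reference("b", "b,a,b")` returns 1 and `find_in_set_reference("", "a,,b")` returns 2, matching the duplicate-token and empty-token rules listed in this description.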

For a scalar/literal set argument, the implementation avoids expanding and splitting the same set once per input row. Instead, it builds a small token-to-first-position dictionary for the set and maps the word column through that dictionary on the GPU. Scalar/scalar inputs are evaluated once and expanded to the batch size.
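
A minimal host-side sketch of the token-to-first-position dictionary idea follows. The function names are hypothetical, and the real implementation performs the mapping on the GPU rather than in a Python loop.

```python
from typing import Dict, List, Optional


def build_position_dict(str_set: str) -> Dict[str, int]:
    """Map each token of a literal set to its first 1-based position."""
    positions: Dict[str, int] = {}
    for position, token in enumerate(str_set.split(","), start=1):
        # setdefault keeps the earliest position for duplicate tokens.
        positions.setdefault(token, position)
    return positions


def find_in_literal_set(words: List[Optional[str]],
                        str_set: Optional[str]) -> List[Optional[int]]:
    """Look up every word against one literal set, splitting the set only once."""
    if str_set is None:
        return [None] * len(words)
    positions = build_position_dict(str_set)
    # Tokens never contain commas, so a word with a comma misses and gets 0.
    return [None if w is None else positions.get(w, 0) for w in words]
```

The dictionary is built once per literal, so the per-row cost is a single lookup instead of a split of the same set string.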

For a scalar word and a columnar set argument, the implementation chooses among the following GPU paths. When a Spark RAPIDS JNI build exposes StringUtils.findInSet, it calls that native single-pass GPU implementation for the dynamic RHS column. Until that JNI dependency is available, the code uses the low-cardinality repeated-RHS path added here: it computes the distinct RHS set strings per batch, evaluates the scalar word position once per distinct set on the host, and gathers the row results back on the GPU. A batch with a single distinct RHS value fills the output from one scalar position. High-cardinality RHS batches use the generic split/list implementation instead, to avoid copying large set columns to the host.
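
The repeated-RHS path can be modeled on the host roughly as follows. This is a sketch under stated assumptions: `MAX_DISTINCT_SETS` is a hypothetical threshold (the PR does not state the actual cutoff), the per-row loop merely stands in for the generic split/list path, and the native StringUtils.findInSet path is not modeled.

```python
from typing import Dict, List, Optional

MAX_DISTINCT_SETS = 1024  # hypothetical cutoff, not taken from the PR


def position_in_set(word: str, str_set: Optional[str]) -> Optional[int]:
    """Position of one word in one set, following find_in_set semantics."""
    if str_set is None:
        return None
    if "," in word:
        return 0
    tokens = str_set.split(",")
    return tokens.index(word) + 1 if word in tokens else 0


def scalar_word_over_sets(word: Optional[str],
                          sets: List[Optional[str]],
                          max_distinct: int = MAX_DISTINCT_SETS) -> List[Optional[int]]:
    if word is None:
        return [None] * len(sets)
    # Dictionary-encode the batch: one code per row, one entry per distinct set.
    distinct: List[Optional[str]] = []
    code_of: Dict[Optional[str], int] = {}
    codes: List[int] = []
    for s in sets:
        if s not in code_of:
            code_of[s] = len(distinct)
            distinct.append(s)
        codes.append(code_of[s])
    if len(distinct) > max_distinct:
        # High cardinality: evaluate every row (stand-in for the generic path).
        return [position_in_set(word, s) for s in sets]
    if len(distinct) == 1:
        # Single distinct set: fill the output from one scalar position.
        return [position_in_set(word, distinct[0])] * len(sets)
    # Evaluate once per distinct set, then gather results back to row order.
    per_distinct = [position_in_set(word, s) for s in distinct]
    return [per_distinct[c] for c in codes]
```

All branches return the same values; they differ only in how many times the word search runs per batch.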

The native JNI follow-up is NVIDIA/spark-rapids-jni#4636. It keeps the Spark-specific find_in_set semantics in Spark RAPIDS JNI: null RHS rows remain null, missing values return 0, duplicate tokens return the first position, empty tokens are preserved, and words containing commas return 0. This is the preferred best-performance path for dynamic RHS columns. A cuDF PR is not filed yet because the current API is Spark-specific; the implementation can be promoted to libcudf if JNI review asks for a reusable cuDF string primitive.

The implementation handles scalar and columnar input combinations and preserves null-intolerant behavior, empty-token behavior, duplicate-token first-match behavior, and Spark's rule that words containing commas return 0.

This is a user-facing support change, so the generated supported-operator documentation and advanced configuration metadata are updated to include find_in_set.

Testing and performance

Test coverage adds test_find_in_set in string_test.py, covering literal and column inputs, missing values, empty strings, words containing commas, nulls, duplicate-token first-match behavior, and multibyte characters.

Validation:

  • git diff --check
  • Spark 3.5.3 / Scala 2.12 dist jar built from this branch with build/buildall --profile=353 --module=dist --parallel=1
  • Spark-side compile after direct-native dispatch change: JAVA_HOME=/opt/homebrew/opt/openjdk@17 mvn -pl sql-plugin -am -Dbuildver=353 -DskipTests -Dmaven.scaladoc.skip compile
  • Jenkins Spark RAPIDS dev jar build #243 from commit f0e723efb0a7d65060996f7f5a266cef2088284f, with the RAPIDS JNI jar from NVIDIA/spark-rapids-jni#4636 (Add native find_in_set utility) at commit 42847fdb0720163c71453ffcd12c65b288fea488
  • Dataproc GPU correctness sanity on a single-node T4 cluster with Dataproc 2.2 / Spark 3.5.3 / Scala 2.12, covering nulls, empty tokens, duplicate tokens, and words containing commas
  • Dataproc CPU/GPU benchmark on the same single-node T4 cluster shape
  • Spark RAPIDS JNI draft PR validation for NVIDIA/spark-rapids-jni#4636 (Add native find_in_set utility): signoff, license header, shell check, and pre-commit passed

Dataproc benchmark resources:

  • Single-node Dataproc 2.2-debian12 cluster, no workers
  • Master machine type: n1-standard-16 with 16 vCPU, 60 GB memory, and a 200 GB pd-standard boot disk
  • GPU mode used 1 NVIDIA T4 attached to the same node
  • CPU mode used the same n1-standard-16 node with spark.rapids.sql.enabled=false; it was not a multi-node CPU cluster
  • Spark physical plans for these synthetic range benchmarks used 64 input partitions

Literal RHS benchmark shape:

  • Expression: find_in_set(cast(pmod(id, 64) as string), '0,1,...,63')
  • Rows: 10,000,000
  • Result aggregate: sum(pos), count(1)
  • Correctness: GPU and CPU both produced row_count=10000000 and sum_pos=325000000
  • GPU executed plan included GpuProject [find_in_set(...)], GpuHashAggregate, and GpuRange

| Mode | Iteration 1 | Iteration 2 | Iteration 3 | Warm avg |
| --- | --- | --- | --- | --- |
| GPU enabled | 31.547s | 0.578s | 0.481s | 0.530s |
| RAPIDS SQL disabled | 28.651s | 0.943s | 0.840s | 0.892s |

The literal-RHS optimized path shows about 1.7x speedup over CPU for warm iterations in this 1 T4 vs 16 vCPU operator-level benchmark.
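
The reported checksum and speedup for the literal-RHS benchmark can be re-derived with a few lines of arithmetic. The sketch assumes that ids start at 0 (as with a Spark range) and that the warm average is the mean of iterations 2 and 3; both are inferences from the figures above rather than statements in this description.

```python
rows = 10_000_000
num_tokens = 64

# Token k (0..63) sits at 1-based position k + 1 in '0,1,...,63',
# so every full cycle of 64 ids contributes 1 + 2 + ... + 64.
sum_per_cycle = sum(range(1, num_tokens + 1))
full_cycles, remainder = divmod(rows, num_tokens)
expected_sum_pos = full_cycles * sum_per_cycle + sum(range(1, remainder + 1))

# Warm averages over iterations 2 and 3 (seconds), from the table above.
gpu_warm = (0.578 + 0.481) / 2
cpu_warm = (0.943 + 0.840) / 2
speedup = cpu_warm / gpu_warm

print(expected_sum_pos, round(gpu_warm, 4), round(cpu_warm, 4), round(speedup, 2))
```

Here `expected_sum_pos` agrees with the reported sum_pos=325000000, and the ratio of the two warm averages rounds to the quoted 1.7x.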

Dynamic RHS column benchmark shape:

  • Expression: find_in_set('32', token_set)
  • Rows: 20,000,000
  • Partitions: 64
  • Warmup iterations: 2
  • Measured iterations: 5
  • Result aggregate: sum(pos)
  • RHS column cached before the measured loop to keep token_set as a real column and avoid Catalyst folding into a literal result
  • GPU executed plan included GpuProject [find_in_set(32, token_set)]
  • GPU mode used spark.plugins=com.nvidia.spark.SQLPlugin, 1 executor, 15 executor cores, 32 GB executor memory, and 1 T4
  • CPU mode used RAPIDS disabled, 2 executors x 8 cores, and 18 GB executor memory per executor on the same n1-standard-16 node

| Case | GPU warm avg | CPU warm avg | GPU vs CPU (CPU avg / GPU avg) |
| --- | --- | --- | --- |
| 1 distinct RHS set string | 0.792s | 0.588s | 0.74x |
| 2 distinct RHS set strings | 0.655s | 0.450s | 0.69x |
| 5,000 distinct RHS strings | 0.663s | 0.439s | 0.66x |
| 60,000 distinct RHS strings | 0.818s | 0.420s | 0.51x |

Measured GPU iteration seconds:

| Case | Iteration 1 | Iteration 2 | Iteration 3 | Iteration 4 | Iteration 5 |
| --- | --- | --- | --- | --- | --- |
| 1 distinct RHS set string | 0.944s | 0.777s | 0.706s | 0.742s | 0.790s |
| 2 distinct RHS set strings | 0.745s | 0.629s | 0.611s | 0.582s | 0.708s |
| 5,000 distinct RHS strings | 0.654s | 0.669s | 0.722s | 0.625s | 0.642s |
| 60,000 distinct RHS strings | 0.852s | 0.809s | 0.807s | 0.802s | 0.822s |

Measured CPU iteration seconds:

| Case | Iteration 1 | Iteration 2 | Iteration 3 | Iteration 4 | Iteration 5 |
| --- | --- | --- | --- | --- | --- |
| 1 distinct RHS set string | 0.665s | 0.518s | 0.580s | 0.581s | 0.597s |
| 2 distinct RHS set strings | 0.476s | 0.441s | 0.431s | 0.453s | 0.448s |
| 5,000 distinct RHS strings | 0.433s | 0.453s | 0.455s | 0.422s | 0.434s |
| 60,000 distinct RHS strings | 0.414s | 0.422s | 0.366s | 0.473s | 0.424s |
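
The GPU vs CPU column in the dynamic-RHS summary appears to be the CPU mean divided by the GPU mean over the five measured iterations, so values below 1.0 mean the GPU run was slower. The sketch below recomputes that ratio from the per-iteration timings listed above; the interpretation of the column is an inference, and recomputed means can differ from the summary in the last digit because the inputs are already rounded to milliseconds.

```python
from statistics import mean

# Measured iteration seconds copied from the GPU and CPU tables above.
gpu_seconds = {
    "1 distinct": [0.944, 0.777, 0.706, 0.742, 0.790],
    "2 distinct": [0.745, 0.629, 0.611, 0.582, 0.708],
    "5,000 distinct": [0.654, 0.669, 0.722, 0.625, 0.642],
    "60,000 distinct": [0.852, 0.809, 0.807, 0.802, 0.822],
}
cpu_seconds = {
    "1 distinct": [0.665, 0.518, 0.580, 0.581, 0.597],
    "2 distinct": [0.476, 0.441, 0.431, 0.453, 0.448],
    "5,000 distinct": [0.433, 0.453, 0.455, 0.422, 0.434],
    "60,000 distinct": [0.414, 0.422, 0.366, 0.473, 0.424],
}

# Ratio below 1.0 means the GPU run took longer than the CPU run.
ratios = {
    case: mean(cpu_seconds[case]) / mean(gpu_seconds[case])
    for case in gpu_seconds
}

for case, ratio in ratios.items():
    print(f"{case}: GPU {mean(gpu_seconds[case]):.3f}s, "
          f"CPU {mean(cpu_seconds[case]):.3f}s, ratio {ratio:.2f}x")
```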

The dynamic RHS native path is dramatically faster than the earlier generic split/list fallback, but it is still slower than the 16-vCPU CPU baseline on this cached operator-level benchmark. The direct native path also improves the low-cardinality/repeated RHS path by avoiding per-batch distinct/dictionary setup, but the remaining gap suggests further native-side work is needed, likely a lower-overhead libcudf string primitive or kernel path specialized for scalar word search in comma-delimited strings.

Checklists

Documentation

  • Updated for new or modified user-facing features or behaviors
  • No user-facing change

Testing

  • Added or modified tests to cover new code paths
  • Covered by existing tests
    (Please provide the names of the existing tests in the PR description.)
  • Not required

Performance

  • Tests ran and results are added in the PR description
  • Issue filed with a link in the PR description
  • Not required

Signed-off-by: Hao Zhu <hazhu@nvidia.com>
@viadea viadea self-assigned this May 26, 2026
viadea added 3 commits May 26, 2026 15:07
Signed-off-by: Hao Zhu <hazhu@nvidia.com>
Signed-off-by: Hao Zhu <hazhu@nvidia.com>
Signed-off-by: Hao Zhu <hazhu@nvidia.com>
@sameerz sameerz added the feature request New feature or request label May 27, 2026
viadea added 5 commits May 26, 2026 18:05
Signed-off-by: Hao Zhu <hazhu@nvidia.com>
Signed-off-by: Hao Zhu <hazhu@nvidia.com>
Signed-off-by: Hao Zhu <hazhu@nvidia.com>
Signed-off-by: Hao Zhu <hazhu@nvidia.com>
Signed-off-by: Hao Zhu <hazhu@nvidia.com>

Development

Successfully merging this pull request may close these issues.

[FEA] Support find_in_set

3 participants