Add GPU support for find_in_set #14889
Draft
viadea wants to merge 9 commits into
Signed-off-by: Hao Zhu <hazhu@nvidia.com>
Fixes #8627.
Description
This adds GPU support for Spark `find_in_set`, which previously fell back because the accelerator did not register a GPU expression for `org.apache.spark.sql.catalyst.expressions.FindInSet`.
The implementation registers `FindInSet` in `GpuOverrides` and adds `GpuFindInSet`. For a columnar set argument, the implementation splits the comma-delimited set string, looks up the left-hand string in the resulting list, and returns Spark's 1-based index result as an integer.
For a scalar/literal set argument, the implementation avoids expanding and splitting the same set once per input row. Instead, it builds a small token-to-first-position dictionary for the set and maps the word column through that dictionary on the GPU. Scalar/scalar inputs are evaluated once and expanded to the batch size.
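
As a rough host-side illustration of the literal-set path described above, the sketch below builds the token-to-first-position dictionary once and maps a word column through it. It is plain Python, not the `GpuFindInSet` code; the function names `build_first_position` and `find_in_literal_set` are hypothetical and exist only for this example.

```python
from typing import Dict, List, Optional


def build_first_position(set_str: str) -> Dict[str, int]:
    """Map each token of a comma-delimited set to its first 1-based position."""
    positions: Dict[str, int] = {}
    for index, token in enumerate(set_str.split(","), start=1):
        # setdefault keeps the first occurrence, so duplicate tokens
        # resolve to their earliest position.
        positions.setdefault(token, index)
    return positions


def find_in_literal_set(
    words: List[Optional[str]], set_str: Optional[str]
) -> List[Optional[int]]:
    """Look up every word in one literal set; the set is split only once."""
    if set_str is None:
        return [None] * len(words)
    positions = build_first_position(set_str)
    # Tokens never contain a comma, so a word with a comma misses and yields 0.
    return [None if word is None else positions.get(word, 0) for word in words]


print(find_in_literal_set(["b", "z", None, "", "a,b"], "a,b,,b"))
```

The dictionary has at most one entry per distinct token, so its size depends on the literal set rather than on the number of input rows.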
For a scalar word and columnar set argument, the implementation now has two GPU paths. When a Spark RAPIDS JNI build exposes `StringUtils.findInSet`, it calls that native single-pass GPU implementation for the dynamic RHS column. Until that JNI dependency is available, the code falls back to the low-cardinality repeated RHS path added here: it computes distinct RHS set strings per batch, evaluates the scalar word position once per distinct set on the host, and gathers the row results back on the GPU. A single-distinct RHS batch fills the output from one scalar position. High-cardinality RHS batches fall back to the generic split/list implementation to avoid copying large set columns to the host.
The native JNI follow-up is NVIDIA/spark-rapids-jni#4636. It keeps the Spark-specific `find_in_set` semantics in Spark RAPIDS JNI: null RHS rows remain null, missing values return 0, duplicate tokens return the first position, empty tokens are preserved, and words containing commas return 0. This is the preferred best-performance path for dynamic RHS columns. A cuDF PR is not filed yet because the current API is Spark-specific; the implementation can be promoted to libcudf if JNI review asks for a reusable cuDF string primitive.
The implementation handles scalar and columnar input combinations and preserves null-intolerant behavior, empty-token behavior, duplicate-token first-match behavior, and Spark's rule that words containing commas return 0.
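
The low-cardinality repeated RHS path can be pictured with the following plain-Python sketch: deduplicate the set strings in a batch, evaluate the scalar word once per distinct set, then gather per-row results. This is only an illustration of the idea. The names `reference_find_in_set` and `scalar_word_over_sets` are hypothetical, and the `max_distinct` threshold is an assumed parameter, since the description does not state how high cardinality is detected.

```python
from typing import Dict, List, Optional


def reference_find_in_set(word: Optional[str], set_str: Optional[str]) -> Optional[int]:
    """Row-level semantics as listed above: null in, null out; 0 when missing."""
    if word is None or set_str is None:
        return None
    if "," in word:
        return 0
    for position, token in enumerate(set_str.split(","), start=1):
        if token == word:
            return position  # first match wins for duplicate tokens
    return 0


def scalar_word_over_sets(
    word: Optional[str], sets: List[Optional[str]], max_distinct: int = 1024
) -> List[Optional[int]]:
    """Evaluate one scalar word against a column of set strings."""
    slots: Dict[Optional[str], int] = {}
    codes: List[int] = []
    for set_str in sets:
        # Assign each distinct set string a slot and remember the row's slot.
        codes.append(slots.setdefault(set_str, len(slots)))
    if len(slots) > max_distinct:
        # Assumed high-cardinality case: evaluate every row directly.
        return [reference_find_in_set(word, set_str) for set_str in sets]
    # dict preserves insertion order, so results line up with slot numbers.
    per_distinct = [reference_find_in_set(word, set_str) for set_str in slots]
    return [per_distinct[code] for code in codes]  # gather back to rows


print(scalar_word_over_sets("32", ["1,32", None, "1,32", "32", "7"]))
```

Both branches return the same values; the deduplicated branch only reduces how many times the set string is split and scanned.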
This is a user-facing support change, so the generated supported-operator documentation and advanced configuration metadata are updated to include `find_in_set`.
Testing and performance
Test coverage adds `test_find_in_set` in `string_test.py`, covering literal and column inputs, missing values, empty strings, words containing commas, nulls, duplicate-token first-match behavior, and multibyte characters.
Validation:
- `git diff --check`
- `build/buildall --profile=353 --module=dist --parallel=1`
- `JAVA_HOME=/opt/homebrew/opt/openjdk@17 mvn -pl sql-plugin -am -Dbuildver=353 -DskipTests -Dmaven.scaladoc.skip compile`
- `f0e723efb0a7d65060996f7f5a266cef2088284f` with RAPIDS JNI jar from spark-rapids-jni#4636 (Add native find_in_set utility), commit `42847fdb0720163c71453ffcd12c65b288fea488`
Dataproc benchmark resources:
- `n1-standard-16` with 16 vCPU, 60 GB memory, and a 200 GB pd-standard boot disk
- `n1-standard-16` node with `spark.rapids.sql.enabled=false`; it was not a multi-node CPU cluster
- `range` benchmarks used 64 input partitions
Literal RHS benchmark shape:
- `find_in_set(cast(pmod(id, 64) as string), '0,1,...,63')`
- `sum(pos), count(1)`
- `row_count=10000000` and `sum_pos=325000000`
- `GpuProject [find_in_set(...)]`, `GpuHashAggregate`, and `GpuRange`
The literal-RHS optimized path shows about 1.7x speedup over CPU for warm iterations in this 1 T4 vs 16 vCPU operator-level benchmark.
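
The reported `sum_pos=325000000` for the literal RHS benchmark can be cross-checked arithmetically. The sketch below assumes the ids come from a range starting at 0 with step 1, so `pmod(id, 64)` cycles through 0..63 and token `k` sits at position `k + 1` in `'0,1,...,63'`. The helper name `expected_sum_pos` is hypothetical.

```python
def expected_sum_pos(row_count: int, set_size: int = 64) -> int:
    """Sum of 1-based positions when ids 0..row_count-1 cycle through the set."""
    full_cycles, remainder = divmod(row_count, set_size)
    # A full cycle contributes 1 + 2 + ... + set_size.
    per_cycle = set_size * (set_size + 1) // 2
    # A trailing partial cycle contributes 1 + 2 + ... + remainder.
    partial = remainder * (remainder + 1) // 2
    return full_cycles * per_cycle + partial


# Brute-force check on a small input, then the benchmark row count.
assert expected_sum_pos(130) == sum((i % 64) + 1 for i in range(130))
print(expected_sum_pos(10_000_000))  # 325000000
```

With 10,000,000 rows there are exactly 156,250 full cycles of 64, each summing to 2,080, which gives 325,000,000 and agrees with the reported value.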
Dynamic RHS column benchmark shape:
- `find_in_set('32', token_set)`
- `sum(pos)`
- `token_set` as a real column and avoid Catalyst folding into a literal result
- `GpuProject [find_in_set(32, token_set)]`
- `spark.plugins=com.nvidia.spark.SQLPlugin`, 1 executor, 15 executor cores, 32 GB executor memory, and 1 T4
- `n1-standard-16` node
Measured GPU iteration seconds:
Measured CPU iteration seconds:
The dynamic RHS native path is dramatically faster than the earlier generic split/list fallback, but it is still slower than the 16-vCPU CPU baseline on this cached operator-level benchmark. The direct native path also improves the low-cardinality/repeated RHS path by avoiding per-batch distinct/dictionary setup, but the remaining gap suggests further native-side work is needed, likely a lower-overhead libcudf string primitive or kernel path specialized for scalar word search in comma-delimited strings.
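
To make the idea of a scalar word search specialized for comma-delimited strings concrete, here is a per-row, single-pass scan written in plain Python over UTF-8 bytes. It is only a sketch of the technique and not the `StringUtils.findInSet` kernel or any libcudf API; the function name `find_in_set_single_pass` is hypothetical. It walks the set bytes once, tracks the current token's start, and compares only tokens whose length matches the word, so no split list is materialized.

```python
def find_in_set_single_pass(word: bytes, set_bytes: bytes) -> int:
    """Return the 1-based position of word in a comma-delimited byte string, or 0."""
    comma = 0x2C  # ASCII ','; this byte never occurs inside a multibyte UTF-8 sequence
    if comma in word:
        return 0  # words containing commas never match
    position = 1
    token_start = 0
    # Iterate one index past the end so the final token is also checked.
    for i in range(len(set_bytes) + 1):
        if i == len(set_bytes) or set_bytes[i] == comma:
            same_length = (i - token_start) == len(word)
            if same_length and set_bytes[token_start:i] == word:
                return position  # first match wins
            position += 1
            token_start = i + 1
    return 0


print(find_in_set_single_pass(b"32", b"1,32,32"))  # 2
```

Null handling is left to the caller in this sketch; only the non-null byte scan is shown.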
Checklists
Documentation
Testing
(Please provide the names of the existing tests in the PR description.)
Performance