Add protobuf integration-test dependency infrastructure (plugin-0) by thirtiseven · Pull Request #14885 · NVIDIA/spark-rapids

thirtiseven · 2026-05-26T06:27:12Z

Part of #14069. First slice carved out of #14354 following the plugin-side split plan in the issue.

Description

Problem

from_protobuf ships in the optional spark-protobuf module, which is not on the integration-test classpath by default. Subsequent from_protobuf GPU PRs need a stable CPU baseline that can:

pull the optional jars on demand without forcing them on every CI lane,
detect missing jars and skip cleanly so unrelated lanes are not broken.

User-facing changes

A new env var on run_pyspark_from_build.sh:

INCLUDE_SPARK_PROTOBUF_JAR (default true). Set it to false to skip the protobuf tests.

Mechanics mirror the existing spark-avro integration:

integration_tests/pom.xml declares spark-protobuf_${scala.binary.version} plus an unshaded protobuf-java (3.25.5) in maven-dependency-plugin. Both are copied into target/dependency/ during the package phase, pinned through Maven (no shell-side version mapping).
run_pyspark_from_build.sh globs both jars from target/dependency/ (or LOCAL_JAR_PATH for prebuilt runs) and appends them to --jars.

spark-protobuf shades its own com.google.protobuf into org.sparkproject.spark_protobuf.protobuf, and Spark does not bundle the unshaded jar — that is why the second protobuf-java artifact is needed (the smoke tests build descriptors via com.google.protobuf.DescriptorProtos).

No new spark-rapids configs and no behavioral change to existing CI lanes.

Solution

Three files in integration_tests/:

pom.xml: two new <artifactItem>s and matching cleanup globs, alongside the existing spark-avro block.
run_pyspark_from_build.sh: glob the copied jars, gate them behind INCLUDE_SPARK_PROTOBUF_JAR, append to ALL_JARS. --jars already reaches both driver and executor classpath in client mode, so no separate --driver-class-path plumbing is needed (avro doesn't have one either).
protobuf_test.py (new): two minimal fallback-only smoke tests that build a FileDescriptorSet through the JVM, hand-encode a couple of proto2 messages, and exercise both the path-based and (Spark 3.5+) binaryDescriptorSet from_protobuf API variants.
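
To illustrate what "hand-encode a couple of proto2 messages" can look like, here is a minimal sketch of a varint-based encoder in plain Python. It is not the code from protobuf_test.py: the helper names and the `Simple` field layout (`int32 id = 1`, `string name = 2`) are assumptions made for the example. Only the wire-format rules (base-128 varints, tag = field number << 3 | wire type, negative int32 sign-extended to 64 bits) come from the protobuf encoding spec.

```python
def encode_varint(value: int) -> bytes:
    """Encode an integer as a protobuf base-128 varint."""
    # Negative int32/int64 values are sign-extended to 64 bits on the wire,
    # so mask into the unsigned 64-bit range before emitting 7-bit groups.
    value &= 0xFFFFFFFFFFFFFFFF
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_tag(field_number: int, wire_type: int) -> bytes:
    """Field key: (field_number << 3) | wire_type, itself a varint."""
    return encode_varint((field_number << 3) | wire_type)


def encode_int32_field(field_number: int, value: int) -> bytes:
    # Wire type 0 = varint
    return encode_tag(field_number, 0) + encode_varint(value)


def encode_string_field(field_number: int, value: str) -> bytes:
    # Wire type 2 = length-delimited
    data = value.encode("utf-8")
    return encode_tag(field_number, 2) + encode_varint(len(data)) + data


# Hypothetical layout: message Simple { optional int32 id = 1; optional string name = 2; }
payload = encode_int32_field(1, 150) + encode_string_field(2, "ab")
# payload bytes: 08 96 01 12 02 61 62
```

A negative int32 such as -1 comes out as ten bytes under this scheme (nine 0xFF bytes followed by 0x01), which is the case the 64-bit mask exists for.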

No GPU code is touched. GPU support for from_protobuf is not yet enabled, so the smoke tests use @allow_non_gpu("ProjectExec", "ProtobufDataToCatalyst") + assert_gpu_fallback_collect to verify the plugin falls back to CPU while still producing correct results.

Scope-limited intentionally — proto resource files, the test data generator helpers, and the full protobuf_test.py from #14354 belong to later slices (plugin-1a / 1b / 1c).

Testing

New tests in integration_tests/src/main/python/protobuf_test.py:

test_from_protobuf_smoke_path_api — path-based descriptor API, all Spark 3.4+
test_from_protobuf_smoke_binary_descriptor_api — binaryDescriptorSet API, auto-skipped on Spark 3.4

Both run through assert_gpu_fallback_collect("ProtobufDataToCatalyst"), which asserts that the GPU plan contains the CPU fallback expression and that CPU/GPU results match.
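
The auto-skip on Spark 3.4 depends on knowing whether the installed `from_protobuf` accepts a `binaryDescriptorSet` argument. A hedged sketch of one way to probe for that with `inspect.signature` follows. The two functions are stand-ins written for this example, not the real PySpark functions, and `supports_parameter` is a hypothetical helper name.

```python
import inspect


def supports_parameter(fn, name: str) -> bool:
    """Return True if callable `fn` declares a parameter called `name`."""
    try:
        return name in inspect.signature(fn).parameters
    except (TypeError, ValueError):
        # Some callables expose no introspectable signature.
        return False


# Stand-in with the path-only shape (assumed to resemble Spark 3.4).
def from_protobuf_path_only(data, messageName, descFilePath=None, options=None):
    return ("path", messageName)


# Stand-in with the extra keyword (assumed to resemble Spark 3.5+).
def from_protobuf_with_binary(data, messageName, descFilePath=None, options=None,
                              binaryDescriptorSet=None):
    return ("binary", messageName)


has_binary_api = supports_parameter(from_protobuf_with_binary, "binaryDescriptorSet")
lacks_binary_api = not supports_parameter(from_protobuf_path_only, "binaryDescriptorSet")
```

In a test module, a check like this could feed a `pytest.mark.skipif` condition so that the binary-descriptor test is skipped rather than failing on older Spark versions.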

Checklists

Documentation

Updated for new or modified user-facing features or behaviors
No user-facing change

Testing

Added or modified tests to cover new code paths
Covered by existing tests
(Please provide the names of the existing tests in the PR description.)
Not required

Performance

Tests ran and results are added in the PR description
Issue filed with a link in the PR description
Not required

Wires the optional `spark-protobuf` module into the integration-test classpath so subsequent from_protobuf PRs have a CPU baseline. Follows the same pattern as `spark-avro`.

* `integration_tests/pom.xml`: declare `spark-protobuf_${scala.binary.version}` and an unshaded `protobuf-java` (3.25.5) in `maven-dependency-plugin` so they are copied into `target/dependency/` during the `package` phase. spark-protobuf is a Spark 3.4.0+ module, so the protobuf copy lives in its own execution gated by `spark.protobuf.skipCopy` (set to `true` by the `release33x` profiles in the root pom). The unshaded `protobuf-java` is required because spark-protobuf shades its own copy into `org.sparkproject.spark_protobuf.protobuf` and Spark itself does not bundle the unshaded jar.
* `run_pyspark_from_build.sh`: glob both jars from `target/dependency/` (or `LOCAL_JAR_PATH`), gate them behind `INCLUDE_SPARK_PROTOBUF_JAR` (default `true`), and append them to `ALL_JARS`. `--jars` already reaches both driver and executor classpath in client mode, so no separate `--driver-class-path` plumbing is needed.
* `protobuf_test.py` (new): two minimal fallback-only smoke tests that build a `FileDescriptorSet` through the JVM, hand-encode a couple of proto2 messages, and exercise both the path-based and (Spark 3.5+) `binaryDescriptorSet` `from_protobuf` API variants. GPU support is not enabled yet, so they use `@allow_non_gpu` + `assert_gpu_fallback_collect` to verify the plugin falls back to CPU while still producing correct results.

No GPU code is changed in this PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

greptile-apps · 2026-05-27T08:39:33Z

Greptile Summary

This PR adds the infrastructure needed to include spark-protobuf and protobuf-java on the integration-test classpath on demand, gated by a new INCLUDE_SPARK_PROTOBUF_JAR env var (default: include). It also introduces two minimal fallback smoke tests that verify from_protobuf falls back to the CPU correctly under spark-rapids.

Maven (pom.xml, integration_tests/pom.xml): a new copy-spark-protobuf execution copies both jars to target/dependency/ during package; skipped by spark.protobuf.skipCopy=true on all pre-3.4 Spark profiles.
run_pyspark_from_build.sh: discovers the two jars via glob, validates exactly 2 resolve, emits a stderr warning if explicitly requested but absent, and appends them to ALL_JARS.
protobuf_test.py: two smoke tests using hand-rolled varint encoding and JVM-side descriptor construction; skip cleanly when the jar or Spark version prerequisites are unmet; verified via assert_gpu_fallback_collect.

Confidence Score: 5/5

Safe to merge — no GPU or production code paths are touched; all changes are integration-test infrastructure and CI opt-in mechanics.

The change is narrowly scoped to test infrastructure: Maven dependency copy, shell jar discovery, and two Python smoke tests. The skip logic is defensive (defaults to no-op on pre-3.4 builds), mirrors the proven avro pattern, and includes explicit user-facing warnings for misconfiguration. No correctness or resource-management concerns were identified.

The hardcoded protobuf-java version in integration_tests/pom.xml and scala2.13/integration_tests/pom.xml is the only item worth a follow-up look, but it does not affect correctness.

Important Files Changed

| Filename | Overview |
| --- | --- |
| integration_tests/pom.xml | Adds maven-dependency-plugin execution to copy spark-protobuf and protobuf-java jars; guards copy behind `spark.protobuf.skipCopy`; adds cleanup globs. `protobuf-java` version hardcoded to 3.25.5 rather than a shared Maven property. |
| integration_tests/run_pyspark_from_build.sh | Adds PROTOBUF_JARS discovery in both LOCAL_JAR_PATH and build-target paths; gates inclusion behind INCLUDE_SPARK_PROTOBUF_JAR (default include); emits stderr warning when explicitly requested but jars are absent; appends to ALL_JARS. Pattern mirrors avro handling. |
| integration_tests/src/main/python/protobuf_test.py | New smoke-test module; skips cleanly when jar or Spark version prerequisites are unmet; correctly uses assert_gpu_fallback_collect; hand-rolled varint encoder handles negative int32 via 64-bit masking per protobuf spec; runtime API detection via inspect.signature for the 3.5+ binaryDescriptorSet path. |
| pom.xml | Sets `spark.protobuf.skipCopy=false` as global default (3.4+ builds) and overrides to `true` in each pre-3.4 profile; correctly limits copy to Spark versions that ship spark-protobuf. |
| scala2.13/integration_tests/pom.xml | Scala 2.13 mirror of integration_tests/pom.xml changes; same hardcoded protobuf-java version concern applies. |
| scala2.13/pom.xml | Scala 2.13 mirror of root pom.xml changes; `spark.protobuf.skipCopy` default and pre-3.4 profile overrides are symmetric with the main pom. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[run_pyspark_from_build.sh] --> B{LOCAL_JAR_PATH set?}
    B -- yes --> C[PROTOBUF_JARS = LOCAL_JAR_PATH/spark-protobuf*.jar + protobuf-java-*.jar]
    B -- no --> D[PROTOBUF_JARS = TARGET_DIR/dependency/spark-protobuf*.jar + protobuf-java-*.jar]
    C --> E{INCLUDE_SPARK_PROTOBUF_JAR != 'false' AND readlink resolves exactly 2 jars?}
    D --> E
    E -- yes --> F[export INCLUDE_SPARK_PROTOBUF_JAR=true Append jars to ALL_JARS]
    E -- no --> G{Was INCLUDE_SPARK_PROTOBUF_JAR explicitly 'true'?}
    G -- yes --> H[stderr WARNING: jars missing export INCLUDE_SPARK_PROTOBUF_JAR=false PROTOBUF_JARS='']
    G -- no --> I[export INCLUDE_SPARK_PROTOBUF_JAR=false PROTOBUF_JARS='']
    F --> J[pytest: protobuf_test.py runs]
    H --> K[pytest: protobuf_test.py skipped]
    I --> K
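
The decision flow in the chart can be restated as a small Python function, purely as an illustration. The real logic lives in the shell script. The function name, return shape, and warning text below are assumptions for this sketch. Only the glob patterns, the "exactly 2 jars" check, and the warn-only-when-explicitly-true behavior are taken from the flowchart.

```python
import glob
import os
from typing import List, Optional, Tuple


def resolve_protobuf_jars(
    jar_dir: str, requested: Optional[str]
) -> Tuple[bool, List[str], Optional[str]]:
    """Return (include, jars, warning).

    `jar_dir` stands for LOCAL_JAR_PATH or TARGET_DIR/dependency.
    `requested` is the raw INCLUDE_SPARK_PROTOBUF_JAR value, or None if unset.
    """
    jars = sorted(
        glob.glob(os.path.join(jar_dir, "spark-protobuf*.jar"))
        + glob.glob(os.path.join(jar_dir, "protobuf-java-*.jar"))
    )
    # Include only when not disabled AND both jars resolved.
    if requested != "false" and len(jars) == 2:
        return True, jars, None
    # Warn only when the caller explicitly asked for the jars.
    warning = None
    if requested == "true":
        warning = (
            "WARNING: INCLUDE_SPARK_PROTOBUF_JAR=true but the "
            "spark-protobuf/protobuf-java jars were not found"
        )
    return False, [], warning
```

With an unset variable and no jars present, the function returns a silent skip. With the variable explicitly set to `true` and no jars, it returns the same skip plus a warning string that a caller could print to stderr.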

Reviews (5): Last reviewed commit: "Drop stale review-context comment"

thirtiseven · 2026-05-27T09:08:41Z

Re: greptile summary — the protobuf-java 3.25.5 literal appears in two poms (integration_tests/pom.xml and scala2.13/integration_tests/pom.xml), but the Scala 2.13 pom is generated from the Scala 2.12 one via build/make-scala-version-build-files.sh, so the two copies are mirror-synced rather than independently maintained. Leaving it as a literal for now to keep this slice minimal; happy to lift it to a parent-pom property in a follow-up if reviewers prefer.

Inline P2 (silent override on missing jars) is addressed in the latest commit.

Surface a stderr warning when the variable is explicitly requested but the spark-protobuf/protobuf-java jars are not present, so a CI misconfiguration is not masked as a silent skip. Default opt-out (unset or false) stays silent. Addresses greptile review feedback on NVIDIA#14885. Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

revans2 · 2026-05-27T16:49:51Z

+
+    def run(spark):
+        return _make_smoke_df(spark).select(
+            from_protobuf_fn(f.col("bin"), "test.Simple", desc_path).alias("d"))


Could we please validate that these tests work on a distributed setup, such as one backed by HDFS? According to AI (which could totally be wrong), desc_path is read as a local file by pyspark. Because desc_path is written under spark_tmp_path, which points at distributed storage on setups like Dataproc, the read could fail.

Good catch, it did not work on HDFS. Updated to spark_tmp_path now.

spark-protobuf's path-based API reads `descFilePath` with `new File(...)` + `FileUtils.readFileToByteArray` (driver-local read), not via Hadoop FileSystem. Writing the descriptor to `spark_tmp_path` worked under local mode because the Hadoop FS defaults to `file://` and resolves to the driver's disk, but on Dataproc / HDFS-backed setups `spark_tmp_path` resolves to a remote URI and the driver's `new File()` cannot read it. Replace the Hadoop FS write with a `tempfile.mkstemp()` on the driver and clean it up via a pytest finalizer. Addresses NVIDIA#14885 review feedback from revans2.

spark-protobuf's path-based API reads `descFilePath` with `new File(...)` + `FileUtils.readFileToByteArray` (driver-local read), not via Hadoop FileSystem. The original implementation wrote the descriptor through Hadoop FS, which only worked in local mode because the default fs is `file://` and resolves to the same driver-local path; on a distributed setup `spark_tmp_path` would resolve to HDFS / GCS and the driver's `new File()` would fail. Switch to plain Python `open()` against `spark_tmp_path`, mirroring the convention already used by `json_fuzz_test.py` and `delta_lake_test.py` (both write driver-local files into `spark_tmp_path` the same way). Addresses NVIDIA#14885 review feedback from revans2.
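
As a rough sketch of the distinction this commit message draws, the snippet below separates paths that carry a remote URI scheme from plain driver-local paths, and writes descriptor bytes with ordinary Python file I/O. Both helper names and the default file name are hypothetical, and this is not the code in protobuf_test.py.

```python
import os
from urllib.parse import urlparse


def is_remote_uri(path: str) -> bool:
    """True if `path` carries a non-file URI scheme such as hdfs:// or gs://."""
    return urlparse(path).scheme not in ("", "file")


def write_descriptor(directory: str, desc_bytes: bytes,
                     name: str = "simple.desc") -> str:
    """Write descriptor bytes with plain Python I/O and return the path."""
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, name)
    with open(path, "wb") as fh:
        fh.write(desc_bytes)
    return path
```

A test could assert `not is_remote_uri(path)` before handing `path` to the path-based API, so that a misconfigured temp location fails with a clear message instead of a file-not-found error on the driver.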

Drop the WHAT/recap halves from the comments introduced earlier in this PR; keep only the WHY parts (spark-protobuf shading and the Spark 3.4.0+ module constraint).

thirtiseven force-pushed the from_protobuf_plugin_0 branch 6 times, most recently from 4cd0cfd to 6023510 Compare May 26, 2026 10:51

thirtiseven force-pushed the from_protobuf_plugin_0 branch from 6023510 to d8e2530 Compare May 26, 2026 10:52

thirtiseven added 2 commits May 27, 2026 11:24

Merge remote-tracking branch 'origin/main' into from_protobuf_plugin_0

9e23bdd

signoff

ae0e557

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

thirtiseven marked this pull request as ready for review May 27, 2026 08:32

thirtiseven requested a review from a team as a code owner May 27, 2026 08:32

greptile-apps Bot reviewed May 27, 2026

Comment thread integration_tests/run_pyspark_from_build.sh Outdated

revans2 reviewed May 27, 2026

thirtiseven force-pushed the from_protobuf_plugin_0 branch from 8a7b4e2 to c04d075 Compare May 28, 2026 07:26

thirtiseven force-pushed the from_protobuf_plugin_0 branch from c04d075 to 989aa98 Compare May 28, 2026 07:28

thirtiseven force-pushed the from_protobuf_plugin_0 branch from 989aa98 to 9c1f33a Compare May 28, 2026 07:33

thirtiseven force-pushed the from_protobuf_plugin_0 branch from 9c1f33a to 3dc8dbb Compare May 28, 2026 07:33

thirtiseven added 2 commits May 28, 2026 15:57

Trim comments to WHY-only

f1f780f

Drop stale review-context comment

899625a

thirtiseven self-assigned this May 29, 2026

thirtiseven requested a review from revans2 May 29, 2026 09:09

thirtiseven wants to merge 7 commits into NVIDIA:main from thirtiseven:from_protobuf_plugin_0

3 participants
