Add protobuf integration-test dependency infrastructure (plugin-0) #14885

Open
thirtiseven wants to merge 7 commits into NVIDIA:main from thirtiseven:from_protobuf_plugin_0

Conversation

@thirtiseven
Collaborator

@thirtiseven thirtiseven commented May 26, 2026

Part of #14069. First slice carved out of #14354 following the plugin-side split plan in the issue.

Description

Problem

from_protobuf ships in the optional spark-protobuf module, which is not on the integration-test classpath by default. Subsequent from_protobuf GPU PRs need a stable CPU baseline that can:

  • pull the optional jars on demand without forcing them on every CI lane,
  • detect missing jars and skip cleanly so unrelated lanes are not broken.

User-facing changes

A new env var on run_pyspark_from_build.sh, enabled by default:

  • INCLUDE_SPARK_PROTOBUF_JAR (default true). Set false to skip protobuf tests.

Mechanics mirror the existing spark-avro integration:

  1. integration_tests/pom.xml declares spark-protobuf_${scala.binary.version} plus an unshaded protobuf-java (3.25.5) in maven-dependency-plugin. Both are copied into target/dependency/ during the package phase, pinned through Maven (no shell-side version mapping).
  2. run_pyspark_from_build.sh globs both jars from target/dependency/ (or LOCAL_JAR_PATH for prebuilt runs) and appends them to --jars.

spark-protobuf shades its own com.google.protobuf into org.sparkproject.spark_protobuf.protobuf, and Spark does not bundle the unshaded jar — that is why the second protobuf-java artifact is needed (the smoke tests build descriptors via com.google.protobuf.DescriptorProtos).

No new spark-rapids configs and no behavioral change to existing CI lanes.

Solution

Three files in integration_tests/:

  • pom.xml: two new <artifactItem>s and matching cleanup globs, alongside the existing spark-avro block.
  • run_pyspark_from_build.sh: glob the copied jars, gate them behind INCLUDE_SPARK_PROTOBUF_JAR, append to ALL_JARS. --jars already reaches both driver and executor classpath in client mode, so no separate --driver-class-path plumbing is needed (avro doesn't have one either).
  • protobuf_test.py (new): two minimal fallback-only smoke tests that build a FileDescriptorSet through the JVM, hand-encode a couple of proto2 messages, and exercise both the path-based and (Spark 3.5+) binaryDescriptorSet from_protobuf API variants.
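
To make the "hand-encode a couple of proto2 messages" step concrete, here is a minimal sketch of protobuf varint and field encoding. The message layout (`int32 id = 1; string name = 2`) and the helper names are assumptions for illustration only, not the actual contents of protobuf_test.py.

```python
def encode_varint(value: int) -> bytes:
    """Encode an integer as a protobuf base-128 varint."""
    # Negative int32/int64 values are sign-extended to 64 bits on the wire,
    # so mask to 64 bits before emitting 7-bit groups.
    value &= 0xFFFFFFFFFFFFFFFF
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_simple(id_value: int, name: str) -> bytes:
    """Encode a hypothetical message: int32 id = 1; string name = 2."""
    payload = name.encode("utf-8")
    id_field = bytes([(1 << 3) | 0]) + encode_varint(id_value)  # wire type 0
    name_field = bytes([(2 << 3) | 2]) + encode_varint(len(payload)) + payload  # wire type 2
    return id_field + name_field
```

A negative id takes ten bytes as a varint, which is why the mask matters: without it the loop would never terminate on a negative Python integer.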

No GPU code is touched. GPU support for from_protobuf is not yet enabled, so the smoke tests use @allow_non_gpu("ProjectExec", "ProtobufDataToCatalyst") + assert_gpu_fallback_collect to verify the plugin falls back to CPU while still producing correct results.

Scope-limited intentionally — proto resource files, the test data generator helpers, and the full protobuf_test.py from #14354 belong to later slices (plugin-1a / 1b / 1c).

Testing

New tests in integration_tests/src/main/python/protobuf_test.py:

  • test_from_protobuf_smoke_path_api: path-based descriptor API, all Spark 3.4+
  • test_from_protobuf_smoke_binary_descriptor_api: binaryDescriptorSet API, auto-skipped on Spark 3.4

Both run through assert_gpu_fallback_collect("ProtobufDataToCatalyst"), which asserts that the GPU plan contains the CPU fallback expression and that CPU/GPU results match.
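
The auto-skip on Spark 3.4 can be driven by inspecting the Python signature of `from_protobuf` at runtime rather than by comparing version strings. The sketch below uses two stand-in functions (hypothetical, named here only for illustration) in place of the real PySpark API so that it runs without Spark installed.

```python
import inspect


def supports_binary_descriptor(fn) -> bool:
    """Return True if `fn` accepts a `binaryDescriptorSet` parameter."""
    return "binaryDescriptorSet" in inspect.signature(fn).parameters


# Stand-ins for the two API shapes; these are assumptions, not PySpark itself.
def from_protobuf_without_binary(data, messageName, descFilePath=None, options=None):
    ...


def from_protobuf_with_binary(data, messageName, descFilePath=None, options=None,
                              binaryDescriptorSet=None):
    ...
```

A test could call `pytest.skip(...)` when `supports_binary_descriptor` returns False for the imported function.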

Checklists

Documentation

  • Updated for new or modified user-facing features or behaviors
  • No user-facing change

Testing

  • Added or modified tests to cover new code paths
  • Covered by existing tests
    (Please provide the names of the existing tests in the PR description.)
  • Not required

Performance

  • Tests ran and results are added in the PR description
  • Issue filed with a link in the PR description
  • Not required

@thirtiseven thirtiseven force-pushed the from_protobuf_plugin_0 branch 6 times, most recently from 4cd0cfd to 6023510 Compare May 26, 2026 10:51
Wires the optional `spark-protobuf` module into the integration-test
classpath so subsequent from_protobuf PRs have a CPU baseline. Follows
the same pattern as `spark-avro`.

* `integration_tests/pom.xml`: declare `spark-protobuf_${scala.binary.version}`
  and an unshaded `protobuf-java` (3.25.5) in `maven-dependency-plugin` so
  they are copied into `target/dependency/` during the `package` phase.
  spark-protobuf is a Spark 3.4.0+ module, so the protobuf copy lives in
  its own execution gated by `spark.protobuf.skipCopy` (set to `true` by
  the `release33x` profiles in the root pom). The unshaded `protobuf-java`
  is required because spark-protobuf shades its own copy into
  `org.sparkproject.spark_protobuf.protobuf` and Spark itself does not
  bundle the unshaded jar.
* `run_pyspark_from_build.sh`: glob both jars from `target/dependency/`
  (or `LOCAL_JAR_PATH`), gate them behind `INCLUDE_SPARK_PROTOBUF_JAR`
  (default `true`), and append them to `ALL_JARS`. `--jars` already
  reaches both driver and executor classpath in client mode, so no
  separate `--driver-class-path` plumbing is needed.
* `protobuf_test.py` (new): two minimal fallback-only smoke tests that
  build a `FileDescriptorSet` through the JVM, hand-encode a couple of
  proto2 messages, and exercise both the path-based and (Spark 3.5+)
  `binaryDescriptorSet` `from_protobuf` API variants. GPU support is not
  enabled yet, so they use `@allow_non_gpu` + `assert_gpu_fallback_collect`
  to verify the plugin falls back to CPU while still producing correct
  results.

No GPU code is changed in this PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@thirtiseven thirtiseven force-pushed the from_protobuf_plugin_0 branch from 6023510 to d8e2530 Compare May 26, 2026 10:52
@thirtiseven thirtiseven marked this pull request as ready for review May 27, 2026 08:32
@thirtiseven thirtiseven requested a review from a team as a code owner May 27, 2026 08:32
@greptile-apps
Contributor

greptile-apps Bot commented May 27, 2026

Greptile Summary

This PR adds the infrastructure needed to include spark-protobuf and protobuf-java on the integration-test classpath on demand, gated by a new INCLUDE_SPARK_PROTOBUF_JAR env var (default: include). It also introduces two minimal fallback smoke tests that verify from_protobuf CPU-falls-back correctly under spark-rapids.

  • Maven (pom.xml, integration_tests/pom.xml): a new copy-spark-protobuf execution copies both jars to target/dependency/ during package; skipped by spark.protobuf.skipCopy=true on all pre-3.4 Spark profiles.
  • run_pyspark_from_build.sh: discovers the two jars via glob, validates exactly 2 resolve, emits a stderr warning if explicitly requested but absent, and appends them to ALL_JARS.
  • protobuf_test.py: two smoke tests using hand-rolled varint encoding and JVM-side descriptor construction; skip cleanly when the jar or Spark version prerequisites are unmet; verified via assert_gpu_fallback_collect.

Confidence Score: 5/5

Safe to merge — no GPU or production code paths are touched; all changes are integration-test infrastructure and CI opt-in mechanics.

The change is narrowly scoped to test infrastructure: Maven dependency copy, shell jar discovery, and two Python smoke tests. The skip logic is defensive (defaults to no-op on pre-3.4 builds), mirrors the proven avro pattern, and includes explicit user-facing warnings for misconfiguration. No correctness or resource-management concerns were identified.

The hardcoded protobuf-java version in integration_tests/pom.xml and scala2.13/integration_tests/pom.xml is the only item worth a follow-up look, but it does not affect correctness.

Important Files Changed

| Filename | Overview |
|---|---|
| integration_tests/pom.xml | Adds maven-dependency-plugin execution to copy spark-protobuf and protobuf-java jars; guards copy behind spark.protobuf.skipCopy; adds cleanup globs. protobuf-java version hardcoded to 3.25.5 rather than a shared Maven property. |
| integration_tests/run_pyspark_from_build.sh | Adds PROTOBUF_JARS discovery in both LOCAL_JAR_PATH and build-target paths; gates inclusion behind INCLUDE_SPARK_PROTOBUF_JAR (default include); emits stderr warning when explicitly requested but jars are absent; appends to ALL_JARS. Pattern mirrors avro handling. |
| integration_tests/src/main/python/protobuf_test.py | New smoke-test module; skips cleanly when jar or Spark version prerequisites are unmet; correctly uses assert_gpu_fallback_collect; hand-rolled varint encoder handles negative int32 via 64-bit masking per protobuf spec; runtime API detection via inspect.signature for the 3.5+ binaryDescriptorSet path. |
| pom.xml | Sets spark.protobuf.skipCopy=false as global default (3.4+ builds) and overrides to true in each pre-3.4 profile; correctly limits copy to Spark versions that ship spark-protobuf. |
| scala2.13/integration_tests/pom.xml | Scala 2.13 mirror of integration_tests/pom.xml changes; same hardcoded protobuf-java version concern applies. |
| scala2.13/pom.xml | Scala 2.13 mirror of root pom.xml changes; spark.protobuf.skipCopy default and pre-3.4 profile overrides are symmetric with the main pom. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[run_pyspark_from_build.sh] --> B{LOCAL_JAR_PATH set?}
    B -- yes --> C[PROTOBUF_JARS = LOCAL_JAR_PATH/spark-protobuf*.jar + protobuf-java-*.jar]
    B -- no --> D[PROTOBUF_JARS = TARGET_DIR/dependency/spark-protobuf*.jar + protobuf-java-*.jar]
    C --> E{INCLUDE_SPARK_PROTOBUF_JAR != 'false' AND readlink resolves exactly 2 jars?}
    D --> E
    E -- yes --> F[export INCLUDE_SPARK_PROTOBUF_JAR=true<br/>Append jars to ALL_JARS]
    E -- no --> G{Was INCLUDE_SPARK_PROTOBUF_JAR explicitly 'true'?}
    G -- yes --> H[stderr WARNING: jars missing<br/>export INCLUDE_SPARK_PROTOBUF_JAR=false<br/>PROTOBUF_JARS='']
    G -- no --> I[export INCLUDE_SPARK_PROTOBUF_JAR=false<br/>PROTOBUF_JARS='']
    F --> J[pytest: protobuf_test.py runs]
    H --> K[pytest: protobuf_test.py skipped]
    I --> K
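
The decision in the flowchart above can be modelled as a small pure function. This is a Python sketch of the gating logic only, under the assumption that the flowchart is accurate; the real implementation is shell code in run_pyspark_from_build.sh, and the function and parameter names here are hypothetical.

```python
from typing import List, Optional, Tuple


def decide_protobuf_jars(requested: Optional[str],
                         resolved_jars: List[str]) -> Tuple[bool, bool]:
    """Return (include, warn) for the protobuf jars.

    requested: value of INCLUDE_SPARK_PROTOBUF_JAR, or None when unset.
    resolved_jars: jar paths that the two globs resolved to.
    """
    if requested != "false" and len(resolved_jars) == 2:
        return True, False
    # Jars are left out; warn only when the caller explicitly asked for them.
    return False, requested == "true"
```

For example, `decide_protobuf_jars("true", [])` returns `(False, True)`: the tests are skipped, but the misconfiguration is reported on stderr instead of being masked.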

Reviews (5): Last reviewed commit: "Drop stale review-context comment"

Comment thread integration_tests/run_pyspark_from_build.sh Outdated
@thirtiseven
Collaborator Author

Re: greptile summary — the protobuf-java 3.25.5 literal appears in two poms (integration_tests/pom.xml and scala2.13/integration_tests/pom.xml), but the Scala 2.13 pom is generated from the Scala 2.12 one via build/make-scala-version-build-files.sh, so the two copies are mirror-synced rather than independently maintained. Leaving it as a literal for now to keep this slice minimal; happy to lift it to a parent-pom property in a follow-up if reviewers prefer.

Inline P2 (silent override on missing jars) is addressed in the latest commit.

Surface a stderr warning when the variable is explicitly requested but
the spark-protobuf/protobuf-java jars are not present, so a CI
misconfiguration is not masked as a silent skip. Default opt-out
(unset or false) stays silent.

Addresses greptile review feedback on NVIDIA#14885.

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

def run(spark):
    return _make_smoke_df(spark).select(
        from_protobuf_fn(f.col("bin"), "test.Simple", desc_path).alias("d"))
Collaborator


Could we please validate that these tests work on a distributed setup like with HDFS? According to AI (which could totally be wrong) desc_path is read as a local file by pyspark. Because desc_path is written under spark_tmp_path, which points at distributed storage on setups like Dataproc, the read could fail.

Collaborator Author


Good catch, it did not work on HDFS. Updated to write the descriptor with plain Python open() under spark_tmp_path now.

thirtiseven added a commit to thirtiseven/spark-rapids that referenced this pull request May 28, 2026
@thirtiseven thirtiseven force-pushed the from_protobuf_plugin_0 branch from 8a7b4e2 to c04d075 Compare May 28, 2026 07:26
thirtiseven added a commit to thirtiseven/spark-rapids that referenced this pull request May 28, 2026
spark-protobuf's path-based API reads `descFilePath` with
`new File(...)` + `FileUtils.readFileToByteArray` (driver-local read),
not via Hadoop FileSystem. Writing the descriptor to `spark_tmp_path`
worked under local mode because the Hadoop FS defaults to `file://` and
resolves to the driver's disk, but on Dataproc / HDFS-backed setups
`spark_tmp_path` resolves to a remote URI and the driver's `new File()`
cannot read it.

Replace the Hadoop FS write with a `tempfile.mkstemp()` on the driver
and clean it up via a pytest finalizer.

Addresses NVIDIA#14885 review feedback from revans2.
@thirtiseven thirtiseven force-pushed the from_protobuf_plugin_0 branch from c04d075 to 989aa98 Compare May 28, 2026 07:28
thirtiseven added a commit to thirtiseven/spark-rapids that referenced this pull request May 28, 2026
@thirtiseven thirtiseven force-pushed the from_protobuf_plugin_0 branch from 989aa98 to 9c1f33a Compare May 28, 2026 07:33
spark-protobuf's path-based API reads `descFilePath` with
`new File(...)` + `FileUtils.readFileToByteArray` (driver-local read),
not via Hadoop FileSystem. The original implementation wrote the
descriptor through Hadoop FS, which only worked in local mode because
the default fs is `file://` and resolves to the same driver-local
path; on a distributed setup `spark_tmp_path` would resolve to
HDFS / GCS and the driver's `new File()` would fail.

Switch to plain Python `open()` against `spark_tmp_path`, mirroring
the convention already used by `json_fuzz_test.py` and
`delta_lake_test.py` (both write driver-local files into
`spark_tmp_path` the same way).

Addresses NVIDIA#14885 review feedback from revans2.
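
As a rough illustration of the driver-local write described in this commit message, the sketch below writes descriptor bytes with plain `open()` and reads them back the same way. A temporary directory stands in for `spark_tmp_path`, and the helper name, file name, and placeholder bytes are assumptions rather than the PR's actual code.

```python
import os
import tempfile


def write_descriptor_local(directory: str, desc_bytes: bytes,
                           name: str = "simple.desc") -> str:
    """Write descriptor bytes with plain file I/O and return the local path."""
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, name)
    with open(path, "wb") as fh:
        fh.write(desc_bytes)
    return path


# A temporary directory stands in for spark_tmp_path (assumption).
with tempfile.TemporaryDirectory() as tmp:
    desc_path = write_descriptor_local(tmp, b"\x0a\x00")  # placeholder bytes
    with open(desc_path, "rb") as fh:
        round_trip = fh.read()
```

Because both the write and the read go through the local filesystem, the path handed to the path-based API is one the driver's `new File(...)` can open.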
@thirtiseven thirtiseven force-pushed the from_protobuf_plugin_0 branch from 9c1f33a to 3dc8dbb Compare May 28, 2026 07:33
Drop the WHAT/recap halves from the comments introduced earlier in this
PR; keep only the WHY parts (spark-protobuf shading and the Spark 3.4.0+
module constraint).
@thirtiseven thirtiseven self-assigned this May 29, 2026
@thirtiseven thirtiseven requested a review from revans2 May 29, 2026 09:09