Add protobuf integration-test dependency infrastructure (plugin-0)#14885
Add protobuf integration-test dependency infrastructure (plugin-0)#14885thirtiseven wants to merge 7 commits into
Conversation
4cd0cfd to
6023510
Compare
Wires the optional `spark-protobuf` module into the integration-test
classpath so subsequent from_protobuf PRs have a CPU baseline. Follows
the same pattern as `spark-avro`.
* `integration_tests/pom.xml`: declare `spark-protobuf_${scala.binary.version}`
and an unshaded `protobuf-java` (3.25.5) in `maven-dependency-plugin` so
they are copied into `target/dependency/` during the `package` phase.
spark-protobuf is a Spark 3.4.0+ module, so the protobuf copy lives in
its own execution gated by `spark.protobuf.skipCopy` (set to `true` by
the `release33x` profiles in the root pom). The unshaded `protobuf-java`
is required because spark-protobuf shades its own copy into
`org.sparkproject.spark_protobuf.protobuf` and Spark itself does not
bundle the unshaded jar.
* `run_pyspark_from_build.sh`: glob both jars from `target/dependency/`
(or `LOCAL_JAR_PATH`), gate them behind `INCLUDE_SPARK_PROTOBUF_JAR`
(default `true`), and append them to `ALL_JARS`. `--jars` already
reaches both driver and executor classpath in client mode, so no
separate `--driver-class-path` plumbing is needed.
* `protobuf_test.py` (new): two minimal fallback-only smoke tests that
build a `FileDescriptorSet` through the JVM, hand-encode a couple of
proto2 messages, and exercise both the path-based and (Spark 3.5+)
`binaryDescriptorSet` `from_protobuf` API variants. GPU support is not
enabled yet, so they use `@allow_non_gpu` + `assert_gpu_fallback_collect`
to verify the plugin falls back to CPU while still producing correct
results.
No GPU code is changed in this PR.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
6023510 to
d8e2530
Compare
Greptile SummaryThis PR adds the infrastructure needed to include
Confidence Score: 5/5Safe to merge — no GPU or production code paths are touched; all changes are integration-test infrastructure and CI opt-in mechanics. The change is narrowly scoped to test infrastructure: Maven dependency copy, shell jar discovery, and two Python smoke tests. The skip logic is defensive (defaults to no-op on pre-3.4 builds), mirrors the proven avro pattern, and includes explicit user-facing warnings for misconfiguration. No correctness or resource-management concerns were identified. The hardcoded Important Files Changed
Flowchart%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[run_pyspark_from_build.sh] --> B{LOCAL_JAR_PATH set?}
B -- yes --> C[PROTOBUF_JARS = LOCAL_JAR_PATH/spark-protobuf*.jar + protobuf-java-*.jar]
B -- no --> D[PROTOBUF_JARS = TARGET_DIR/dependency/spark-protobuf*.jar + protobuf-java-*.jar]
C --> E{INCLUDE_SPARK_PROTOBUF_JAR != 'false' AND readlink resolves exactly 2 jars?}
D --> E
E -- yes --> F[export INCLUDE_SPARK_PROTOBUF_JAR=true Append jars to ALL_JARS]
E -- no --> G{Was INCLUDE_SPARK_PROTOBUF_JAR explicitly 'true'?}
G -- yes --> H[stderr WARNING: jars missing export INCLUDE_SPARK_PROTOBUF_JAR=false PROTOBUF_JARS='']
G -- no --> I[export INCLUDE_SPARK_PROTOBUF_JAR=false PROTOBUF_JARS='']
F --> J[pytest: protobuf_test.py runs]
H --> K[pytest: protobuf_test.py skipped]
I --> K
Reviews (5): Last reviewed commit: "Drop stale review-context comment" | Re-trigger Greptile |
|
Re: greptile summary — the Inline P2 (silent override on missing jars) is addressed in the latest commit. |
Surface a stderr warning when the variable is explicitly requested but the spark-protobuf/protobuf-java jars are not present, so a CI misconfiguration is not masked as a silent skip. Default opt-out (unset or false) stays silent. Addresses greptile review feedback on NVIDIA#14885. Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
|
|
||
| def run(spark): | ||
| return _make_smoke_df(spark).select( | ||
| from_protobuf_fn(f.col("bin"), "test.Simple", desc_path).alias("d")) |
There was a problem hiding this comment.
Could we please validate that these tests work on a distributed setup like with HDFS? According to AI (which could totally be wrong) desc_path is read as a local file by pyspark. Because desc_path is written to with spark_tmp_path, which is a distributed setup for things like Dataproc, it could fail.
There was a problem hiding this comment.
Good catch, it did not work on HDFS. Updated to spark_tmp_path now.
Surface a stderr warning when the variable is explicitly requested but the spark-protobuf/protobuf-java jars are not present, so a CI misconfiguration is not masked as a silent skip. Default opt-out (unset or false) stays silent. Addresses greptile review feedback on NVIDIA#14885. Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
8a7b4e2 to
c04d075
Compare
spark-protobuf's path-based API reads `descFilePath` with `new File(...)` + `FileUtils.readFileToByteArray` (driver-local read), not via Hadoop FileSystem. Writing the descriptor to `spark_tmp_path` worked under local mode because the Hadoop FS defaults to `file://` and resolves to the driver's disk, but on Dataproc / HDFS-backed setups `spark_tmp_path` resolves to a remote URI and the driver's `new File()` cannot read it. Replace the Hadoop FS write with a `tempfile.mkstemp()` on the driver and clean it up via a pytest finalizer. Addresses NVIDIA#14885 review feedback from revans2.
c04d075 to
989aa98
Compare
spark-protobuf's path-based API reads `descFilePath` with `new File(...)` + `FileUtils.readFileToByteArray` (driver-local read), not via Hadoop FileSystem. Writing the descriptor to `spark_tmp_path` worked under local mode because the Hadoop FS defaults to `file://` and resolves to the driver's disk, but on Dataproc / HDFS-backed setups `spark_tmp_path` resolves to a remote URI and the driver's `new File()` cannot read it. Replace the Hadoop FS write with a `tempfile.mkstemp()` on the driver and clean it up via a pytest finalizer. Addresses NVIDIA#14885 review feedback from revans2.
989aa98 to
9c1f33a
Compare
spark-protobuf's path-based API reads `descFilePath` with `new File(...)` + `FileUtils.readFileToByteArray` (driver-local read), not via Hadoop FileSystem. The original implementation wrote the descriptor through Hadoop FS, which only worked in local mode because the default fs is `file://` and resolves to the same driver-local path; on a distributed setup `spark_tmp_path` would resolve to HDFS / GCS and the driver's `new File()` would fail. Switch to plain Python `open()` against `spark_tmp_path`, mirroring the convention already used by `json_fuzz_test.py` and `delta_lake_test.py` (both write driver-local files into `spark_tmp_path` the same way). Addresses NVIDIA#14885 review feedback from revans2.
9c1f33a to
3dc8dbb
Compare
Drop the WHAT/recap halves from the comments introduced earlier in this PR; keep only the WHY parts (spark-protobuf shading and the Spark 3.4.0+ module constraint).
Part of #14069. First slice carved out of #14354 following the plugin-side split plan in the issue.
Description
Problem
from_protobufships in the optionalspark-protobufmodule, which is not on the integration-test classpath by default. Subsequentfrom_protobufGPU PRs need a stable CPU baseline that can:User-facing changes
A new opt-in env var on
run_pyspark_from_build.sh:INCLUDE_SPARK_PROTOBUF_JAR(defaulttrue). Setfalseto skip protobuf tests.Mechanics mirror the existing
spark-avrointegration:integration_tests/pom.xmldeclaresspark-protobuf_${scala.binary.version}plus an unshadedprotobuf-java(3.25.5) inmaven-dependency-plugin. Both are copied intotarget/dependency/during thepackagephase, pinned through Maven (no shell-side version mapping).run_pyspark_from_build.shglobs both jars fromtarget/dependency/(orLOCAL_JAR_PATHfor prebuilt runs) and appends them to--jars.spark-protobufshades its owncom.google.protobufintoorg.sparkproject.spark_protobuf.protobuf, and Spark does not bundle the unshaded jar — that is why the secondprotobuf-javaartifact is needed (the smoke tests build descriptors viacom.google.protobuf.DescriptorProtos).No new spark-rapids configs and no behavioral change to existing CI lanes.
Solution
Three files in
integration_tests/:pom.xml: two new<artifactItem>s and matching cleanup globs, alongside the existingspark-avroblock.run_pyspark_from_build.sh: glob the copied jars, gate them behindINCLUDE_SPARK_PROTOBUF_JAR, append toALL_JARS.--jarsalready reaches both driver and executor classpath in client mode, so no separate--driver-class-pathplumbing is needed (avro doesn't have one either).protobuf_test.py(new): two minimal fallback-only smoke tests that build aFileDescriptorSetthrough the JVM, hand-encode a couple of proto2 messages, and exercise both the path-based and (Spark 3.5+)binaryDescriptorSetfrom_protobufAPI variants.No GPU code is touched. GPU support for
from_protobufis not yet enabled, so the smoke tests use@allow_non_gpu("ProjectExec", "ProtobufDataToCatalyst")+assert_gpu_fallback_collectto verify the plugin falls back to CPU while still producing correct results.Scope-limited intentionally — proto resource files, the test data generator helpers, and the full
protobuf_test.pyfrom #14354 belong to later slices (plugin-1a / 1b / 1c).Testing
New tests in
integration_tests/src/main/python/protobuf_test.py:test_from_protobuf_smoke_path_api— path-based descriptor API, all Spark 3.4+test_from_protobuf_smoke_binary_descriptor_api—binaryDescriptorSetAPI, auto-skipped on Spark 3.4Both run through
assert_gpu_fallback_collect("ProtobufDataToCatalyst"), which asserts that the GPU plan contains the CPU fallback expression and that CPU/GPU results match.Checklists
Documentation
Testing
(Please provide the names of the existing tests in the PR description.)
Performance