Conversation
- Add sedona-spark-4.1 Maven profile (Spark 4.1.0, Scala 2.13.17, Hadoop 3.4.1)
- Create spark/spark-4.1 module based on spark-4.0
- Fix Geometry import ambiguity (Spark 4.1 adds o.a.s.sql.types.Geometry)
- Fix WritableColumnVector.setAllNull() removal (replaced by setMissing() in 4.1)
- Add sessionUUID parameter to ArrowPythonWithNamedArgumentRunner (new in 4.1)
- Update docs (maven-coordinates, platform, publish)
- Update CI workflows (java, example, python, docker-build)
…patibility table, refine CI matrices
Force-pushed from 6b14539 to 3c621e5
Spark 4.1's RowEncoder calls udt.getClass directly, which returns the Scala module class (e.g. GeometryUDT$) with a private constructor for case objects, causing EXPRESSION_DECODING_FAILED errors.

Fix: Add apply() methods to the GeometryUDT, GeographyUDT, and RasterUDT case objects that return new class instances, and use UDT() instead of the bare singleton throughout schema construction code. This ensures getClass returns the public class with an accessible constructor.

Also:
- Revert docker-build.yml (no Spark 4.1 in Docker builds)
- Bump pyspark upper bound from <4.1.0 to <4.2.0
- Bump Spark 4.1.0 to 4.1.1 in CI and POM
- Fix Scala 2.13.12 vs 2.13.17 mismatch in scala2.13 profile
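A minimal sketch of the workaround described above, assuming a heavily simplified UDT (the real GeometryUDT implements full WKB serialization; the sqlType choice and the elided method bodies here are placeholders):

```scala
import org.apache.spark.sql.types.{BinaryType, DataType, UserDefinedType}
import org.locationtech.jts.geom.Geometry

// Simplified stand-in for Sedona's GeometryUDT; serialization details elided.
class GeometryUDT extends UserDefinedType[Geometry] {
  override def sqlType: DataType = BinaryType           // placeholder choice for this sketch
  override def userClass: Class[Geometry] = classOf[Geometry]
  override def serialize(obj: Geometry): Any = ???      // WKB encoding elided
  override def deserialize(datum: Any): Geometry = ???  // WKB decoding elided
  // Report the plain class name instead of the module class name (GeometryUDT$)
  override def toString: String = "GeometryUDT"
}

case object GeometryUDT extends GeometryUDT {
  // Spark 4.1's RowEncoder calls udt.getClass and instantiates it reflectively.
  // On the bare singleton that yields GeometryUDT$ (private constructor), so
  // schema construction code calls GeometryUDT() to get a public-class instance.
  def apply(): GeometryUDT = new GeometryUDT
}

// Usage: StructField("geom", GeometryUDT()) instead of StructField("geom", GeometryUDT)
```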
…failure on Python <3.10
…disable geospatial in tests
…ion, overwrite Spark native
Spark 4.1 no longer provides commons-collections 3.x transitively. Replace FilterIterator with Java 8 stream filtering in DuplicatesFilter, and IteratorUtils.toList with StreamSupport in the test.
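As a rough illustration of that replacement (written in Scala for consistency with the other examples here, although the actual change is in Java; the method, variable, and id types are hypothetical), stream-based filtering can stand in for commons-collections' FilterIterator and IteratorUtils.toList:

```scala
import java.util.Spliterators
import java.util.stream.StreamSupport
import scala.jdk.CollectionConverters._

// Drop duplicate ids from a Java iterator without commons-collections:
// StreamSupport turns the iterator into a Stream and filter() takes the
// place of FilterIterator.
def deduplicate(ids: java.util.Iterator[java.lang.Long]): java.util.List[java.lang.Long] = {
  val seen = new java.util.HashSet[java.lang.Long]()
  val out = new java.util.ArrayList[java.lang.Long]()
  StreamSupport
    .stream(Spliterators.spliteratorUnknownSize(ids, 0), false) // Iterator -> Stream
    .filter((id: java.lang.Long) => seen.add(id))               // keep only the first occurrence
    .iterator()
    .asScala
    .foreach(out.add)                                           // materialize without IteratorUtils.toList
  out
}
```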
In Spark 4.1, SparkSqlParser introduced an override for parsePlanWithParameters that bypasses parsePlan entirely. SedonaSqlParser only overrode parsePlan, so its SQL parsing interception was never invoked on Spark 4.1. Fix by also overriding parsePlanWithParameters in SedonaSqlParser to use the same delegate-first, Sedona-fallback pattern. Also disable spark.sql.geospatial.enabled in tests to prevent Spark 4.1+ native geospatial functions from shadowing Sedona's ST functions.
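A rough Scala sketch of that delegate-first, Sedona-fallback shape, shown here on parsePlan (the new parsePlanWithParameters override follows the same pattern; parseSedonaSql and the class name are stand-ins, not the actual SedonaSqlParser code):

```scala
import org.apache.spark.sql.catalyst.parser.ParserInterface
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

class DelegatingParserSketch(delegate: ParserInterface) {
  // Hypothetical stand-in for Sedona's own SQL grammar entry point
  private def parseSedonaSql(sqlText: String): LogicalPlan = ???

  def parsePlan(sqlText: String): LogicalPlan =
    try delegate.parsePlan(sqlText) // let Spark's parser handle standard SQL first
    catch {
      case _: Exception => parseSedonaSql(sqlText) // only then fall back to Sedona's grammar
    }
}
```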
PySpark 4.1 returns BinaryType columns as bytes instead of bytearray. Update the isinstance check to handle both types so ST_AsBinary and ST_AsEWKB test results are properly hex-encoded before comparison.
Force-pushed from a8833e7 to b4c3c17
Pull request overview
This pull request adds comprehensive support for Apache Spark 4.1 in Sedona, building upon the existing Spark 3.4, 3.5, and 4.0 support.
Changes:
- Added Spark 4.1.1 build profiles and module structure (Maven POM files, CI workflows)
- Implemented API compatibility fixes for Spark 4.1 changes (ParquetColumnVector reflection, SedonaSqlParser, SedonaArrowEvalPythonExec)
- Applied SPARK-52671 UDT workaround by adding apply() factory methods to GeometryUDT, GeographyUDT, and RasterUDT case objects and replacing bare UDT references with UDT() calls throughout the codebase
Reviewed changes
Copilot reviewed 100 out of 100 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| pom.xml | Added sedona-spark-4.1 Maven profile with Spark 4.1.1, Scala 2.13.17, Hadoop 3.4.1; updated Scala version to 2.13.17 for scala2.13 and sedona-spark-4.0 profiles |
| spark/pom.xml | Added spark-4.1 module entry in enable-all-submodules profile |
| spark/common/pom.xml | Added sedona-spark-4.1 profile with spark-sql-api dependency |
| spark/spark-4.1/pom.xml | New module POM with dependencies matching spark-4.0 structure |
| spark/spark-4.1/src/** | Complete Spark 4.1 module with main and test source files adapted from spark-4.0 |
| spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/UDT/*.scala | Added apply() factory methods to GeometryUDT, GeographyUDT, RasterUDT case objects |
| spark/*/src/**/*.scala | Replaced bare UDT singleton references with UDT() calls across all Spark modules (3.4, 3.5, 4.0, 4.1, common) |
| spark/common/src/main/java/.../ParquetColumnVector.java | Changed setAllNull() to reflection-based markAllNull() supporting both Spark ≤4.0 and ≥4.1 |
| spark/common/src/main/scala/.../Functions.scala | Added explicit Geometry import to resolve Spark 4.1 ambiguity |
| spark/spark-4.1/src/main/scala/.../SedonaArrowEvalPythonExec.scala | Added sessionUUID parameter required by Spark 4.1 ArrowEvalPythonExec |
| spark/spark-4.1/src/main/scala/.../SedonaSqlParser.scala | Overrode parsePlanWithParameters() for Spark 4.1 parse flow changes |
| python/pyproject.toml | Bumped pyspark upper bound from <4.1.0 to <4.2.0 (conditional on Python ≥3.10) |
| .github/workflows/*.yml | Added Spark 4.1.1 CI matrix entries; updated fail-fast to false |
| docs/setup/*.md | Updated documentation with Spark 4.1 compatibility tables and Maven coordinates |
| docs/community/publish.md | Updated release checklist for Spark 4.1 |
import org.apache.spark.sql.types._
import org.locationtech.jts.algorithm.MinimumBoundingCircle
import org.locationtech.jts.geom._
import org.locationtech.jts.geom.Geometry
The explicit import of org.locationtech.jts.geom.Geometry is added on line 35 but there's already a wildcard import of org.locationtech.jts.geom._ on line 34. The explicit import is redundant since the wildcard import already includes Geometry. While this doesn't cause compilation errors, it's unnecessary code duplication that could be simplified by removing the explicit import on line 35.
Did you read the Contributor Guide?
Is this PR related to a ticket?
[GH-XXX] my subject. Closes Support Spark 4.1 #2609

What changes were proposed in this PR?
This PR adds support for Apache Spark 4.1 in Sedona.
Build scaffolding
- Added `sedona-spark-4.1` Maven profile in root `pom.xml` (Spark 4.1.1, Scala 2.13.17, Hadoop 3.4.1)
- Added `spark-4.1` module entry in `spark/pom.xml` (`enable-all-submodules` profile)
- Added `sedona-spark-4.1` profile in `spark/common/pom.xml` with `spark-sql-api` dependency
- Added `spark/spark-4.1/` module (copied from `spark/spark-4.0/`, updated `artifactId`)
- Updated `scala2.13` and `sedona-spark-4.0` profiles to use Scala 2.13.17

Spark 4.1 API compatibility fixes
- Changed `setAllNull()` to a reflection-based `markAllNull()` that works with both `setAllNull` (Spark < 4.1) and `setMissing` (Spark 4.1+); see the sketch below
- Added an explicit `org.locationtech.jts.geom.Geometry` import to resolve ambiguity with the new `org.apache.spark.sql.functions.Geometry` in Spark 4.1
- Added the `sessionUUID` parameter required by Spark 4.1's `ArrowEvalPythonExec`
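A hedged sketch of the reflection-based dispatch in the first bullet above (the real fix is Java code inside ParquetColumnVector; this assumes both methods are public no-arg methods on WritableColumnVector, which is what the description implies):

```scala
import org.apache.spark.sql.execution.vectorized.WritableColumnVector

// Call setMissing() on Spark 4.1+ and fall back to setAllNull() on older Spark,
// resolving the available method via reflection.
def markAllNull(vector: WritableColumnVector): Unit = {
  val method =
    try vector.getClass.getMethod("setMissing")   // Spark 4.1+
    catch {
      case _: NoSuchMethodException =>
        vector.getClass.getMethod("setAllNull")   // Spark <= 4.0
    }
  method.invoke(vector)
}
```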
SPARK-52671 UDT workaround

Spark 4.1 changed `RowEncoder.encoderForDataType` to call `udt.getClass` directly instead of looking it up via `UDTRegistration`. For Scala `case object` UDTs, `getClass` returns the module class (e.g., `GeometryUDT$`), which has a private constructor, causing `ScalaReflectionException`.

Fix: Added `apply()` factory methods to all three UDT case objects (`GeometryUDT`, `GeographyUDT`, `RasterUDT`) and `toString` overrides to return the simple class name. Replaced bare singleton references with `UDT()` calls across source files so that schema construction uses proper class instances.

Parser compatibility (Spark 4.1)
Spark 4.1's `SparkSqlParser` introduced an override for `parsePlanWithParameters` that calls `parseInternal` directly, bypassing `parsePlan`. Since `SedonaSqlParser` only overrode `parsePlan`, Sedona's SQL parsing interception was never invoked.

Fix: Added a `parsePlanWithParameters` override in `SedonaSqlParser` with the same delegate-first, Sedona-fallback pattern. Kept the `parsePlan` override as a defensive measure.

Function registration smart guard
Spark 4.1 introduces built-in geospatial ST functions, but these functions are not yet ready to be used. To ensure Sedona's implementations take precedence:

- Sedona's functions are registered with `overrideIfExists = true` so they overwrite Spark's native ST functions
- `spark.sql.geospatial.enabled=false` is set in spark-4.1 tests to prevent Spark's native functions from shadowing Sedona's; see the sketch below
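For instance, a test session might be built like this (a minimal sketch; only the `spark.sql.geospatial.enabled` key comes from this PR, the rest is generic test boilerplate):

```scala
import org.apache.spark.sql.SparkSession

// Keep Spark 4.1's built-in ST functions disabled so Sedona's implementations
// are the ones resolved during tests.
val spark = SparkSession
  .builder()
  .master("local[*]")
  .appName("sedona-spark-4.1-tests")
  .config("spark.sql.geospatial.enabled", "false")
  .getOrCreate()
```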
commons-collections 3.x removal

Replaced `org.apache.commons.collections.list.UnmodifiableList` usage with `java.util.Collections.unmodifiableList()` in `OsmRelation.java` to avoid pulling in commons-collections 3.x (which has known CVEs and is not on Spark 4.1's classpath).

OsmReaderTest self-join fix
Spark 4.1 rejects ambiguous self-joins. Fixed `OsmReaderTest` by aliasing DataFrames before joining.
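An illustrative sketch of that aliasing fix (the function, DataFrame, and column names are hypothetical, not the actual OsmReaderTest code):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Alias both sides of the self-join so Spark 4.1 can resolve each column
// reference unambiguously instead of failing the ambiguous self-join check.
def resolveReferences(nodes: DataFrame): DataFrame = {
  val members = nodes.alias("m")
  val targets = nodes.alias("n")
  members.join(targets, col("m.ref_id") === col("n.id"))
}
```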
PySpark 4.1 binary type fix

PySpark 4.1 returns `bytes` (not `bytearray`) for `BinaryType` columns. Fixed `ST_AsBinary`/`ST_AsEWKB` tests to accept both types via `isinstance(actual_result, (bytes, bytearray))`.

Python support
- Added `python_version` markers in `python/pyproject.toml` to handle PySpark 4.1 requiring Python 3.10+

CI workflows
- Added Spark 4.1.1 matrix entries to the CI workflows (`compile` and `unit-test` jobs)
- Python CI runs Spark 4.1 only where `python_version >= '3.10'`

Documentation

- Updated `docs/setup/maven-coordinates.md` with Spark 4.1 artifact coordinates
- Updated `docs/setup/platform.md` compatibility table (Spark 4.1 requires Scala 2.13 and Python 3.10+)
- Updated `docs/community/publish.md` release checklist

How was this PR tested?
Key files changed