
[GH-2609] Support Spark 4.1 #2649

Merged
jiayuasu merged 14 commits into master from support-spark-4.1 on Feb 14, 2026

Conversation

jiayuasu (Member) commented Feb 12, 2026

Did you read the Contributor Guide?

Is this PR related to a ticket?

What changes were proposed in this PR?

This PR adds support for Apache Spark 4.1 in Sedona.

Build scaffolding

  • Added sedona-spark-4.1 Maven profile in root pom.xml (Spark 4.1.1, Scala 2.13.17, Hadoop 3.4.1)
  • Added spark-4.1 module entry in spark/pom.xml (enable-all-submodules profile)
  • Added sedona-spark-4.1 profile in spark/common/pom.xml with spark-sql-api dependency
  • Created spark/spark-4.1/ module (copied from spark/spark-4.0/, updated artifactId)
  • Fixed Scala version mismatch: updated scala2.13 and sedona-spark-4.0 profiles to use Scala 2.13.17

Spark 4.1 API compatibility fixes

  • ParquetColumnVector.java: Replaced the direct setAllNull() call with a reflection-based markAllNull() helper that works with both setAllNull (Spark < 4.1) and setMissing (Spark 4.1+); see the sketch after this list
  • Functions.scala: Added explicit org.locationtech.jts.geom.Geometry import to resolve ambiguity with new org.apache.spark.sql.functions.Geometry in Spark 4.1
  • SedonaArrowEvalPythonExec.scala (spark-4.1 only): Added sessionUUID parameter required by Spark 4.1 ArrowEvalPythonExec
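
A minimal sketch of the reflection pattern, transliterated to Scala for consistency with the other examples here (the object name is hypothetical; the real fix lives in ParquetColumnVector.java and may cache the resolved Method differently):

```scala
import java.lang.reflect.Method
import org.apache.spark.sql.execution.vectorized.WritableColumnVector

object ColumnVectorCompat {
  // Resolve whichever "mark all values null" method this Spark version
  // exposes: setAllNull() on Spark < 4.1, setMissing() on Spark 4.1+.
  // Resolved once and cached.
  private lazy val markAllNullMethod: Method =
    try classOf[WritableColumnVector].getMethod("setAllNull")
    catch {
      case _: NoSuchMethodException =>
        classOf[WritableColumnVector].getMethod("setMissing")
    }

  def markAllNull(vector: WritableColumnVector): Unit =
    markAllNullMethod.invoke(vector)
}
```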

SPARK-52671 UDT workaround

Spark 4.1 changed RowEncoder.encoderForDataType to call udt.getClass directly instead of looking it up via UDTRegistration. For Scala case object UDTs, getClass returns the module class (e.g., GeometryUDT$), which has a private constructor, causing a ScalaReflectionException.

Fix: Added apply() factory methods to all three UDT case objects (GeometryUDT, GeographyUDT, RasterUDT) and toString overrides to return the simple class name. Replaced bare singleton references with UDT() calls across source files so that schema construction uses proper class instances.
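
A condensed sketch of the workaround (serialization bodies elided; the real GeometryUDT carries full WKB encode/decode logic):

```scala
import org.apache.spark.sql.types.{BinaryType, DataType, UserDefinedType}
import org.locationtech.jts.geom.Geometry

class GeometryUDT extends UserDefinedType[Geometry] {
  override def sqlType: DataType = BinaryType
  override def userClass: Class[Geometry] = classOf[Geometry]
  override def serialize(obj: Geometry): Any = ???     // WKB encoding elided
  override def deserialize(datum: Any): Geometry = ??? // WKB decoding elided
  // Report the plain class name rather than the module class "GeometryUDT$".
  override def toString: String = "GeometryUDT"
}

// The companion stays a case object for source compatibility, but apply()
// hands out plain class instances whose getClass has a public constructor,
// so Spark 4.1's RowEncoder can re-instantiate the UDT.
case object GeometryUDT extends GeometryUDT {
  def apply(): GeometryUDT = new GeometryUDT
}
```

Call sites then build schemas with GeometryUDT() rather than the bare singleton, e.g. StructField("geom", GeometryUDT()), so the UDT instance recorded in a schema is always re-instantiable.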

Parser compatibility (Spark 4.1)

Spark 4.1's SparkSqlParser introduced an override for parsePlanWithParameters that calls parseInternal directly, bypassing parsePlan. Since SedonaSqlParser only overrode parsePlan, Sedona's SQL parsing interception was never invoked.

Fix: Added parsePlanWithParameters override in SedonaSqlParser with the same delegate-first, Sedona-fallback pattern. Kept the parsePlan override as a defensive measure.
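
Sketched below; the class is left abstract so the remaining ParserInterface members can be elided, and the parsePlanWithParameters signature is simplified (the actual Spark 4.1 method also threads the bind parameters through):

```scala
import org.apache.spark.sql.catalyst.parser.ParserInterface
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

abstract class SedonaParserSketch(delegate: ParserInterface) extends ParserInterface {

  // Sedona's own ANTLR-backed grammar (implementation elided).
  protected def parseSedonaSql(sqlText: String): LogicalPlan

  // Pre-4.1 entry point: try the delegate first, fall back to Sedona.
  // Kept as a defensive measure on 4.1+.
  override def parsePlan(sqlText: String): LogicalPlan =
    try delegate.parsePlan(sqlText)
    catch { case _: Exception => parseSedonaSql(sqlText) }

  // Spark 4.1 routes SQL through this entry point and never calls parsePlan,
  // so the same delegate-first guard must exist here too.
  def parsePlanWithParameters(sqlText: String): LogicalPlan =
    try delegate.parsePlan(sqlText)
    catch { case _: Exception => parseSedonaSql(sqlText) }
}
```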

Function registration smart guard

Spark 4.1 introduces built-in geospatial ST functions, but these are not yet ready for use. To ensure Sedona's implementations take precedence (a sketch of the registration guard follows this list):

  • Register Sedona functions with overrideIfExists = true so they overwrite Spark's native ST functions
  • Skip re-registration of already-registered Sedona functions to avoid unnecessary overhead
  • Set spark.sql.geospatial.enabled=false in spark-4.1 tests to prevent Spark's native functions from shadowing Sedona's
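
A rough sketch of the guard, assuming code in an org.apache.spark.sql.* package (as Sedona's registrator is) so that sessionState is accessible; the "already ours" test by class-name prefix is illustrative, not Sedona's actual check:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.FunctionIdentifier
import org.apache.spark.sql.catalyst.analysis.FunctionRegistry.FunctionBuilder
import org.apache.spark.sql.catalyst.expressions.ExpressionInfo

object RegistrationGuardSketch {
  def register(
      spark: SparkSession,
      name: FunctionIdentifier,
      info: ExpressionInfo,
      builder: FunctionBuilder): Unit = {
    val registry = spark.sessionState.functionRegistry
    // Skip the no-op case: this name already resolves to a Sedona expression.
    val alreadySedona = registry
      .lookupFunction(name)
      .exists(_.getClassName.startsWith("org.apache.sedona"))
    if (!alreadySedona) {
      // registerFunction replaces any existing entry, so a native Spark 4.1
      // ST_* function registered under the same name is overwritten.
      registry.registerFunction(name, info, builder)
    }
  }
}
```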

commons-collections 3.x removal

Replaced org.apache.commons.collections.list.UnmodifiableList usage with java.util.Collections.unmodifiableList() in OsmRelation.java to avoid pulling in commons-collections 3.x (which has known CVEs and is not on Spark 4.1's classpath).
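
The swap is a one-line JDK replacement; shown in Scala for consistency (OsmRelation.java does the equivalent in Java):

```scala
import java.util.{Collections, List => JList}

object UnmodifiableListSketch {
  // Before: org.apache.commons.collections.list.UnmodifiableList.decorate(members)
  // After: the JDK-provided read-only view; no commons-collections 3.x needed.
  def readOnlyView(members: JList[String]): JList[String] =
    Collections.unmodifiableList(members)
}
```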

OsmReaderTest self-join fix

Spark 4.1 rejects ambiguous self-joins. Fixed OsmReaderTest by aliasing DataFrames before joining.
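
Illustratively (the column names here are hypothetical, not OsmReaderTest's actual schema):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

object SelfJoinSketch {
  // Spark 4.1 rejects joining a DataFrame with itself when column references
  // are ambiguous; aliasing both sides makes each column's provenance explicit.
  def selfJoin(nodes: DataFrame): DataFrame = {
    val left = nodes.alias("l")
    val right = nodes.alias("r")
    left.join(right, col("l.ref") === col("r.id"))
  }
}
```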

PySpark 4.1 binary type fix

PySpark 4.1 returns bytes (not bytearray) for BinaryType columns. Fixed ST_AsBinary/ST_AsEWKB tests to accept both types via isinstance(actual_result, (bytes, bytearray)).

Python support

  • Split pyspark dependency with python_version markers in python/pyproject.toml to handle PySpark 4.1 requiring Python 3.10+

CI workflows

  • java.yml: Added Spark 4.1.1 + Scala 2.13.17 + JDK 17 matrix entries (both compile and unit-test jobs)
  • python.yml: Added Spark 4.1.1 matrix entry with python_version >= '3.10'
  • example.yml: Added Spark 4.1 matrix entry

Documentation

  • Updated docs/setup/maven-coordinates.md with Spark 4.1 artifact coordinates
  • Updated docs/setup/platform.md compatibility table (Spark 4.1 requires Scala 2.13 and Python 3.10+)
  • Updated docs/community/publish.md release checklist

How was this PR tested?

  • All Spark/Scala/JDK combinations compile successfully:
    • Spark 3.4 + Scala 2.12 + JDK 11
    • Spark 3.4 + Scala 2.13 + JDK 11
    • Spark 3.5 + Scala 2.12 + JDK 11
    • Spark 4.0 + Scala 2.13 + JDK 17
    • Spark 4.1 + Scala 2.13 + JDK 17
  • Spark 4.1 unit tests pass (SQLSyntaxTestScala, OsmReaderTest, etc.)
  • PySpark 4.1 tests pass (ST_AsBinary, ST_AsEWKB binary type handling)

Key files changed

| Area | Files |
| --- | --- |
| Build | pom.xml, spark/pom.xml, spark/common/pom.xml, spark/spark-4.1/pom.xml |
| Spark 4.1 module | spark/spark-4.1/src/ (all files) |
| UDT workaround | GeometryUDT.scala, GeographyUDT.scala, RasterUDT.scala, plus ~50 files with schema changes |
| Parser fix | SedonaSqlParser.scala (spark-4.1) |
| Function registration | SedonaRegistrator.scala (spark-4.1) |
| API compat | ParquetColumnVector.java, Functions.scala |
| commons-collections | OsmRelation.java |
| Python | python/pyproject.toml, python/tests/sql/test_dataframe_api.py |
| CI | .github/workflows/java.yml, python.yml, example.yml |
| Docs | docs/setup/maven-coordinates.md, platform.md, docs/community/publish.md |

github-actions bot added the docs, sedona-spark, github_actions, and root labels on Feb 12, 2026

Commit messages

- Add sedona-spark-4.1 Maven profile (Spark 4.1.0, Scala 2.13.17, Hadoop 3.4.1)
- Create spark/spark-4.1 module based on spark-4.0
- Fix Geometry import ambiguity (Spark 4.1 adds o.a.s.sql.types.Geometry)
- Fix WritableColumnVector.setAllNull() removal (replaced by setMissing() in 4.1)
- Add sessionUUID parameter to ArrowPythonWithNamedArgumentRunner (new in 4.1)
- Update docs (maven-coordinates, platform, publish)
- Update CI workflows (java, example, python, docker-build)

Spark 4.1's RowEncoder calls udt.getClass directly, which returns the Scala module class (e.g. GeometryUDT$) with a private constructor for case objects, causing EXPRESSION_DECODING_FAILED errors.

Fix: Add apply() methods to GeometryUDT, GeographyUDT, and RasterUDT case objects that return new class instances, and use UDT() instead of the bare singleton throughout schema construction code. This ensures getClass returns the public class with an accessible constructor.

Also:
- Revert docker-build.yml (no Spark 4.1 in Docker builds)
- Bump pyspark upper bound from <4.1.0 to <4.2.0
- Bump Spark 4.1.0 to 4.1.1 in CI and POM
- Fix Scala 2.13.12 vs 2.13.17 mismatch in scala2.13 profile

Spark 4.1 no longer provides commons-collections 3.x transitively. Replace FilterIterator with Java 8 stream filtering in DuplicatesFilter, and IteratorUtils.toList with StreamSupport in the test.
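
A sketch of that stream-based replacement (names are hypothetical; DuplicatesFilter's real predicate differs):

```scala
import java.util.Spliterators
import java.util.stream.StreamSupport

object StreamFilterSketch {
  // Replaces commons-collections' FilterIterator with a lazily evaluated
  // Java 8 stream filter; no extra dependency required.
  def withoutDuplicates[T](
      it: java.util.Iterator[T],
      seen: java.util.Set[T]): java.util.Iterator[T] =
    StreamSupport
      .stream(Spliterators.spliteratorUnknownSize(it, 0), false)
      .filter(e => seen.add(e)) // Set.add returns false for repeats
      .iterator()
}
```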

In Spark 4.1, SparkSqlParser introduced an override for parsePlanWithParameters that bypasses parsePlan entirely. SedonaSqlParser only overrode parsePlan, so its SQL parsing interception was never invoked on Spark 4.1.

Fix by also overriding parsePlanWithParameters in SedonaSqlParser to use the same delegate-first, Sedona-fallback pattern.

Also disable spark.sql.geospatial.enabled in tests to prevent Spark 4.1+ native geospatial functions from shadowing Sedona's ST functions.

PySpark 4.1 returns BinaryType columns as bytes instead of bytearray. Update the isinstance check to handle both types so ST_AsBinary and ST_AsEWKB test results are properly hex-encoded before comparison.

Copilot AI (Contributor) left a comment

Pull request overview

This pull request adds comprehensive support for Apache Spark 4.1 in Sedona, building upon the existing Spark 3.4, 3.5, and 4.0 support.

Changes:

  • Added Spark 4.1.1 build profiles and module structure (Maven POM files, CI workflows)
  • Implemented API compatibility fixes for Spark 4.1 changes (ParquetColumnVector reflection, SedonaSqlParser, SedonaArrowEvalPythonExec)
  • Applied SPARK-52671 UDT workaround by adding apply() factory methods to GeometryUDT, GeographyUDT, and RasterUDT case objects and replacing bare UDT references with UDT() calls throughout the codebase

Reviewed changes

Copilot reviewed 100 out of 100 changed files in this pull request and generated 1 comment.

Summary per file:

| File | Description |
| --- | --- |
| pom.xml | Added sedona-spark-4.1 Maven profile with Spark 4.1.1, Scala 2.13.17, Hadoop 3.4.1; updated Scala version to 2.13.17 for scala2.13 and sedona-spark-4.0 profiles |
| spark/pom.xml | Added spark-4.1 module entry in enable-all-submodules profile |
| spark/common/pom.xml | Added sedona-spark-4.1 profile with spark-sql-api dependency |
| spark/spark-4.1/pom.xml | New module POM with dependencies matching spark-4.0 structure |
| spark/spark-4.1/src/** | Complete Spark 4.1 module with main and test source files adapted from spark-4.0 |
| spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/UDT/*.scala | Added apply() factory methods to GeometryUDT, GeographyUDT, RasterUDT case objects |
| spark/*/src/**/*.scala | Replaced bare UDT singleton references with UDT() calls across all Spark modules (3.4, 3.5, 4.0, 4.1, common) |
| spark/common/src/main/java/.../ParquetColumnVector.java | Changed setAllNull() to reflection-based markAllNull() supporting both Spark ≤4.0 and ≥4.1 |
| spark/common/src/main/scala/.../Functions.scala | Added explicit Geometry import to resolve Spark 4.1 ambiguity |
| spark/spark-4.1/src/main/scala/.../SedonaArrowEvalPythonExec.scala | Added sessionUUID parameter required by Spark 4.1 ArrowEvalPythonExec |
| spark/spark-4.1/src/main/scala/.../SedonaSqlParser.scala | Overrode parsePlanWithParameters() for Spark 4.1 parse flow changes |
| python/pyproject.toml | Bumped pyspark upper bound from <4.1.0 to <4.2.0 (conditional on Python ≥3.10) |
| .github/workflows/*.yml | Added Spark 4.1.1 CI matrix entries; updated fail-fast to false |
| docs/setup/*.md | Updated documentation with Spark 4.1 compatibility tables and Maven coordinates |
| docs/community/publish.md | Updated release checklist for Spark 4.1 |


The comment below targets this import block in Functions.scala:

import org.apache.spark.sql.types._
import org.locationtech.jts.algorithm.MinimumBoundingCircle
import org.locationtech.jts.geom._
import org.locationtech.jts.geom.Geometry

Copilot AI commented Feb 14, 2026
The explicit import of org.locationtech.jts.geom.Geometry is added on line 35 but there's already a wildcard import of org.locationtech.jts.geom._ on line 34. The explicit import is redundant since the wildcard import already includes Geometry. While this doesn't cause compilation errors, it's unnecessary code duplication that could be simplified by removing the explicit import on line 35.

Suggested change (remove this line):
import org.locationtech.jts.geom.Geometry

jiayuasu marked this pull request as ready for review February 14, 2026 09:07
jiayuasu requested a review from jbampton as a code owner February 14, 2026 09:07
jiayuasu merged commit d72bb63 into master Feb 14, 2026 (42 checks passed)
jiayuasu added this to the sedona-1.9.0 milestone Feb 14, 2026