
[GH-2609] Support Spark 4.1 #2649

Merged
jiayuasu merged 14 commits into master from support-spark-4.1 on Feb 14, 2026

Conversation

jiayuasu (Member) commented Feb 12, 2026

Did you read the Contributor Guide?

Is this PR related to a ticket?

What changes were proposed in this PR?

This PR adds support for Apache Spark 4.1 in Sedona.

Build scaffolding

  • Added sedona-spark-4.1 Maven profile in root pom.xml (Spark 4.1.1, Scala 2.13.17, Hadoop 3.4.1)
  • Added spark-4.1 module entry in spark/pom.xml (enable-all-submodules profile)
  • Added sedona-spark-4.1 profile in spark/common/pom.xml with spark-sql-api dependency
  • Created spark/spark-4.1/ module (copied from spark/spark-4.0/, updated artifactId)
  • Fixed Scala version mismatch: updated scala2.13 and sedona-spark-4.0 profiles to use Scala 2.13.17

Spark 4.1 API compatibility fixes

  • ParquetColumnVector.java: Replaced the direct setAllNull() call with a reflection-based markAllNull() helper that works with both setAllNull (Spark < 4.1) and setMissing (Spark 4.1+); see the sketch after this list
  • Functions.scala: Added explicit org.locationtech.jts.geom.Geometry import to resolve ambiguity with new org.apache.spark.sql.functions.Geometry in Spark 4.1
  • SedonaArrowEvalPythonExec.scala (spark-4.1 only): Added sessionUUID parameter required by Spark 4.1 ArrowEvalPythonExec
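
A minimal sketch of the reflection pattern, transliterated to Scala for consistency with the other examples here (the object name is hypothetical; the real fix lives in ParquetColumnVector.java and may cache the resolved Method differently):

```scala
import java.lang.reflect.Method
import org.apache.spark.sql.execution.vectorized.WritableColumnVector

object ColumnVectorCompat {
  // Resolve whichever "mark all values null" method this Spark version
  // exposes: setAllNull() on Spark < 4.1, setMissing() on Spark 4.1+.
  // Resolved once and cached.
  private lazy val markAllNullMethod: Method =
    try classOf[WritableColumnVector].getMethod("setAllNull")
    catch {
      case _: NoSuchMethodException =>
        classOf[WritableColumnVector].getMethod("setMissing")
    }

  def markAllNull(vector: WritableColumnVector): Unit =
    markAllNullMethod.invoke(vector)
}
```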

SPARK-52671 UDT workaround

Spark 4.1 changed RowEncoder.encoderForDataType to call udt.getClass directly instead of looking it up via UDTRegistration. For Scala case object UDTs, getClass returns the module class (e.g., GeometryUDT$), which has a private constructor, causing a ScalaReflectionException.

Fix: Added apply() factory methods to all three UDT case objects (GeometryUDT, GeographyUDT, RasterUDT) and toString overrides to return the simple class name. Replaced bare singleton references with UDT() calls across source files so that schema construction uses proper class instances.
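
A condensed sketch of the workaround (serialization bodies elided; the real GeometryUDT carries full WKB encode/decode logic):

```scala
import org.apache.spark.sql.types.{BinaryType, DataType, UserDefinedType}
import org.locationtech.jts.geom.Geometry

class GeometryUDT extends UserDefinedType[Geometry] {
  override def sqlType: DataType = BinaryType
  override def userClass: Class[Geometry] = classOf[Geometry]
  override def serialize(obj: Geometry): Any = ???     // WKB encoding elided
  override def deserialize(datum: Any): Geometry = ??? // WKB decoding elided
  // Report the plain class name rather than the module class "GeometryUDT$".
  override def toString: String = "GeometryUDT"
}

// The companion stays a case object for source compatibility, but apply()
// hands out plain class instances whose getClass has a public constructor,
// so Spark 4.1's RowEncoder can re-instantiate the UDT.
case object GeometryUDT extends GeometryUDT {
  def apply(): GeometryUDT = new GeometryUDT
}
```

Call sites then build schemas with GeometryUDT() rather than the bare singleton, e.g. StructField("geom", GeometryUDT()), so the UDT instance recorded in a schema is always re-instantiable.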

Parser compatibility (Spark 4.1)

Spark 4.1's SparkSqlParser introduced an override for parsePlanWithParameters that calls parseInternal directly, bypassing parsePlan. Since SedonaSqlParser only overrode parsePlan, Sedona's SQL parsing interception was never invoked.

Fix: Added parsePlanWithParameters override in SedonaSqlParser with the same delegate-first, Sedona-fallback pattern. Kept the parsePlan override as a defensive measure.
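
Sketched below; the class is left abstract so the remaining ParserInterface members can be elided, and the parsePlanWithParameters signature is simplified (the actual Spark 4.1 method also threads the bind parameters through):

```scala
import org.apache.spark.sql.catalyst.parser.ParserInterface
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

abstract class SedonaParserSketch(delegate: ParserInterface) extends ParserInterface {

  // Sedona's own ANTLR-backed grammar (implementation elided).
  protected def parseSedonaSql(sqlText: String): LogicalPlan

  // Pre-4.1 entry point: try the delegate first, fall back to Sedona.
  // Kept as a defensive measure on 4.1+.
  override def parsePlan(sqlText: String): LogicalPlan =
    try delegate.parsePlan(sqlText)
    catch { case _: Exception => parseSedonaSql(sqlText) }

  // Spark 4.1 routes SQL through this entry point and never calls parsePlan,
  // so the same delegate-first guard must exist here too.
  def parsePlanWithParameters(sqlText: String): LogicalPlan =
    try delegate.parsePlan(sqlText)
    catch { case _: Exception => parseSedonaSql(sqlText) }
}
```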

Function registration smart guard

Spark 4.1 introduces built-in geospatial ST functions, but these are not yet ready for use. To ensure Sedona's implementations take precedence (a sketch of the registration guard follows this list):

  • Register Sedona functions with overrideIfExists = true so they overwrite Spark's native ST functions
  • Skip re-registration of already-registered Sedona functions to avoid unnecessary overhead
  • Set spark.sql.geospatial.enabled=false in spark-4.1 tests to prevent Spark's native functions from shadowing Sedona's
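
A rough sketch of the guard, assuming code in an org.apache.spark.sql.* package (as Sedona's registrator is) so that sessionState is accessible; the "already ours" test by class-name prefix is illustrative, not Sedona's actual check:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.FunctionIdentifier
import org.apache.spark.sql.catalyst.analysis.FunctionRegistry.FunctionBuilder
import org.apache.spark.sql.catalyst.expressions.ExpressionInfo

object RegistrationGuardSketch {
  def register(
      spark: SparkSession,
      name: FunctionIdentifier,
      info: ExpressionInfo,
      builder: FunctionBuilder): Unit = {
    val registry = spark.sessionState.functionRegistry
    // Skip the no-op case: this name already resolves to a Sedona expression.
    val alreadySedona = registry
      .lookupFunction(name)
      .exists(_.getClassName.startsWith("org.apache.sedona"))
    if (!alreadySedona) {
      // registerFunction replaces any existing entry, so a native Spark 4.1
      // ST_* function registered under the same name is overwritten.
      registry.registerFunction(name, info, builder)
    }
  }
}
```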

commons-collections 3.x removal

Replaced org.apache.commons.collections.list.UnmodifiableList usage with java.util.Collections.unmodifiableList() in OsmRelation.java to avoid pulling in commons-collections 3.x (which has known CVEs and is not on Spark 4.1's classpath).
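
The swap is a one-line JDK replacement; shown in Scala for consistency (OsmRelation.java does the equivalent in Java):

```scala
import java.util.{Collections, List => JList}

object UnmodifiableListSketch {
  // Before: org.apache.commons.collections.list.UnmodifiableList.decorate(members)
  // After: the JDK-provided read-only view; no commons-collections 3.x needed.
  def readOnlyView(members: JList[String]): JList[String] =
    Collections.unmodifiableList(members)
}
```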

OsmReaderTest self-join fix

Spark 4.1 rejects ambiguous self-joins. Fixed OsmReaderTest by aliasing DataFrames before joining.
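
Illustratively (the column names here are hypothetical, not OsmReaderTest's actual schema):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

object SelfJoinSketch {
  // Spark 4.1 rejects joining a DataFrame with itself when column references
  // are ambiguous; aliasing both sides makes each column's provenance explicit.
  def selfJoin(nodes: DataFrame): DataFrame = {
    val left = nodes.alias("l")
    val right = nodes.alias("r")
    left.join(right, col("l.ref") === col("r.id"))
  }
}
```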

PySpark 4.1 binary type fix

PySpark 4.1 returns bytes (not bytearray) for BinaryType columns. Fixed ST_AsBinary/ST_AsEWKB tests to accept both types via isinstance(actual_result, (bytes, bytearray)).

Python support

  • Split pyspark dependency with python_version markers in python/pyproject.toml to handle PySpark 4.1 requiring Python 3.10+

CI workflows

  • java.yml: Added Spark 4.1.1 + Scala 2.13.17 + JDK 17 matrix entries (both compile and unit-test jobs)
  • python.yml: Added Spark 4.1.1 matrix entry with python_version >= '3.10'
  • example.yml: Added Spark 4.1 matrix entry

Documentation

  • Updated docs/setup/maven-coordinates.md with Spark 4.1 artifact coordinates
  • Updated docs/setup/platform.md compatibility table (Spark 4.1 requires Scala 2.13 and Python 3.10+)
  • Updated docs/community/publish.md release checklist

How was this PR tested?

  • All Spark/Scala/JDK combinations compile successfully:
    • Spark 3.4 + Scala 2.12 + JDK 11
    • Spark 3.4 + Scala 2.13 + JDK 11
    • Spark 3.5 + Scala 2.12 + JDK 11
    • Spark 4.0 + Scala 2.13 + JDK 17
    • Spark 4.1 + Scala 2.13 + JDK 17
  • Spark 4.1 unit tests pass (SQLSyntaxTestScala, OsmReaderTest, etc.)
  • PySpark 4.1 tests pass (ST_AsBinary, ST_AsEWKB binary type handling)

Key files changed

| Area | Files |
| --- | --- |
| Build | pom.xml, spark/pom.xml, spark/common/pom.xml, spark/spark-4.1/pom.xml |
| Spark 4.1 module | spark/spark-4.1/src/ (all files) |
| UDT workaround | GeometryUDT.scala, GeographyUDT.scala, RasterUDT.scala, plus ~50 files with schema changes |
| Parser fix | SedonaSqlParser.scala (spark-4.1) |
| Function registration | SedonaRegistrator.scala (spark-4.1) |
| API compat | ParquetColumnVector.java, Functions.scala |
| commons-collections | OsmRelation.java |
| Python | python/pyproject.toml, python/tests/sql/test_dataframe_api.py |
| CI | .github/workflows/java.yml, python.yml, example.yml |
| Docs | docs/setup/maven-coordinates.md, platform.md, docs/community/publish.md |

github-actions bot added the docs, sedona-spark, github_actions, and root labels on Feb 12, 2026

Commit messages

- Add sedona-spark-4.1 Maven profile (Spark 4.1.0, Scala 2.13.17, Hadoop 3.4.1)
- Create spark/spark-4.1 module based on spark-4.0
- Fix Geometry import ambiguity (Spark 4.1 adds o.a.s.sql.types.Geometry)
- Fix WritableColumnVector.setAllNull() removal (replaced by setMissing() in 4.1)
- Add sessionUUID parameter to ArrowPythonWithNamedArgumentRunner (new in 4.1)
- Update docs (maven-coordinates, platform, publish)
- Update CI workflows (java, example, python, docker-build)

Spark 4.1's RowEncoder calls udt.getClass directly, which returns the Scala module class (e.g. GeometryUDT$) with a private constructor for case objects, causing EXPRESSION_DECODING_FAILED errors.

Fix: Add apply() methods to GeometryUDT, GeographyUDT, and RasterUDT case objects that return new class instances, and use UDT() instead of the bare singleton throughout schema construction code. This ensures getClass returns the public class with an accessible constructor.

Also:
- Revert docker-build.yml (no Spark 4.1 in Docker builds)
- Bump pyspark upper bound from <4.1.0 to <4.2.0
- Bump Spark 4.1.0 to 4.1.1 in CI and POM
- Fix Scala 2.13.12 vs 2.13.17 mismatch in scala2.13 profile

Spark 4.1 no longer provides commons-collections 3.x transitively. Replace FilterIterator with Java 8 stream filtering in DuplicatesFilter, and IteratorUtils.toList with StreamSupport in the test.
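
A sketch of that stream-based replacement (names are hypothetical; DuplicatesFilter's real predicate differs):

```scala
import java.util.Spliterators
import java.util.stream.StreamSupport

object StreamFilterSketch {
  // Replaces commons-collections' FilterIterator with a lazily evaluated
  // Java 8 stream filter; no extra dependency required.
  def withoutDuplicates[T](
      it: java.util.Iterator[T],
      seen: java.util.Set[T]): java.util.Iterator[T] =
    StreamSupport
      .stream(Spliterators.spliteratorUnknownSize(it, 0), false)
      .filter(e => seen.add(e)) // Set.add returns false for repeats
      .iterator()
}
```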

In Spark 4.1, SparkSqlParser introduced an override for parsePlanWithParameters that bypasses parsePlan entirely. SedonaSqlParser only overrode parsePlan, so its SQL parsing interception was never invoked on Spark 4.1.

Fix by also overriding parsePlanWithParameters in SedonaSqlParser to use the same delegate-first, Sedona-fallback pattern.

Also disable spark.sql.geospatial.enabled in tests to prevent Spark 4.1+ native geospatial functions from shadowing Sedona's ST functions.

PySpark 4.1 returns BinaryType columns as bytes instead of bytearray. Update the isinstance check to handle both types so ST_AsBinary and ST_AsEWKB test results are properly hex-encoded before comparison.

Copilot AI (Contributor) left a comment

Pull request overview

This pull request adds comprehensive support for Apache Spark 4.1 in Sedona, building upon the existing Spark 3.4, 3.5, and 4.0 support.

Changes:

  • Added Spark 4.1.1 build profiles and module structure (Maven POM files, CI workflows)
  • Implemented API compatibility fixes for Spark 4.1 changes (ParquetColumnVector reflection, SedonaSqlParser, SedonaArrowEvalPythonExec)
  • Applied SPARK-52671 UDT workaround by adding apply() factory methods to GeometryUDT, GeographyUDT, and RasterUDT case objects and replacing bare UDT references with UDT() calls throughout the codebase

Reviewed changes

Copilot reviewed 100 out of 100 changed files in this pull request and generated 1 comment.

Summary per file:

| File | Description |
| --- | --- |
| pom.xml | Added sedona-spark-4.1 Maven profile with Spark 4.1.1, Scala 2.13.17, Hadoop 3.4.1; updated Scala version to 2.13.17 for scala2.13 and sedona-spark-4.0 profiles |
| spark/pom.xml | Added spark-4.1 module entry in enable-all-submodules profile |
| spark/common/pom.xml | Added sedona-spark-4.1 profile with spark-sql-api dependency |
| spark/spark-4.1/pom.xml | New module POM with dependencies matching spark-4.0 structure |
| spark/spark-4.1/src/** | Complete Spark 4.1 module with main and test source files adapted from spark-4.0 |
| spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/UDT/*.scala | Added apply() factory methods to GeometryUDT, GeographyUDT, RasterUDT case objects |
| spark/*/src/**/*.scala | Replaced bare UDT singleton references with UDT() calls across all Spark modules (3.4, 3.5, 4.0, 4.1, common) |
| spark/common/src/main/java/.../ParquetColumnVector.java | Changed setAllNull() to reflection-based markAllNull() supporting both Spark ≤4.0 and ≥4.1 |
| spark/common/src/main/scala/.../Functions.scala | Added explicit Geometry import to resolve Spark 4.1 ambiguity |
| spark/spark-4.1/src/main/scala/.../SedonaArrowEvalPythonExec.scala | Added sessionUUID parameter required by Spark 4.1 ArrowEvalPythonExec |
| spark/spark-4.1/src/main/scala/.../SedonaSqlParser.scala | Overrode parsePlanWithParameters() for Spark 4.1 parse flow changes |
| python/pyproject.toml | Bumped pyspark upper bound from <4.1.0 to <4.2.0 (conditional on Python ≥3.10) |
| .github/workflows/*.yml | Added Spark 4.1.1 CI matrix entries; updated fail-fast to false |
| docs/setup/*.md | Updated documentation with Spark 4.1 compatibility tables and Maven coordinates |
| docs/community/publish.md | Updated release checklist for Spark 4.1 |


The comment below targets this import block in Functions.scala:

import org.apache.spark.sql.types._
import org.locationtech.jts.algorithm.MinimumBoundingCircle
import org.locationtech.jts.geom._
import org.locationtech.jts.geom.Geometry

Copilot AI commented Feb 14, 2026
The explicit import of org.locationtech.jts.geom.Geometry is added on line 35 but there's already a wildcard import of org.locationtech.jts.geom._ on line 34. The explicit import is redundant since the wildcard import already includes Geometry. While this doesn't cause compilation errors, it's unnecessary code duplication that could be simplified by removing the explicit import on line 35.

Suggested change (remove this line):
import org.locationtech.jts.geom.Geometry

jiayuasu marked this pull request as ready for review February 14, 2026 09:07
jiayuasu requested a review from jbampton as a code owner February 14, 2026 09:07
jiayuasu merged commit d72bb63 into master Feb 14, 2026 (42 checks passed)
jiayuasu added this to the sedona-1.9.0 milestone Feb 14, 2026