Skip to content

refactor(rust/sedona-spatial-join): Allow SpatialIndexBuilder and EvaluatedGeometryArray to be configurable#737

Open
paleolimbot wants to merge 15 commits intoapache:mainfrom
paleolimbot:configurable-index-builder
Open

refactor(rust/sedona-spatial-join): Allow SpatialIndexBuilder and EvaluatedGeometryArray to be configurable#737
paleolimbot wants to merge 15 commits intoapache:mainfrom
paleolimbot:configurable-index-builder

Conversation

@paleolimbot
Copy link
Member

@paleolimbot paleolimbot commented Mar 26, 2026

This PR is primarily in preparation for a Geography join, although it also extracts the configurable pieces required for a GPU join as well. In combination with #735 this (hopefully with tweaks) allows both the GPU join and the Geography join to be implemented in separate crates to minimize the disruption to sedona-spatial-join and unify the code pathways between all three of the default join, the GPU join, and the Geography join (and should in theory enable all of them to work with the partitioning approach for larger than memory input if I'm reading the code correctly).

The key piece that the Geography join requires is the ability to override the Cartesian bounding performed in EvaluatedGeometryArray::try_new(). Other than that I think we could use the default index at least to get started. The GPU join definitely needs to override the SpatialIndex and SpatialIndexBuilder so I also included this change. There are probably some details I'm missing for both but hopefully this takes care of most of the disruption.

I did this by introducing a trait SpatialJoinProvider and providing a default implementation such that all of the current spatial join pathways run through it (I'll highlight this inline shortly). The part that could especially use review is the part where I add the Rects to the spilled batches. This was primarily to avoid having pass the SpatialJoinProvider through the machinery that reads the spilled batches; however, it also avoids unnecessarily recomputing bounds (which is probably great for large polygons but maybe not so great for points).

@github-actions github-actions bot requested a review from prantogg March 26, 2026 16:58
@paleolimbot paleolimbot changed the title refactor(rust/sedona-spatial-join): Configurable spatial join provider refactor(rust/sedona-spatial-join): Allow SpatialIndexBuilder and EvaluatedGeometryArray to be configurable Mar 26, 2026
Comment on lines +34 to +61
/// Provider for join internals
///
/// This trait provides an extension point for overriding the evaluation
/// details of a spatial join. In particular it allows plugging in a custom
/// index for accellerated joins on specific hardware (e.g., GPU) and a custom
/// bounder for specific data types (e.g., geography).
pub(crate) trait SpatialJoinProvider: std::fmt::Debug + Send + Sync {
/// Create a new [SpatialIndexBuilder]
fn try_new_spatial_index_builder(
&self,
schema: SchemaRef,
spatial_predicate: SpatialPredicate,
options: SpatialJoinOptions,
join_type: JoinType,
probe_threads_count: usize,
metrics: SpatialJoinBuildMetrics,
) -> Result<Box<dyn SpatialIndexBuilder>>;

/// Create a new [EvaluatedGeometryArray]
///
/// Use [EvaluatedGeometryArray::try_new_with_rects] for custom computation of
/// bounding rectangles for specific items.
fn try_new_evaluated_array(
&self,
geometry_array: ArrayRef,
sedona_type: &SedonaType,
) -> Result<EvaluatedGeometryArray>;
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the extension point that is the primary motivation for this PR. Implementing it in this way allows all of the existing join pathways to run through this trait to reduce the test responsibility for custom joins.

}
let batch = interleave_record_batch(record_batches, indices)?;
let geom_array = interleave_geometry_array(geom_arrays, indices)?;
let geom_array = EvaluatedGeometryArray::interleave(geom_arrays, indices)?;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did this to avoid having to pipe through the SpatialJoinEvaluator here; however, I think it is also positive (the rectangles from the input EvaluatedGeometryArray are used instead of being recomputed).

options: &SpatialJoinOptions,
) -> usize;

fn operand_evaluator(&self) -> Arc<dyn OperandEvaluator>;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I needed this because the OperandEvaluator is one of the places where EvaluatedGeometryArrays are created and this eliminated one more place where I had to pipe through the SpatialJoinProvider

Comment on lines +76 to +81
let rect_field = Field::new(
"rect",
DataType::new_fixed_size_list(DataType::Float32, 4, true),
true,
);
let spill_schema = Schema::new(vec![data_struct_field, geom_field, dist_field, rect_field]);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spilling the rectangles here was primarily to avoid having to pipe the SpatialIndexProvider through the machinery that reads the batches, which is a verbose change. This is potentially faster because it avoid recomputing bounds and potentially slower because it increases the size of the spilled batch.

Comment on lines -652 to -658
fn interleave_geometry_array(
geom_arrays: &[&EvaluatedGeometryArray],
indices: &[(usize, usize)],
) -> Result<EvaluatedGeometryArray> {
if geom_arrays.is_empty() {
return sedona_internal_err!("interleave_geometry_array requires at least one batch");
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was just moved to EvaluatedGeometryArray (with its tests)

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Refactors sedona-spatial-join to introduce a SpatialJoinProvider extension point intended to make spatial join internals (spatial index building and geometry bound computation) configurable for future Geography and GPU join implementations, while keeping default behavior intact.

Changes:

  • Introduces SpatialJoinProvider (+ default implementation) and threads it through probe evaluation and index-building plumbing.
  • Refactors EvaluatedGeometryArray to support construction with precomputed rects and adds an interleave helper.
  • Persists precomputed rects in spilled batches to avoid recomputation when reading spilled data.

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
rust/sedona-spatial-join/src/stream.rs Passes a join provider into operand evaluator creation for probe-side evaluation.
rust/sedona-spatial-join/src/prepare.rs Wires a provider into PartitionedIndexProvider construction (currently defaulted).
rust/sedona-spatial-join/src/partitioning/stream_repartitioner.rs Switches evaluated-geometry interleave logic to EvaluatedGeometryArray::interleave.
rust/sedona-spatial-join/src/operand_evaluator.rs Adds provider-aware geometry evaluation, try_new_with_rects, and interleave helpers (incl. distance interleave).
rust/sedona-spatial-join/src/lib.rs Adds the new join_provider module.
rust/sedona-spatial-join/src/join_provider.rs Defines SpatialJoinProvider and DefaultSpatialJoinProvider.
rust/sedona-spatial-join/src/index/spatial_index_builder.rs Extends builder API with operand_evaluator() and changes finish signature.
rust/sedona-spatial-join/src/index/partitioned_index_provider.rs Uses SpatialJoinProvider to instantiate spatial index builders.
rust/sedona-spatial-join/src/index/default_spatial_index_builder.rs Implements updated builder interface and creates operand evaluator via provider path.
rust/sedona-spatial-join/src/index/default_spatial_index.rs Updates evaluator construction and adjusts tests for new builder mutability.
rust/sedona-spatial-join/src/index/build_side_collector.rs Uses builder-provided operand evaluator (instead of creating one directly).
rust/sedona-spatial-join/src/evaluated_batch/spill.rs Adds rect column to spill schema and reconstructs evaluated arrays from stored rects.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 8 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@paleolimbot paleolimbot marked this pull request as ready for review March 26, 2026 20:56
@paleolimbot
Copy link
Member Author

@pwrliang Do you have the bandwidth to review this? (and also let me know if this isn't going to work for the GPU join!?)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants