refactor(rust/sedona-spatial-join): Allow SpatialIndexBuilder and EvaluatedGeometryArray to be configurable#737
Conversation
| /// Provider for join internals | ||
| /// | ||
| /// This trait provides an extension point for overriding the evaluation | ||
| /// details of a spatial join. In particular it allows plugging in a custom | ||
| /// index for accellerated joins on specific hardware (e.g., GPU) and a custom | ||
| /// bounder for specific data types (e.g., geography). | ||
| pub(crate) trait SpatialJoinProvider: std::fmt::Debug + Send + Sync { | ||
| /// Create a new [SpatialIndexBuilder] | ||
| fn try_new_spatial_index_builder( | ||
| &self, | ||
| schema: SchemaRef, | ||
| spatial_predicate: SpatialPredicate, | ||
| options: SpatialJoinOptions, | ||
| join_type: JoinType, | ||
| probe_threads_count: usize, | ||
| metrics: SpatialJoinBuildMetrics, | ||
| ) -> Result<Box<dyn SpatialIndexBuilder>>; | ||
|
|
||
| /// Create a new [EvaluatedGeometryArray] | ||
| /// | ||
| /// Use [EvaluatedGeometryArray::try_new_with_rects] for custom computation of | ||
| /// bounding rectangles for specific items. | ||
| fn try_new_evaluated_array( | ||
| &self, | ||
| geometry_array: ArrayRef, | ||
| sedona_type: &SedonaType, | ||
| ) -> Result<EvaluatedGeometryArray>; | ||
| } |
There was a problem hiding this comment.
This is the extension point that is the primary motivation for this PR. Implementing it in this way allows all of the existing join pathways to run through this trait to reduce the test responsibility for custom joins.
| } | ||
| let batch = interleave_record_batch(record_batches, indices)?; | ||
| let geom_array = interleave_geometry_array(geom_arrays, indices)?; | ||
| let geom_array = EvaluatedGeometryArray::interleave(geom_arrays, indices)?; |
There was a problem hiding this comment.
I did this to avoid having to pipe through the SpatialJoinEvaluator here; however, I think it is also positive (the rectangles from the input EvaluatedGeometryArray are used instead of being recomputed).
| options: &SpatialJoinOptions, | ||
| ) -> usize; | ||
|
|
||
| fn operand_evaluator(&self) -> Arc<dyn OperandEvaluator>; |
There was a problem hiding this comment.
I needed this because the OperandEvaluator is one of the places where EvaluatedGeometryArrays are created and this eliminated one more place where I had to pipe through the SpatialJoinProvider
| let rect_field = Field::new( | ||
| "rect", | ||
| DataType::new_fixed_size_list(DataType::Float32, 4, true), | ||
| true, | ||
| ); | ||
| let spill_schema = Schema::new(vec![data_struct_field, geom_field, dist_field, rect_field]); |
There was a problem hiding this comment.
Spilling the rectangles here was primarily to avoid having to pipe the SpatialIndexProvider through the machinery that reads the batches, which is a verbose change. This is potentially faster because it avoid recomputing bounds and potentially slower because it increases the size of the spilled batch.
| fn interleave_geometry_array( | ||
| geom_arrays: &[&EvaluatedGeometryArray], | ||
| indices: &[(usize, usize)], | ||
| ) -> Result<EvaluatedGeometryArray> { | ||
| if geom_arrays.is_empty() { | ||
| return sedona_internal_err!("interleave_geometry_array requires at least one batch"); | ||
| } |
There was a problem hiding this comment.
This was just moved to EvaluatedGeometryArray (with its tests)
There was a problem hiding this comment.
Pull request overview
Refactors sedona-spatial-join to introduce a SpatialJoinProvider extension point intended to make spatial join internals (spatial index building and geometry bound computation) configurable for future Geography and GPU join implementations, while keeping default behavior intact.
Changes:
- Introduces
SpatialJoinProvider(+ default implementation) and threads it through probe evaluation and index-building plumbing. - Refactors
EvaluatedGeometryArrayto support construction with precomputed rects and adds an interleave helper. - Persists precomputed rects in spilled batches to avoid recomputation when reading spilled data.
Reviewed changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| rust/sedona-spatial-join/src/stream.rs | Passes a join provider into operand evaluator creation for probe-side evaluation. |
| rust/sedona-spatial-join/src/prepare.rs | Wires a provider into PartitionedIndexProvider construction (currently defaulted). |
| rust/sedona-spatial-join/src/partitioning/stream_repartitioner.rs | Switches evaluated-geometry interleave logic to EvaluatedGeometryArray::interleave. |
| rust/sedona-spatial-join/src/operand_evaluator.rs | Adds provider-aware geometry evaluation, try_new_with_rects, and interleave helpers (incl. distance interleave). |
| rust/sedona-spatial-join/src/lib.rs | Adds the new join_provider module. |
| rust/sedona-spatial-join/src/join_provider.rs | Defines SpatialJoinProvider and DefaultSpatialJoinProvider. |
| rust/sedona-spatial-join/src/index/spatial_index_builder.rs | Extends builder API with operand_evaluator() and changes finish signature. |
| rust/sedona-spatial-join/src/index/partitioned_index_provider.rs | Uses SpatialJoinProvider to instantiate spatial index builders. |
| rust/sedona-spatial-join/src/index/default_spatial_index_builder.rs | Implements updated builder interface and creates operand evaluator via provider path. |
| rust/sedona-spatial-join/src/index/default_spatial_index.rs | Updates evaluator construction and adjusts tests for new builder mutability. |
| rust/sedona-spatial-join/src/index/build_side_collector.rs | Uses builder-provided operand evaluator (instead of creating one directly). |
| rust/sedona-spatial-join/src/evaluated_batch/spill.rs | Adds rect column to spill schema and reconstructs evaluated arrays from stored rects. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 13 out of 13 changed files in this pull request and generated 8 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
rust/sedona-spatial-join/src/index/partitioned_index_provider.rs
Outdated
Show resolved
Hide resolved
|
@pwrliang Do you have the bandwidth to review this? (and also let me know if this isn't going to work for the GPU join!?) |
This PR is primarily in preparation for a Geography join, although it also extracts the configurable pieces required for a GPU join as well. In combination with #735 this (hopefully with tweaks) allows both the GPU join and the Geography join to be implemented in separate crates to minimize the disruption to sedona-spatial-join and unify the code pathways between all three of the default join, the GPU join, and the Geography join (and should in theory enable all of them to work with the partitioning approach for larger than memory input if I'm reading the code correctly).
The key piece that the Geography join requires is the ability to override the Cartesian bounding performed in
EvaluatedGeometryArray::try_new(). Other than that I think we could use the default index at least to get started. The GPU join definitely needs to override the SpatialIndex and SpatialIndexBuilder so I also included this change. There are probably some details I'm missing for both but hopefully this takes care of most of the disruption.I did this by introducing a
trait SpatialJoinProviderand providing a default implementation such that all of the current spatial join pathways run through it (I'll highlight this inline shortly). The part that could especially use review is the part where I add theRects to the spilled batches. This was primarily to avoid having pass theSpatialJoinProviderthrough the machinery that reads the spilled batches; however, it also avoids unnecessarily recomputing bounds (which is probably great for large polygons but maybe not so great for points).