Conversation
Add _metadata hidden column support to the GeoPackage DataSource V2 reader. Implement SupportsMetadataColumns on GeoPackageTable so that reading GeoPackage files into a DataFrame exposes the standard _metadata hidden struct containing file_path, file_name, file_size, file_block_start, file_block_length, and file_modification_time. Changes across all four Spark version modules (3.4, 3.5, 4.0, 4.1):
- GeoPackageTable: mix in SupportsMetadataColumns, define the _metadata MetadataColumn with the standard six-field struct type
- GeoPackageScanBuilder: override pruneColumns() to capture the pruned metadata schema requested by Spark's column pruning optimizer
- GeoPackageScan: accept metadataSchema, override readSchema() to append metadata fields, pass the schema to the partition reader factory
- GeoPackagePartitionReaderFactory: construct metadata values from the PartitionedFile, wrap the base reader in PartitionReaderWithMetadata that joins data rows with metadata values

Tests verify the metadata values against the filesystem, filtering, and projection.
Pull request overview
This PR adds support for the _metadata hidden column to the GeoPackage DataSource V2 reader, enabling users to query file-level metadata (file_path, file_name, file_size, file_block_start, file_block_length, file_modification_time) on GeoPackage DataFrames. The implementation follows Spark's standard pattern for metadata columns and is consistently applied across all four Spark version modules (3.4, 3.5, 4.0, 4.1).
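For orientation, here is a minimal sketch of that standard pattern, assuming a hypothetical `ExampleTable` (this is not the PR's exact code): a DataSource V2 `Table` mixes in `SupportsMetadataColumns` and advertises the standard six-field `_metadata` struct.

```scala
// Minimal sketch of SupportsMetadataColumns; ExampleTable and its data schema
// are hypothetical, but the six metadata fields match Spark's file sources.
import java.util
import org.apache.spark.sql.connector.catalog.{MetadataColumn, SupportsMetadataColumns, Table, TableCapability}
import org.apache.spark.sql.types._

class ExampleTable extends Table with SupportsMetadataColumns {
  override def name(): String = "example"
  override def schema(): StructType = new StructType().add("geom", BinaryType)
  override def capabilities(): util.Set[TableCapability] =
    util.EnumSet.of(TableCapability.BATCH_READ)

  // The standard six-field struct used by Spark's built-in file sources.
  private val metadataStruct = new StructType()
    .add("file_path", StringType, nullable = false)
    .add("file_name", StringType, nullable = false)
    .add("file_size", LongType, nullable = false)
    .add("file_block_start", LongType, nullable = false)
    .add("file_block_length", LongType, nullable = false)
    .add("file_modification_time", TimestampType, nullable = false)

  override def metadataColumns(): Array[MetadataColumn] = Array(new MetadataColumn {
    override def name(): String = "_metadata"
    override def dataType(): DataType = metadataStruct
    override def isNullable: Boolean = false
  })
}
```

Spark's analyzer then resolves `_metadata` references against this column and pushes the pruned sub-fields back down through the scan builder.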
Changes:
- Implemented the `SupportsMetadataColumns` interface in `GeoPackageTable` to expose the `_metadata` hidden column
- Added metadata schema pruning logic in `GeoPackageScanBuilder` to capture requested metadata fields
- Modified `GeoPackageScan` and `GeoPackagePartitionReaderFactory` to propagate and populate metadata values
- Added comprehensive test coverage to verify schema, hidden column semantics, value correctness, filtering, and projection
Reviewed changes
Copilot reviewed 20 out of 20 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| spark/spark-4.1/src/main/scala/org/apache/sedona/sql/datasources/geopackage/GeoPackageTable.scala | Implements SupportsMetadataColumns and defines _metadata column structure with 6 standard fields |
| spark/spark-4.1/src/main/scala/org/apache/sedona/sql/datasources/geopackage/GeoPackageScanBuilder.scala | Overrides pruneColumns to capture pruned metadata schema from Spark's optimizer (see the sketch after this table) |
| spark/spark-4.1/src/main/scala/org/apache/sedona/sql/datasources/geopackage/GeoPackageScan.scala | Accepts metadataSchema parameter and includes it in readSchema output |
| spark/spark-4.1/src/main/scala/org/apache/sedona/sql/datasources/geopackage/GeoPackagePartitionReaderFactory.scala | Constructs metadata values from PartitionedFile and wraps base reader with PartitionReaderWithMetadata |
| spark/spark-4.1/src/test/scala/org/apache/sedona/sql/GeoPackageReaderTest.scala | Adds 8 test cases covering schema validation, hidden column semantics, value correctness, filtering, and projection |
| spark/spark-4.0/src/main/scala/org/apache/sedona/sql/datasources/geopackage/*.scala | Identical implementation for Spark 4.0 |
| spark/spark-4.0/src/test/scala/org/apache/sedona/sql/GeoPackageReaderTest.scala | Identical test coverage for Spark 4.0 |
| spark/spark-3.5/src/main/scala/org/apache/sedona/sql/datasources/geopackage/*.scala | Identical implementation for Spark 3.5 |
| spark/spark-3.5/src/test/scala/org/apache/sedona/sql/GeoPackageReaderTest.scala | Identical test coverage for Spark 3.5 |
| spark/spark-3.4/src/main/scala/org/apache/sedona/sql/datasources/geopackage/*.scala | Identical implementation for Spark 3.4 |
| spark/spark-3.4/src/test/scala/org/apache/sedona/sql/GeoPackageReaderTest.scala | Identical test coverage for Spark 3.4 |
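The `GeoPackageScanBuilder` and `GeoPackageScan` rows above describe the pruning hand-off; a minimal sketch of that pattern, with hypothetical names (`MetadataAwareScanBuilder` is not the PR's class):

```scala
// Sketch of how a ScanBuilder can capture the pruned metadata schema and how
// the resulting Scan reports the combined read schema. Names are illustrative.
import org.apache.spark.sql.connector.read.{Scan, ScanBuilder, SupportsPushDownRequiredColumns}
import org.apache.spark.sql.types.StructType

class MetadataAwareScanBuilder(dataSchema: StructType)
  extends ScanBuilder with SupportsPushDownRequiredColumns {

  private var prunedDataSchema: StructType = dataSchema
  private var prunedMetadataSchema: StructType = new StructType()

  // Spark's column pruning optimizer passes down only the required columns;
  // "_metadata" may appear here with only a subset of its six sub-fields.
  override def pruneColumns(requiredSchema: StructType): Unit = {
    val (meta, data) = requiredSchema.fields.partition(_.name == "_metadata")
    prunedDataSchema = StructType(data)
    prunedMetadataSchema = StructType(meta)
  }

  override def build(): Scan = new Scan {
    // readSchema() appends the requested metadata fields to the data schema.
    override def readSchema(): StructType =
      StructType(prunedDataSchema.fields ++ prunedMetadataSchema.fields)
  }
}
```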
Review comment:

```diff
  import io.minio.{MakeBucketArgs, MinioClient, PutObjectArgs}
- import org.apache.spark.sql.{DataFrame, SparkSession}
+ import org.apache.spark.sql.{DataFrame, Row, SparkSession}
```

The `Row` import is unused and should be removed. The tests do not create or use `Row` objects directly.

Suggested change:

```diff
- import org.apache.spark.sql.{DataFrame, Row, SparkSession}
+ import org.apache.spark.sql.{DataFrame, SparkSession}
```
Review comment (same issue in another Spark version module's test file):

```diff
  import io.minio.{MakeBucketArgs, MinioClient}
- import org.apache.spark.sql.DataFrame
+ import org.apache.spark.sql.{DataFrame, Row}
```

The `Row` import is unused and should be removed. The tests do not create or use `Row` objects directly.

Suggested change:

```diff
- import org.apache.spark.sql.{DataFrame, Row}
+ import org.apache.spark.sql.DataFrame
```
Review comment (same issue in another Spark version module's test file):

```diff
  import io.minio.{MakeBucketArgs, MinioClient, PutObjectArgs}
- import org.apache.spark.sql.{DataFrame, SparkSession}
+ import org.apache.spark.sql.{DataFrame, Row, SparkSession}
```

The `Row` import is unused and should be removed. The tests do not create or use `Row` objects directly.

Suggested change:

```diff
- import org.apache.spark.sql.{DataFrame, Row, SparkSession}
+ import org.apache.spark.sql.{DataFrame, SparkSession}
```
Review comment (same issue in another Spark version module's test file):

```diff
  import io.minio.{MakeBucketArgs, MinioClient, PutObjectArgs}
- import org.apache.spark.sql.{DataFrame, SparkSession}
+ import org.apache.spark.sql.{DataFrame, Row, SparkSession}
```

The `Row` import is unused and should be removed. The tests do not create or use `Row` objects directly.

Suggested change:

```diff
- import org.apache.spark.sql.{DataFrame, Row, SparkSession}
+ import org.apache.spark.sql.{DataFrame, SparkSession}
```
Did you read the Contributor Guide?
Is this PR related to a ticket?
Yes, the PR name follows the format [GH-XXX] my subject. Closes #2651: "GeoPackage reader does not support _metadata hidden column".

What changes were proposed in this PR?
When reading GeoPackage files via the DataSource V2 API, the standard `_metadata` hidden column (containing `file_path`, `file_name`, `file_size`, `file_block_start`, `file_block_length`, `file_modification_time`) was missing from the DataFrame. This is because `GeoPackageTable` did not implement Spark's `SupportsMetadataColumns` interface.

This PR implements `_metadata` support across all four Spark version modules (3.4, 3.5, 4.0, 4.1) by modifying four source files per module:

- `GeoPackageTable`: implements `SupportsMetadataColumns` and defines the `_metadata` `MetadataColumn` with the standard six-field struct type.
- `GeoPackageScanBuilder`: overrides `pruneColumns()` to capture the pruned metadata schema requested by Spark's column pruning optimizer.
- `GeoPackageScan`: accepts a `metadataSchema` parameter, overrides `readSchema()` to append metadata fields to the output schema, and passes the schema to the partition reader factory.
- `GeoPackagePartitionReaderFactory`: constructs metadata values from the `PartitionedFile`, and wraps the base reader in a `PartitionReaderWithMetadata` that joins data rows with metadata using `JoinedRow` + `GenerateUnsafeProjection`. Correctly handles Spark's struct pruning by building only the requested sub-fields. A sketch of this wrapper follows.
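A hedged sketch of the reader wrapping described in the last bullet (class and parameter names are illustrative, not necessarily the PR's exact code):

```scala
// Sketch: wrap a PartitionReader so each data row is joined with a constant
// per-file metadata row and projected into a single output row.
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.{BoundReference, JoinedRow}
import org.apache.spark.sql.catalyst.expressions.codegen.GenerateUnsafeProjection
import org.apache.spark.sql.connector.read.PartitionReader
import org.apache.spark.sql.types.StructType

class PartitionReaderWithMetadata(
    base: PartitionReader[InternalRow],
    dataSchema: StructType,
    metadataSchema: StructType,
    metadataRow: InternalRow) // built once per PartitionedFile
  extends PartitionReader[InternalRow] {

  private val joinedRow = new JoinedRow()
  // Project (data fields ++ metadata fields) into one UnsafeRow.
  private val projection = GenerateUnsafeProjection.generate(
    (dataSchema.fields ++ metadataSchema.fields).zipWithIndex.map {
      case (field, ordinal) => BoundReference(ordinal, field.dataType, field.nullable)
    }.toSeq)

  override def next(): Boolean = base.next()
  override def get(): InternalRow = projection(joinedRow(base.get(), metadataRow))
  override def close(): Unit = base.close()
}
```

The per-file metadata row itself is populated from the `PartitionedFile` (path, name, size, block start/length, modification time); note that Spark exposes the modification time in milliseconds while `TimestampType` stores microseconds, so a unit conversion is needed when filling that field.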
After this change, users can query `_metadata` on GeoPackage DataFrames just like Parquet/ORC/CSV, as in the usage sketch below:
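A minimal usage sketch (the file path and `tableName` value are hypothetical; assumes a `spark` session is in scope, e.g. in `spark-shell`):

```scala
import org.apache.spark.sql.functions.col

// Hypothetical GeoPackage file and layer name.
val df = spark.read
  .format("geopackage")
  .option("tableName", "features")
  .load("/tmp/example.gpkg")

// _metadata is hidden: it does not appear in select("*") or in df.schema,
// but it can be selected explicitly, including individual sub-fields.
df.select(col("*"), col("_metadata.file_name"), col("_metadata.file_size")).show()
```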
How was this patch tested?

8 new test cases were added to `GeoPackageReaderTest` (in each Spark version module) covering:

- the `_metadata` struct contains all 6 expected fields with correct types
- `_metadata` does not appear in `select("*")` but can be explicitly selected (sketched below)
- `file_path`, `file_name`, `file_size`, `file_block_start`, `file_block_length`, and `file_modification_time` are verified against actual filesystem values using `java.io.File` APIs
- `_metadata` fields can be used in `WHERE` clauses
- `_metadata` fields can be selected alongside data columns

All tests pass on all four Spark versions.
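A hedged sketch of the hidden-column assertions referenced above (not the PR's exact test code; `sparkSession`, `path`, and the layer name are assumed to come from the test fixture):

```scala
val df = sparkSession.read
  .format("geopackage")
  .option("tableName", "features") // hypothetical layer name
  .load(path)

// Hidden from the default output...
assert(!df.columns.contains("_metadata"))

// ...but explicitly selectable, with struct pruning down to the requested sub-fields.
val meta = df.select("_metadata.file_name", "_metadata.file_size")
assert(meta.schema.fieldNames.sameElements(Array("file_name", "file_size")))
```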
Did this PR include necessary documentation updates?
No. The `_metadata` column is a standard Spark hidden column that is automatically available to users; no Sedona-specific API changes are introduced.