
[GH-2651] Add _metadata hidden column support for GeoPackage DataSource V2 reader #2654

Merged
jiayuasu merged 1 commit into master from fix/GH-2651-geopackage-metadata-columns
Feb 15, 2026

Conversation

@jiayuasu
Member

Did you read the Contributor Guide?

Is this PR related to a ticket?

What changes were proposed in this PR?

When reading GeoPackage files via the DataSource V2 API, the standard _metadata hidden column (containing file_path, file_name, file_size, file_block_start, file_block_length, file_modification_time) was missing from the DataFrame. This is because GeoPackageTable did not implement Spark's SupportsMetadataColumns interface.

This PR implements _metadata support across all four Spark version modules (3.4, 3.5, 4.0, 4.1) by modifying four source files per module:

  1. GeoPackageTable — Mixes in SupportsMetadataColumns and defines the _metadata MetadataColumn with the standard six-field struct type; a minimal sketch of this pattern follows the list below.
  2. GeoPackageScanBuilder — Overrides pruneColumns() to capture the pruned metadata schema requested by Spark's column pruning optimizer.
  3. GeoPackageScan — Accepts the metadataSchema parameter, overrides readSchema() to append metadata fields to the output schema, and passes the schema to the partition reader factory.
  4. GeoPackagePartitionReaderFactory — Constructs metadata values (path, name, size, block offset/length, modification time) from the PartitionedFile, and wraps the base reader in a PartitionReaderWithMetadata that joins data rows with metadata using JoinedRow + GenerateUnsafeProjection, correctly handling Spark's struct pruning by building only the requested sub-fields (see the reader-wrapper sketch after the usage example below).
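
As a rough illustration of item 1, here is a minimal sketch (an assumption, not the actual Sedona source) of how a DataSource V2 table can expose the _metadata hidden column through SupportsMetadataColumns; the trait name GeoPackageMetadataColumns is hypothetical:

import org.apache.spark.sql.connector.catalog.{MetadataColumn, SupportsMetadataColumns}
import org.apache.spark.sql.types._

// Hypothetical trait mixed into the Table implementation (illustration only).
trait GeoPackageMetadataColumns extends SupportsMetadataColumns {

  // The standard six-field struct Spark uses for file-level metadata.
  private val metadataStruct: StructType = StructType(Seq(
    StructField("file_path", StringType, nullable = false),
    StructField("file_name", StringType, nullable = false),
    StructField("file_size", LongType, nullable = false),
    StructField("file_block_start", LongType, nullable = false),
    StructField("file_block_length", LongType, nullable = false),
    StructField("file_modification_time", TimestampType, nullable = false)))

  override def metadataColumns(): Array[MetadataColumn] = Array(
    new MetadataColumn {
      override def name(): String = "_metadata"
      override def dataType(): DataType = metadataStruct
      override def isNullable(): Boolean = false
    })
}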

After this change, users can query _metadata on GeoPackage DataFrames just like Parquet/ORC/CSV:

import spark.implicits._ // for the $"..." column syntax used in the filter below

val df = spark.read.format("geopackage").option("tableName", "my_table").load("/path/to/data.gpkg")
df.select("geometry", "_metadata.file_name", "_metadata.file_size").show()
df.filter($"_metadata.file_name" === "specific.gpkg").show()
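
To illustrate item 4 in the list above, here is a hedged sketch of the reader-wrapper idea; the class shape and field names are illustrative, not the exact Sedona implementation. The metadata row is computed once per PartitionedFile, and each data row is joined with it and copied into an UnsafeRow matching readSchema() (UnsafeProjection.create delegates to GenerateUnsafeProjection internally):

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.{JoinedRow, UnsafeProjection}
import org.apache.spark.sql.connector.read.PartitionReader
import org.apache.spark.sql.types.StructType

// Illustrative wrapper: appends a constant metadata row to every data row.
class PartitionReaderWithMetadata(
    base: PartitionReader[InternalRow],
    dataSchema: StructType,
    metadataSchema: StructType,  // possibly pruned to the requested sub-fields
    metadataRow: InternalRow)    // built once from the PartitionedFile (path, size, offsets, mtime)
  extends PartitionReader[InternalRow] {

  private val joined = new JoinedRow()
  private val toUnsafe = UnsafeProjection.create(StructType(dataSchema ++ metadataSchema))

  override def next(): Boolean = base.next()
  override def get(): InternalRow = toUnsafe(joined(base.get(), metadataRow))
  override def close(): Unit = base.close()
}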

How was this patch tested?

8 new test cases added to GeoPackageReaderTest (per Spark version module) covering:

  • Schema validation: _metadata struct contains all 6 expected fields with correct types
  • Hidden column semantics: _metadata does not appear in select(*) but can be explicitly selected
  • Value correctness: file_path, file_name, file_size, file_block_start, file_block_length, and file_modification_time are verified against actual filesystem values using java.io.File APIs (a hypothetical check of this kind is sketched after this list)
  • Filtering: _metadata fields can be used in WHERE clauses
  • Projection: _metadata fields can be selected alongside data columns
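
For instance, a value-correctness check along these lines (hypothetical path and table name, not the actual test code) compares the selected metadata fields against java.io.File:

import java.io.File

val gpkg = new File("/path/to/data.gpkg") // placeholder test file
val metaRow = spark.read
  .format("geopackage")
  .option("tableName", "my_table")
  .load(gpkg.getAbsolutePath)
  .select("_metadata.file_name", "_metadata.file_size")
  .head()

assert(metaRow.getString(0) == gpkg.getName)
assert(metaRow.getLong(1) == gpkg.length())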

All tests pass on all four Spark versions:

  • spark-3.4 (Scala 2.12): 18 tests passed
  • spark-3.5 (Scala 2.12): 18 tests passed
  • spark-4.0 (Scala 2.13): 18 tests passed
  • spark-4.1 (Scala 2.13): 18 tests passed

Did this PR include necessary documentation updates?

  • No. This PR does not change any public API, so no documentation updates are needed. The _metadata column is a standard Spark hidden column that becomes automatically available to users; no Sedona-specific API changes are introduced.

[GH-2651] Add _metadata hidden column support for GeoPackage DataSource V2 reader

Implement SupportsMetadataColumns on GeoPackageTable so that reading
GeoPackage files into a DataFrame exposes the standard _metadata hidden
struct containing file_path, file_name, file_size, file_block_start,
file_block_length, and file_modification_time.

Changes across all four Spark version modules (3.4, 3.5, 4.0, 4.1):

- GeoPackageTable: mix in SupportsMetadataColumns, define the _metadata
  MetadataColumn with the standard six-field struct type
- GeoPackageScanBuilder: override pruneColumns() to capture the pruned
  metadata schema requested by Spark column pruning optimizer
- GeoPackageScan: accept metadataSchema, override readSchema() to append
  metadata fields, pass schema to partition reader factory
- GeoPackagePartitionReaderFactory: construct metadata values from the
  PartitionedFile, wrap the base reader in PartitionReaderWithMetadata
  that joins data rows with metadata using JoinedRow + GenerateUnsafeProjection.

Add 8 test cases per Spark version module covering schema validation, hidden
column semantics, value correctness of the metadata values against the
filesystem, filtering, and projection.
Copilot AI (Contributor) left a comment

Pull request overview

This PR adds support for the _metadata hidden column to the GeoPackage DataSource V2 reader, enabling users to query file-level metadata (file_path, file_name, file_size, file_block_start, file_block_length, file_modification_time) on GeoPackage DataFrames. The implementation follows Spark's standard pattern for metadata columns and is consistently applied across all four Spark version modules (3.4, 3.5, 4.0, 4.1).

Changes:

  • Implemented SupportsMetadataColumns interface in GeoPackageTable to expose the _metadata hidden column
  • Added metadata schema pruning logic in GeoPackageScanBuilder to capture requested metadata fields
  • Modified GeoPackageScan and GeoPackagePartitionReaderFactory to propagate and populate metadata values
  • Added comprehensive test coverage to verify schema, hidden column semantics, value correctness, filtering, and projection

Reviewed changes

Copilot reviewed 20 out of 20 changed files in this pull request and generated 4 comments.

Summary per file:

  • spark/spark-4.1/src/main/scala/org/apache/sedona/sql/datasources/geopackage/GeoPackageTable.scala: Implements SupportsMetadataColumns and defines the _metadata column structure with 6 standard fields
  • spark/spark-4.1/src/main/scala/org/apache/sedona/sql/datasources/geopackage/GeoPackageScanBuilder.scala: Overrides pruneColumns to capture the pruned metadata schema from Spark's optimizer
  • spark/spark-4.1/src/main/scala/org/apache/sedona/sql/datasources/geopackage/GeoPackageScan.scala: Accepts the metadataSchema parameter and includes it in the readSchema output
  • spark/spark-4.1/src/main/scala/org/apache/sedona/sql/datasources/geopackage/GeoPackagePartitionReaderFactory.scala: Constructs metadata values from the PartitionedFile and wraps the base reader with PartitionReaderWithMetadata
  • spark/spark-4.1/src/test/scala/org/apache/sedona/sql/GeoPackageReaderTest.scala: Adds 8 test cases covering schema validation, hidden column semantics, value correctness, filtering, and projection
  • spark/spark-4.0/src/main/scala/org/apache/sedona/sql/datasources/geopackage/*.scala: Identical implementation for Spark 4.0
  • spark/spark-4.0/src/test/scala/org/apache/sedona/sql/GeoPackageReaderTest.scala: Identical test coverage for Spark 4.0
  • spark/spark-3.5/src/main/scala/org/apache/sedona/sql/datasources/geopackage/*.scala: Identical implementation for Spark 3.5
  • spark/spark-3.5/src/test/scala/org/apache/sedona/sql/GeoPackageReaderTest.scala: Identical test coverage for Spark 3.5
  • spark/spark-3.4/src/main/scala/org/apache/sedona/sql/datasources/geopackage/*.scala: Identical implementation for Spark 3.4
  • spark/spark-3.4/src/test/scala/org/apache/sedona/sql/GeoPackageReaderTest.scala: Identical test coverage for Spark 3.4



import io.minio.{MakeBucketArgs, MinioClient, PutObjectArgs}
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.{DataFrame, Row, SparkSession}

Copilot AI Feb 15, 2026

The Row import is unused and should be removed. The tests do not create or use Row objects directly.

Suggested change
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.{DataFrame, SparkSession}


import io.minio.{MakeBucketArgs, MinioClient}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.{DataFrame, Row}

Copilot AI Feb 15, 2026

The Row import is unused and should be removed. The tests do not create or use Row objects directly.

Suggested change
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.DataFrame


import io.minio.{MakeBucketArgs, MinioClient, PutObjectArgs}
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.{DataFrame, Row, SparkSession}

Copilot AI Feb 15, 2026

The Row import is unused and should be removed. The tests do not create or use Row objects directly.

Suggested change
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.{DataFrame, SparkSession}


import io.minio.{MakeBucketArgs, MinioClient, PutObjectArgs}
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.{DataFrame, Row, SparkSession}

Copilot AI Feb 15, 2026

The Row import is unused and should be removed. The tests do not create or use Row objects directly.

Suggested change
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.{DataFrame, SparkSession}

@jiayuasu merged commit e6e99f0 into master Feb 15, 2026
47 checks passed

Successfully merging this pull request may close these issues.

GeoPackage reader does not support _metadata hidden column
