
[GH-2651] Add _metadata hidden column support for GeoPackage DataSource V2 reader #2654

Merged
jiayuasu merged 1 commit into master from fix/GH-2651-geopackage-metadata-columns
Feb 15, 2026

Conversation

@jiayuasu
Member

Did you read the Contributor Guide?

Is this PR related to a ticket?

What changes were proposed in this PR?

When reading GeoPackage files via the DataSource V2 API, the standard _metadata hidden column (containing file_path, file_name, file_size, file_block_start, file_block_length, file_modification_time) was missing from the DataFrame. This is because GeoPackageTable did not implement Spark's SupportsMetadataColumns interface.

This PR implements _metadata support across all four Spark version modules (3.4, 3.5, 4.0, 4.1) by modifying four source files per module:

  1. GeoPackageTable — Mixes in SupportsMetadataColumns and defines the _metadata MetadataColumn with the standard six-field struct type; a minimal sketch of this pattern follows the list below.
  2. GeoPackageScanBuilder — Overrides pruneColumns() to capture the pruned metadata schema requested by Spark's column pruning optimizer.
  3. GeoPackageScan — Accepts the metadataSchema parameter, overrides readSchema() to append metadata fields to the output schema, and passes the schema to the partition reader factory.
  4. GeoPackagePartitionReaderFactory — Constructs metadata values (path, name, size, block offset/length, modification time) from the PartitionedFile, and wraps the base reader in a PartitionReaderWithMetadata that joins data rows with metadata using JoinedRow + GenerateUnsafeProjection, correctly handling Spark's struct pruning by building only the requested sub-fields (see the reader-wrapper sketch after the usage example below).
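
As a rough illustration of item 1, here is a minimal sketch (an assumption, not the actual Sedona source) of how a DataSource V2 table can expose the _metadata hidden column through SupportsMetadataColumns; the trait name GeoPackageMetadataColumns is hypothetical:

import org.apache.spark.sql.connector.catalog.{MetadataColumn, SupportsMetadataColumns}
import org.apache.spark.sql.types._

// Hypothetical trait mixed into the Table implementation (illustration only).
trait GeoPackageMetadataColumns extends SupportsMetadataColumns {

  // The standard six-field struct Spark uses for file-level metadata.
  private val metadataStruct: StructType = StructType(Seq(
    StructField("file_path", StringType, nullable = false),
    StructField("file_name", StringType, nullable = false),
    StructField("file_size", LongType, nullable = false),
    StructField("file_block_start", LongType, nullable = false),
    StructField("file_block_length", LongType, nullable = false),
    StructField("file_modification_time", TimestampType, nullable = false)))

  override def metadataColumns(): Array[MetadataColumn] = Array(
    new MetadataColumn {
      override def name(): String = "_metadata"
      override def dataType(): DataType = metadataStruct
      override def isNullable(): Boolean = false
    })
}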

After this change, users can query _metadata on GeoPackage DataFrames just like Parquet/ORC/CSV:

import spark.implicits._ // for the $"..." column syntax used in the filter below

val df = spark.read.format("geopackage").option("tableName", "my_table").load("/path/to/data.gpkg")
df.select("geometry", "_metadata.file_name", "_metadata.file_size").show()
df.filter($"_metadata.file_name" === "specific.gpkg").show()
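
To illustrate item 4 in the list above, here is a hedged sketch of the reader-wrapper idea; the class shape and field names are illustrative, not the exact Sedona implementation. The metadata row is computed once per PartitionedFile, and each data row is joined with it and copied into an UnsafeRow matching readSchema() (UnsafeProjection.create delegates to GenerateUnsafeProjection internally):

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.{JoinedRow, UnsafeProjection}
import org.apache.spark.sql.connector.read.PartitionReader
import org.apache.spark.sql.types.StructType

// Illustrative wrapper: appends a constant metadata row to every data row.
class PartitionReaderWithMetadata(
    base: PartitionReader[InternalRow],
    dataSchema: StructType,
    metadataSchema: StructType,  // possibly pruned to the requested sub-fields
    metadataRow: InternalRow)    // built once from the PartitionedFile (path, size, offsets, mtime)
  extends PartitionReader[InternalRow] {

  private val joined = new JoinedRow()
  private val toUnsafe = UnsafeProjection.create(StructType(dataSchema ++ metadataSchema))

  override def next(): Boolean = base.next()
  override def get(): InternalRow = toUnsafe(joined(base.get(), metadataRow))
  override def close(): Unit = base.close()
}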

How was this patch tested?

8 new test cases added to GeoPackageReaderTest (per Spark version module) covering:

  • Schema validation: _metadata struct contains all 6 expected fields with correct types
  • Hidden column semantics: _metadata does not appear in select(*) but can be explicitly selected
  • Value correctness: file_path, file_name, file_size, file_block_start, file_block_length, and file_modification_time are verified against actual filesystem values using java.io.File APIs (a hypothetical check of this kind is sketched after this list)
  • Filtering: _metadata fields can be used in WHERE clauses
  • Projection: _metadata fields can be selected alongside data columns
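
For instance, a value-correctness check along these lines (hypothetical path and table name, not the actual test code) compares the selected metadata fields against java.io.File:

import java.io.File

val gpkg = new File("/path/to/data.gpkg") // placeholder test file
val metaRow = spark.read
  .format("geopackage")
  .option("tableName", "my_table")
  .load(gpkg.getAbsolutePath)
  .select("_metadata.file_name", "_metadata.file_size")
  .head()

assert(metaRow.getString(0) == gpkg.getName)
assert(metaRow.getLong(1) == gpkg.length())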

All tests pass on all four Spark versions:

  • spark-3.4 (Scala 2.12): 18 tests passed
  • spark-3.5 (Scala 2.12): 18 tests passed
  • spark-4.0 (Scala 2.13): 18 tests passed
  • spark-4.1 (Scala 2.13): 18 tests passed

Did this PR include necessary documentation updates?

  • No. This PR does not change any public API, so no documentation updates are needed. The _metadata column is a standard Spark hidden column that becomes automatically available to users; no Sedona-specific API changes are introduced.

[GH-2651] Add _metadata hidden column support for GeoPackage DataSource V2 reader

Implement SupportsMetadataColumns on GeoPackageTable so that reading
GeoPackage files into a DataFrame exposes the standard _metadata hidden
struct containing file_path, file_name, file_size, file_block_start,
file_block_length, and file_modification_time.

Changes across all four Spark version modules (3.4, 3.5, 4.0, 4.1):

- GeoPackageTable: mix in SupportsMetadataColumns, define the _metadata
  MetadataColumn with the standard six-field struct type
- GeoPackageScanBuilder: override pruneColumns() to capture the pruned
  metadata schema requested by Spark column pruning optimizer
- GeoPackageScan: accept metadataSchema, override readSchema() to append
  metadata fields, pass schema to partition reader factory
- GeoPackagePartitionReaderFactory: construct metadata values from the
  PartitionedFile, wrap the base reader in PartitionReaderWithMetadata
  that joins data rows with metadata using JoinedRow + GenerateUnsafeProjection.

Add 8 test cases per Spark version module covering schema validation, hidden
column semantics, value correctness of the metadata values against the
filesystem, filtering, and projection.
Copilot AI (Contributor) left a comment

Pull request overview

This PR adds support for the _metadata hidden column to the GeoPackage DataSource V2 reader, enabling users to query file-level metadata (file_path, file_name, file_size, file_block_start, file_block_length, file_modification_time) on GeoPackage DataFrames. The implementation follows Spark's standard pattern for metadata columns and is consistently applied across all four Spark version modules (3.4, 3.5, 4.0, 4.1).

Changes:

  • Implemented SupportsMetadataColumns interface in GeoPackageTable to expose the _metadata hidden column
  • Added metadata schema pruning logic in GeoPackageScanBuilder to capture requested metadata fields
  • Modified GeoPackageScan and GeoPackagePartitionReaderFactory to propagate and populate metadata values
  • Added comprehensive test coverage to verify schema, hidden column semantics, value correctness, filtering, and projection

Reviewed changes

Copilot reviewed 20 out of 20 changed files in this pull request and generated 4 comments.

Summary per file:

  • spark/spark-4.1/src/main/scala/org/apache/sedona/sql/datasources/geopackage/GeoPackageTable.scala: Implements SupportsMetadataColumns and defines the _metadata column structure with 6 standard fields
  • spark/spark-4.1/src/main/scala/org/apache/sedona/sql/datasources/geopackage/GeoPackageScanBuilder.scala: Overrides pruneColumns to capture the pruned metadata schema from Spark's optimizer
  • spark/spark-4.1/src/main/scala/org/apache/sedona/sql/datasources/geopackage/GeoPackageScan.scala: Accepts the metadataSchema parameter and includes it in the readSchema output
  • spark/spark-4.1/src/main/scala/org/apache/sedona/sql/datasources/geopackage/GeoPackagePartitionReaderFactory.scala: Constructs metadata values from the PartitionedFile and wraps the base reader with PartitionReaderWithMetadata
  • spark/spark-4.1/src/test/scala/org/apache/sedona/sql/GeoPackageReaderTest.scala: Adds 8 test cases covering schema validation, hidden column semantics, value correctness, filtering, and projection
  • spark/spark-4.0/src/main/scala/org/apache/sedona/sql/datasources/geopackage/*.scala: Identical implementation for Spark 4.0
  • spark/spark-4.0/src/test/scala/org/apache/sedona/sql/GeoPackageReaderTest.scala: Identical test coverage for Spark 4.0
  • spark/spark-3.5/src/main/scala/org/apache/sedona/sql/datasources/geopackage/*.scala: Identical implementation for Spark 3.5
  • spark/spark-3.5/src/test/scala/org/apache/sedona/sql/GeoPackageReaderTest.scala: Identical test coverage for Spark 3.5
  • spark/spark-3.4/src/main/scala/org/apache/sedona/sql/datasources/geopackage/*.scala: Identical implementation for Spark 3.4
  • spark/spark-3.4/src/test/scala/org/apache/sedona/sql/GeoPackageReaderTest.scala: Identical test coverage for Spark 3.4



import io.minio.{MakeBucketArgs, MinioClient, PutObjectArgs}
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.{DataFrame, Row, SparkSession}

Copilot AI Feb 15, 2026

The Row import is unused and should be removed. The tests do not create or use Row objects directly.

Suggested change
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.{DataFrame, SparkSession}


import io.minio.{MakeBucketArgs, MinioClient}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.{DataFrame, Row}

Copilot AI Feb 15, 2026

The Row import is unused and should be removed. The tests do not create or use Row objects directly.

Suggested change
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.DataFrame


import io.minio.{MakeBucketArgs, MinioClient, PutObjectArgs}
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.{DataFrame, Row, SparkSession}

Copilot AI Feb 15, 2026

The Row import is unused and should be removed. The tests do not create or use Row objects directly.

Suggested change
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.{DataFrame, SparkSession}


import io.minio.{MakeBucketArgs, MinioClient, PutObjectArgs}
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.{DataFrame, Row, SparkSession}

Copilot AI Feb 15, 2026

The Row import is unused and should be removed. The tests do not create or use Row objects directly.

Suggested change
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.{DataFrame, SparkSession}

@jiayuasu merged commit e6e99f0 into master Feb 15, 2026
47 checks passed

Successfully merging this pull request may close these issues.

GeoPackage reader does not support _metadata hidden column
