Description
@cloudnativegeo/cng-editorial-board will review this submission.
Blog Post Title
Introducing geoparquet-io
Author(s)
Chris Holmes and Nissim Lebovits
Summary
We're announcing geoparquet-io (gpio), an opinionated Python CLI tool for converting, validating, and optimizing GeoParquet files. It uses DuckDB, PyArrow, and obstore to enforce cloud-native best practices by default—bbox columns, Hilbert ordering, ZSTD compression—and supports composable pipelines via Unix pipes and a fluent Python API.
Why this post is relevant to Cloud Native Geo
gpio is built specifically to make cloud-native GeoParquet easier to produce and consume. It automates best practices from the GeoParquet spec, simplifies cloud storage workflows across S3/GCS/Azure, and supports spatial partitioning strategies (H3, S2, admin boundaries, etc.) that are central to cloud-native geospatial data distribution.
Timeline
- Draft submission date: [TBD]
- Final publication date: [TBD]
Anything else to share?
Full draft below.
Introducing geoparquet-io
A Python CLI tool for optimizing GeoParquet data
By Chris Holmes and Nissim Lebovits
We're releasing geoparquet-io (or gpio), an opinionated command-line tool for converting, validating, and optimizing GeoParquet files.
gpio is written in Python and uses DuckDB (with GDAL embedded for legacy format support), PyArrow, and obstore for fast operations on larger-than-memory datasets. By default, gpio enforces best practices: bbox columns, Hilbert ordering, ZSTD compression, and smart row group sizes.
What does it do?
gpio offers a CLI and a fluent Python API to help you create, validate, and optimize GeoParquet files. The CLI is designed for composability; commands chain together with Unix pipes, produce structured output with --json flags, and are predictable enough for use with AI coding assistants. The Python API keeps data in memory as Arrow tables, avoiding file I/O entirely and integrating directly into existing workflows.
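As a small sketch of what that structured output enables (the exact placement of the --json flag here is our assumption, not documented syntax):

```bash
# Pipe machine-readable inspect output into standard tooling like jq;
# --json flag placement is assumed
gpio inspect buildings.parquet --json | jq .
```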
Convert with optimized defaults
With gpio convert, you can seamlessly convert from (and to) legacy formats like Shapefiles, GeoJSON, and GeoPackages:
```bash
# One command: converts, adds bbox, Hilbert-sorts, compresses
gpio convert buildings.shp buildings.parquet
```

By default, the resulting GeoParquet files are optimized for best practices, including:
- `bbox` column with covering metadata
- Hilbert curve spatial ordering
- ZSTD compression
- Appropriate row group sizes
- Automatic partitioning (when appropriate)
These optimizations can improve compression, I/O, and spatial query performance by 10–100x. Existing GeoParquet files can also be optimized in place with `gpio check all --fix`.
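A minimal sketch of that in-place workflow (we're assuming the file path follows the flags):

```bash
# Validate a file against the spec and repair any issues found;
# argument order is an assumption
gpio check all --fix buildings.parquet
```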
Pipes and chains
One of gpio's strengths is composability. On the CLI, commands chain together with Unix pipes using Arrow IPC streaming—no intermediate files:
```bash
# Extract Senegal from global admin boundaries, Hilbert-sort
gpio extract --bbox "-18,14,-11,18" \
  https://data.fieldmaps.io/edge-matched/humanitarian/intl/adm2_polygons.parquet | \
gpio sort hilbert - senegal_adm2.parquet
```

```bash
# Chain enrichment steps together
gpio add bbox input.parquet | \
gpio add h3 --resolution 9 - | \
gpio sort hilbert - enriched.parquet
```

The Python API mirrors this with a fluent interface:
```python
import geoparquet_io as gpio

gpio.read('buildings.parquet') \
    .add_bbox() \
    .add_h3(resolution=9) \
    .sort_hilbert() \
    .write('s3://bucket/optimized.parquet')
```

Large files and cloud workflows
Large files can be automatically partitioned based on a target row count. Partitioning strategies include H3, KD-tree, quadkey, admin boundaries (via Overture and GAUL), A5, and S2, as well as partitioning by an arbitrary existing column.
DuckDB handles all transformations with streaming SQL execution—memory stays constant regardless of file size. Its spatial extension reads legacy formats via GDAL's ST_Read, and its httpfs extension handles remote file reads from S3, GCS, and Azure. PyArrow handles Parquet I/O and returns Arrow tables for seamless integration with pandas and Polars. For cloud writes, obstore enables streaming output to S3, GCS, and Azure.
```bash
# Convert shapefile → auto-partition by H3 → write directly to S3
gpio convert large_roads.shp | \
gpio partition h3 - s3://bucket/roads/ --auto --hive --profile prod
```

Why not GDAL?
gpio uses GDAL under the hood—it's what makes all the format conversions work. The difference is focus: GDAL is a general-purpose toolkit, while gpio is opinionated about cloud-native GeoParquet with sensible defaults.
One example is cloud storage. Remote reads and writes in gpio just work—pass a URL and go:
```bash
# Just works - no /vsicurl prefix needed
gpio inspect https://data.fieldmaps.io/edge-matched/humanitarian/intl/adm2_polygons.parquet
```

| Feature | GDAL 3.9+ | gpio |
|---|---|---|
| `bbox` column | Yes (default) | Yes |
| Spatial sorting | Optional, bbox-based | Hilbert curve (better clustering) |
| Sorting default | OFF | ON |
| Sorting overhead | Temp GeoPackage file | In-memory or streaming |
| Partitioning | No | H3, S2, A5, quadkey, KD-tree, admin |
| Validation | No | Spec compliance checking |
| Fix issues in-place | No | `--fix` flags |
| Read from S3/HTTP | `/vsis3/` or `/vsicurl/` prefix | Just use the URL |
| Write to S3 | Manual `/vsis3/` + env vars | Direct path via obstore |
| Credential handling | Manual configuration | Automatic (AWS, GCP, Azure) |
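For comparison, a rough sketch of reading the same remote file with GDAL's command-line tools, assuming a GDAL build with Parquet support:

```bash
# GDAL requires the /vsicurl/ virtual filesystem prefix for remote reads
ogrinfo -so -al /vsicurl/https://data.fieldmaps.io/edge-matched/humanitarian/intl/adm2_polygons.parquet
```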
Additional features
- bbox-based subsetting of datasets for spatial filtering and extraction
- Service extraction from ArcGIS Feature Services and BigQuery tables → GeoParquet
- Easy inspection of metadata, row previews, and statistics
- PMTiles generation via the `gpio-pmtiles` plugin
- A Claude Code skill for AI-assisted spatial data workflows
gpio supports GeoParquet 1.1, 2.0, and native Parquet geometry/geography types.
How can I help?
gpio is currently available as a v1.0 beta. At this stage, we're looking for early users to help with stress-testing, bug reports, feature requests, and, of course, PRs. Check out the open GitHub issues to see what's currently planned.