[GH-2360] Support fetching libpostal model data from HDFS/object store#2637
[GH-2360] Support fetching libpostal model data from HDFS/object store#2637
Conversation
7d39f23 to
18190fd
Compare
There was a problem hiding this comment.
Pull request overview
Adds support for resolving libpostal model data directories from remote Hadoop-compatible filesystems by copying them into a per-node local cache before initializing jpostal.
Changes:
- Introduces
HadoopFileSystemUtilsto copy files/directories from remote FS to local, and reuses it from GeoPackage utilities. - Adds
LibPostalDataLoaderto detect remote URIs, download to a hashed local cache directory, and guard concurrent downloads with per-key locks. - Updates libpostal initialization and documentation; adds unit tests using
MiniDFSCluster.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
| spark/common/src/main/scala/org/apache/sedona/sql/utils/HadoopFileSystemUtils.scala | New shared Hadoop FS → local copy helpers for files and directory trees. |
| spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/expressions/LibPostalDataLoader.scala | New remote libpostal dataDir resolver with local caching and concurrency control. |
| spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/expressions/LibPostalUtils.scala | Resolves dataDir via loader and disables jpostal download for remote URIs. |
| spark/common/src/main/scala/org/apache/sedona/sql/datasources/geopackage/connection/FileSystemUtils.scala | Refactors GeoPackage copy utility to delegate to shared Hadoop FS helper. |
| spark/common/src/test/scala/org/apache/sedona/sql/HadoopFileSystemUtilsTest.scala | New tests for local/remote detection and HDFS copy behaviors. |
| spark/common/src/test/scala/org/apache/sedona/sql/LibPostalDataLoaderTest.scala | New tests for remote-path detection, caching behavior, and concurrent access. |
| docs/api/sql/Function.md | Documents remote URI support for libpostal dataDir configuration. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
.../common/src/main/scala/org/apache/spark/sql/sedona_sql/expressions/LibPostalDataLoader.scala
Outdated
Show resolved
Hide resolved
.../common/src/main/scala/org/apache/spark/sql/sedona_sql/expressions/LibPostalDataLoader.scala
Outdated
Show resolved
Hide resolved
.../common/src/main/scala/org/apache/spark/sql/sedona_sql/expressions/LibPostalDataLoader.scala
Outdated
Show resolved
Hide resolved
.../common/src/main/scala/org/apache/spark/sql/sedona_sql/expressions/LibPostalDataLoader.scala
Outdated
Show resolved
Hide resolved
spark/common/src/test/scala/org/apache/sedona/sql/LibPostalDataLoaderTest.scala
Outdated
Show resolved
Hide resolved
18190fd to
8464efc
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 8 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
.../common/src/main/scala/org/apache/spark/sql/sedona_sql/expressions/LibPostalDataLoader.scala
Show resolved
Hide resolved
.../common/src/main/scala/org/apache/spark/sql/sedona_sql/expressions/LibPostalDataLoader.scala
Show resolved
Hide resolved
spark/common/src/main/scala/org/apache/sedona/sql/utils/HadoopFileSystemUtils.scala
Outdated
Show resolved
Hide resolved
.../common/src/main/scala/org/apache/spark/sql/sedona_sql/expressions/LibPostalDataLoader.scala
Outdated
Show resolved
Hide resolved
.../common/src/main/scala/org/apache/spark/sql/sedona_sql/expressions/LibPostalDataLoader.scala
Outdated
Show resolved
Hide resolved
spark/common/src/test/scala/org/apache/sedona/sql/LibPostalDataLoaderTest.scala
Outdated
Show resolved
Hide resolved
spark/common/src/test/scala/org/apache/sedona/sql/LibPostalDataLoaderTest.scala
Outdated
Show resolved
Hide resolved
spark/common/src/test/scala/org/apache/sedona/sql/HadoopFileSystemUtilsTest.scala
Outdated
Show resolved
Hide resolved
bea9f8c to
6f7d89d
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
.../common/src/main/scala/org/apache/spark/sql/sedona_sql/expressions/LibPostalDataLoader.scala
Outdated
Show resolved
Hide resolved
.../common/src/main/scala/org/apache/spark/sql/sedona_sql/expressions/LibPostalDataLoader.scala
Outdated
Show resolved
Hide resolved
spark/common/src/test/scala/org/apache/sedona/sql/LibPostalDataLoaderTest.scala
Outdated
Show resolved
Hide resolved
e2059a2 to
c7b509a
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 6 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
.../common/src/main/scala/org/apache/spark/sql/sedona_sql/expressions/LibPostalDataLoader.scala
Show resolved
Hide resolved
spark/common/src/test/scala/org/apache/sedona/sql/LibPostalDataLoaderTest.scala
Outdated
Show resolved
Hide resolved
spark/common/src/test/scala/org/apache/sedona/sql/LibPostalDataLoaderTest.scala
Outdated
Show resolved
Hide resolved
.../common/src/main/scala/org/apache/spark/sql/sedona_sql/expressions/LibPostalDataLoader.scala
Outdated
Show resolved
Hide resolved
c7b509a to
fa29ea9
Compare
fa29ea9 to
74fe0f0
Compare
74fe0f0 to
41c1ac9
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
spark/common/src/test/scala/org/apache/sedona/sql/LibPostalDataLoaderTest.scala
Outdated
Show resolved
Hide resolved
spark/common/src/test/scala/org/apache/sedona/sql/LibPostalDataLoaderTest.scala
Outdated
Show resolved
Hide resolved
.../common/src/main/scala/org/apache/spark/sql/sedona_sql/expressions/LibPostalDataLoader.scala
Show resolved
Hide resolved
.../common/src/main/scala/org/apache/spark/sql/sedona_sql/expressions/LibPostalDataLoader.scala
Show resolved
Hide resolved
41c1ac9 to
16272c2
Compare
16272c2 to
b9a8d2c
Compare
…store - Add shared HadoopFileSystemUtils with copyFileToLocal and copyDirectoryToLocal - Refactor GeoPackage FileSystemUtils to delegate to shared utility - Add LibPostalDataLoader to resolve remote dataDir (HDFS, S3, GCS, ABFS) to local cache - Update LibPostalUtils to use LibPostalDataLoader for remote path resolution - Disable jpostal auto-download when data is fetched from remote store - Add tests for HadoopFileSystemUtils and LibPostalDataLoader using MiniDFSCluster - Update docs for ExpandAddress and ParseAddress with remote dataDir instructions
b9a8d2c to
2a55427
Compare
docs/api/sql/Function.md
Outdated
There was a problem hiding this comment.
This should still be in a note, not the intro
There was a problem hiding this comment.
i intendtionally removed it since there are too many notes
There was a problem hiding this comment.
but now I bring it back
There was a problem hiding this comment.
maybe they dont need to be notes but they shouldnt be part of the intro
docs/api/sql/Function.md
Outdated
There was a problem hiding this comment.
why wouldnt this be automatically distirbuted as part of the implementation here?
There was a problem hiding this comment.
sc.addFile needs to be called from the driver before tasks run on executors
It requires access to SparkContext, which isn't readily available from within a UDF/expression
There was a problem hiding this comment.
Can we handle this when we initialize the spark session with sedona?
There was a problem hiding this comment.
I don't want to handle that because no everybody needs libpostal so it has to be optional.
But if it requires user to opt in, then it is no different than a sc.addFile call?
There was a problem hiding this comment.
good point. let me think about this
There was a problem hiding this comment.
Maybe a two pronged approach would work:
- add an optmizer rule that performs the sc.addFile call when it sees a libpostal method in the plan
- in the execution itself we also try to perform a sc.addFile call to account for the cosntantFolding case.
Not totally sure if ive missed something
There was a problem hiding this comment.
Automated sc.addFile per query is dangerous and we might end up handling race condition. I think this over complicates the problem. I would rather not have this feature at all.
docs/api/sql/Function.md
Outdated
There was a problem hiding this comment.
im not sure if "how to upload data to s3" is a neccessary part of our docs.
| val localPath = SparkFiles.get(basename) | ||
| val localFile = new File(localPath) |
There was a problem hiding this comment.
in contrast with the docs it seems that this code downloads the data from the hdfs itself.
There was a problem hiding this comment.
no it does not. It checks if Spark actually downloads it. See: https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.SparkContext.addFile.html#pyspark.SparkContext.addFile
|
Looks good other than the docs. |
Did you read the Contributor Guide?
Is this PR related to a ticket?
[SEDONA-XXX] my subject. Closes Enhancements to libpostal integration: Fetch model from HDFS/object store #2360What changes were proposed in this PR?
This PR enables libpostal model data (
spark.sedona.libpostal.dataDir) to be loaded from remote filesystems such as HDFS, S3 (s3a://), GCS (gs://), and ABFS (abfs://), in addition to local paths.Problem
jpostal requires the ~2 GB libpostal model data to reside on the local filesystem. In cloud deployments, users must manually pre-install the data on every executor node, which is operationally cumbersome.
Solution
Leverages Spark's native file distribution mechanism (
SparkContext.addFile/SparkFiles.get) to distribute remote data to executors. Users callsc.addFile(remotePath, recursive = true)once before running queries, and Sedona resolves the data locally viaSparkFiles.get(basename).LibPostalDataLoader: New object that resolves remote paths viaSparkFiles.get(). For local paths andfile://URIs, values are passed through unchanged (withfile:URIs normalized to plain paths). For remote URIs, the basename is extracted and looked up through SparkFiles. Uses case-insensitive scheme comparison forfile:URIs.LibPostalUtils: Modified to callLibPostalDataLoader.resolveDataDir()before jpostal init, and to disable jpostal's built-in download when using a remote path.sc.addFile, anIllegalStateExceptionis thrown with instructions to callsc.addFile(remotePath, recursive = true).Files changed
spark/common/.../expressions/LibPostalDataLoader.scalaspark/common/.../expressions/LibPostalUtils.scalaLibPostalDataLoader.resolveDataDir()before jpostal initdocs/api/sql/Function.mdsc.addFileusage forExpandAddressandParseAddressspark/common/.../sql/LibPostalDataLoaderTest.scalaHow was this patch tested?
isRemotePath— 11 URI schemes (local, relative,file://,hdfs://with/without host,s3a://,gs://,abfs://,wasb://, empty, Windows)resolveDataDir— local path unchanged,file://URI normalizationIllegalStateExceptionwith clearsc.addFile(..., recursive = true)instructions when data not foundresolveDataDirresolves a remote URI to the correct local path with all subdirectories intactmvn test -pl spark/common -Dspark=3.5 -Dscala=2.12 -Dtest=none -DfailIfNoTests=false -DwildcardSuites="org.apache.sedona.sql.LibPostalDataLoaderTest"Did this PR include necessary documentation updates?
docs/api/sql/Function.mdfor bothExpandAddressandParseAddressto document the default data directory (<java.io.tmpdir>/libpostal/),spark.sedona.libpostal.dataDirconfiguration, data preparation instructions (libpostal install + upload to HDFS/S3), and instructions for usingsc.addFilewith remote paths.