Thank you for the great integration of libpostal described in #2074
I have the following enhancement proposal to make it more usable in an enterprise context. In such environments there is usually no Internet connectivity from the Spark cluster and no direct access to the nodes, so the libpostal integration is difficult to use as it currently needs to download the model from the Internet.
Based on the libpostal integration pull request #2077, I can see that a config "spark.sedona.libpostal.dataDir" is accepted. It defaults to a local temp directory, because libpostal can only load from a local filesystem.
I propose the following addition:
Accept a folder on HDFS or an object store (e.g. S3). For a larger job with many nodes it is much more efficient to load from HDFS/object stores than from the Internet (which may also be unavailable: no connectivity, server down, etc.).
Since libpostal expects a local directory, I propose that if someone sets spark.sedona.libpostal.dataDir to, for example, "s3a://blabla/libpostal", Sedona uses the Hadoop dependency of Spark to list the contents of the dataDir (https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/fs/FileSystem.html), e.g. via
FileSystem.get(new URI(dataDir), sparkContext.hadoopConfiguration).listFiles(new Path(dataDir), true)
...
copy all the files to a local tmp directory using the FileSystem class (if not done already) and point libpostal to it.
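To illustrate, the local-vs-remote dispatch could be sketched as below. This is a minimal sketch, not Sedona's actual implementation: the class and method names are hypothetical, and the remote branch only shows (in comments) where the Hadoop FileSystem listing/copying would go.

```java
import java.io.IOException;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper (not Sedona's actual API) sketching how the
// spark.sedona.libpostal.dataDir value could be resolved.
public class LibpostalDataDirResolver {

    // Returns a directory on the local filesystem that libpostal can load from.
    // A local path is used as-is; any remote scheme (hdfs://, s3a://, ...) would
    // be staged into the given local temp directory first.
    public static Path resolve(String dataDir, Path localTmp) throws IOException {
        URI uri = URI.create(dataDir);
        String scheme = uri.getScheme();
        if (scheme == null || scheme.equals("file")) {
            // Already local: point libpostal straight at it.
            return Paths.get(uri.getPath() != null ? uri.getPath() : dataDir);
        }
        // Remote: in a real implementation this branch would use Spark's
        // bundled Hadoop client, roughly:
        //   FileSystem fs = FileSystem.get(uri, sparkContext.hadoopConfiguration());
        //   RemoteIterator<LocatedFileStatus> it = fs.listFiles(new Path(dataDir), true);
        //   while (it.hasNext()) fs.copyToLocalFile(it.next().getPath(), localPath);
        // Here we only create and return the designated local cache directory.
        Files.createDirectories(localTmp);
        return localTmp;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempDirectory("libpostal-data");
        System.out.println(resolve("/opt/libpostal", tmp));         // local dir, used as-is
        System.out.println(resolve("s3a://bucket/libpostal", tmp)); // remote, staged into tmp
    }
}
```

The "if not done already" caching check would sit in the remote branch: skip the copy if the local cache directory is already populated.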
Additionally, I propose that the Apache Sedona documentation include a small shell script showing how to fetch the data from the Internet so that a user can upload it to HDFS or an object store (e.g. S3). Maybe something similar to https://github.com/openvenues/libpostal/blob/master/src/libpostal_data.in
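A minimal sketch of such a documentation snippet might look like the following. The download URL, paths, and the DRY_RUN/run helper are all placeholders/illustrative; libpostal's own libpostal_data script remains the authoritative source for the real data URLs.

```shell
#!/usr/bin/env bash
# Sketch: fetch the libpostal data locally, then upload it to HDFS
# (or an object store via an s3a:// target).
# DATA_URL and all paths are placeholders; see libpostal's libpostal_data
# script for the authoritative download URLs.
set -euo pipefail

# Echo commands instead of executing them when DRY_RUN=1 (useful for testing).
run() { if [ "${DRY_RUN:-0}" = "1" ]; then echo "$*"; else "$@"; fi; }

fetch_and_upload() {
  local data_url="$1" local_dir="$2" target="$3"
  run mkdir -p "$local_dir"
  run curl -fL "$data_url" -o "$local_dir/libpostal_data.tar.gz"
  run hadoop fs -mkdir -p "$target"
  run hadoop fs -put -f "$local_dir/libpostal_data.tar.gz" "$target"
}

# Example invocation (placeholder URL and target):
# fetch_and_upload "https://example.invalid/libpostal_data.tar.gz" /tmp/libpostal-data hdfs:///data/libpostal
```

Running the script once from a machine with Internet access is enough; afterwards every cluster node can read the data from HDFS/S3 without external connectivity.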