Thank you for the great integration of libpostal described in #2074
I have the following enhancement proposal to make it more usable in an enterprise context. In such environments there is usually no Internet connectivity from the Spark cluster and no direct access to the nodes, so the libpostal integration is difficult to use as it currently needs to download the model from the Internet.
Based on the libpostal integration pull request #2077, I can see that a config "spark.sedona.libpostal.dataDir" is accepted. It defaults to a local temp directory, because libpostal can only load from a local filesystem.
I propose the following addition:
Accept a folder on HDFS or an object store (e.g. S3). For a larger job with many nodes it is much more efficient to load from HDFS/object stores than from the Internet (which may also be unavailable: no connectivity, server down, etc.).
Since libpostal expects a local directory, I propose that if someone sets spark.sedona.libpostal.dataDir to, for example, "s3a://blabla/libpostal", Sedona uses the Hadoop dependency of Spark to list the contents of the dataDir (https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/fs/FileSystem.html), e.g. via
FileSystem.get(new URI(dataDir), sparkContext.hadoopConfiguration).listFiles(new Path(dataDir), true)
...
copy all the files to a local tmp directory using the FileSystem class (if not done already) and point libpostal to it.
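To illustrate, the local-vs-remote dispatch could be sketched as below. This is a minimal sketch, not Sedona's actual implementation: the class and method names are hypothetical, and the remote branch only shows (in comments) where the Hadoop FileSystem listing/copying would go.

```java
import java.io.IOException;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper (not Sedona's actual API) sketching how the
// spark.sedona.libpostal.dataDir value could be resolved.
public class LibpostalDataDirResolver {

    // Returns a directory on the local filesystem that libpostal can load from.
    // A local path is used as-is; any remote scheme (hdfs://, s3a://, ...) would
    // be staged into the given local temp directory first.
    public static Path resolve(String dataDir, Path localTmp) throws IOException {
        URI uri = URI.create(dataDir);
        String scheme = uri.getScheme();
        if (scheme == null || scheme.equals("file")) {
            // Already local: point libpostal straight at it.
            return Paths.get(uri.getPath() != null ? uri.getPath() : dataDir);
        }
        // Remote: in a real implementation this branch would use Spark's
        // bundled Hadoop client, roughly:
        //   FileSystem fs = FileSystem.get(uri, sparkContext.hadoopConfiguration());
        //   RemoteIterator<LocatedFileStatus> it = fs.listFiles(new Path(dataDir), true);
        //   while (it.hasNext()) fs.copyToLocalFile(it.next().getPath(), localPath);
        // Here we only create and return the designated local cache directory.
        Files.createDirectories(localTmp);
        return localTmp;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempDirectory("libpostal-data");
        System.out.println(resolve("/opt/libpostal", tmp));         // local dir, used as-is
        System.out.println(resolve("s3a://bucket/libpostal", tmp)); // remote, staged into tmp
    }
}
```

The "if not done already" caching check would sit in the remote branch: skip the copy if the local cache directory is already populated.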
Additionally, I propose that the Apache Sedona documentation include a small shell script showing how to fetch the data from the Internet so that a user can upload it to HDFS or an object store (e.g. S3). Maybe something similar to https://github.com/openvenues/libpostal/blob/master/src/libpostal_data.in
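A minimal sketch of such a documentation snippet might look like the following. The download URL, paths, and the DRY_RUN/run helper are all placeholders/illustrative; libpostal's own libpostal_data script remains the authoritative source for the real data URLs.

```shell
#!/usr/bin/env bash
# Sketch: fetch the libpostal data locally, then upload it to HDFS
# (or an object store via an s3a:// target).
# DATA_URL and all paths are placeholders; see libpostal's libpostal_data
# script for the authoritative download URLs.
set -euo pipefail

# Echo commands instead of executing them when DRY_RUN=1 (useful for testing).
run() { if [ "${DRY_RUN:-0}" = "1" ]; then echo "$*"; else "$@"; fi; }

fetch_and_upload() {
  local data_url="$1" local_dir="$2" target="$3"
  run mkdir -p "$local_dir"
  run curl -fL "$data_url" -o "$local_dir/libpostal_data.tar.gz"
  run hadoop fs -mkdir -p "$target"
  run hadoop fs -put -f "$local_dir/libpostal_data.tar.gz" "$target"
}

# Example invocation (placeholder URL and target):
# fetch_and_upload "https://example.invalid/libpostal_data.tar.gz" /tmp/libpostal-data hdfs:///data/libpostal
```

Running the script once from a machine with Internet access is enough; afterwards every cluster node can read the data from HDFS/S3 without external connectivity.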