[Python][C++] ArrowKeyError: Attempted to register factory for scheme 'file' when using pip-installed GDAL #44696

Open
gbelouze opened this issue Nov 11, 2024 · 6 comments

Comments

@gbelouze

Describe the bug, including details regarding any error messages, version, and platform.

As always, let me preface this with a thanks for this project and the help the arrow developers are giving to the community (which I have benefited from before).

Description

I've been tracking down a bug that is most likely install-related. Here is a minimal reproducer:

# file test.py
import osgeo
from pyarrow import fs
local = fs.LocalFileSystem()

and if I run it

$ python test.py
Traceback (most recent call last):
  File "test.py", line 3, in <module>
    local = fs.LocalFileSystem()
            ^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/_fs.pyx", line 1112, in pyarrow._fs.LocalFileSystem.__init__
  File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowKeyError: Attempted to register factory for scheme 'file' but that scheme is already registered.

Note that if I do not import osgeo there is no error.

Reproduce

The steps from scratch are the following

conda create -n test-arrow python=3.12
conda activate test-arrow
pip install pyarrow 'gdal==3.9.2'

Probably relevant is the fact that I have the gdal libraries installed with Homebrew (on macOS):

$ brew info gdal
==> gdal: stable 3.9.2 (bottled), HEAD
Geospatial Data Abstraction Library
https://www.gdal.org/
Conflicts with:
  avce00 (because both install a cpl_conv.h header)
  cpl (because both install cpl_error.h)
Installed
/opt/homebrew/Cellar/gdal/3.9.2_1 (497 files, 35.9MB) *
  Poured from bottle using the formulae.brew.sh API on 2024-09-27 at 14:14:44
From: https://github.com/Homebrew/homebrew-core/blob/HEAD/Formula/g/gdal.rb

Workaround

Everything works if I conda install gdal instead of using pip (presumably, this also installs the libraries and does not use the Homebrew-installed gdal).

Component(s)

Python

@raulcd raulcd changed the title ArrowKeyError: Attempted to register factory for scheme 'file' when using pip-installed GDAL [Python][C++] ArrowKeyError: Attempted to register factory for scheme 'file' when using pip-installed GDAL Nov 12, 2024
@raulcd
Member

raulcd commented Nov 12, 2024

Thanks for raising this. It seems that, in this specific case, we are registering the filesystem twice on the FileSystemFactoryRegistry. @bkietz I might try to investigate further, but do you have any clue about what might be happening here?

@bkietz
Member

bkietz commented Nov 12, 2024

This is probably due to conflicting versions of libarrow in the same process (one from brew's gdal and one from conda's pyarrow). Each libarrow is registering different factories for the file:// scheme, which is raised as an error. Installing with just conda ensures that only a single version of libarrow is visible which ensures that only one set of factories is registered. @gbelouze Could you check if gdal and conda indeed have multiple libarrows?
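To illustrate the failure mode described here, the following is a toy model only, not Arrow's actual implementation: a process-wide registry that refuses duplicate scheme registrations, with two "library copies" each trying to register their own factory for 'file'. The names SchemeRegistry and register_builtin_factories are hypothetical, and Python's KeyError merely stands in for ArrowKeyError.

```python
class SchemeRegistry:
    """Toy stand-in for a process-wide filesystem factory registry."""

    def __init__(self):
        self._factories = {}

    def register(self, scheme, factory):
        # Refuse to overwrite silently: a second registration is an error.
        if scheme in self._factories:
            raise KeyError(
                f"Attempted to register factory for scheme '{scheme}' "
                "but that scheme is already registered."
            )
        self._factories[scheme] = factory

    def create(self, scheme):
        return self._factories[scheme]()


def register_builtin_factories(registry, copy_name):
    """Simulate the start-up registration done by one copy of the library."""
    registry.register("file", lambda: f"LocalFileSystem from {copy_name}")


registry = SchemeRegistry()
register_builtin_factories(registry, "copy A")      # first copy: succeeds
try:
    register_builtin_factories(registry, "copy B")  # second copy: rejected
except KeyError as exc:
    print("second registration failed:", exc)
```

With a single copy of the library in the process, only the first call happens and no error is raised, which would be consistent with the conda-only install working.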

Having multiple versions of libarrow in play seems like something we'd want to avoid in any case; more subtle errors could arise than this KeyError with filesystem registration. Maybe we should try to assert this at runtime?

@gbelouze
Author

Thanks for your response. I am a bit puzzled because in the steps to reproduce the error I am not installing anything with conda, just using it as a Python environment manager. Could you maybe point me to how I should check for multiple libarrows?

I have

$ conda list
# packages in environment at /opt/homebrew/Caskroom/miniconda/base/envs/test-arrow:
#
# Name                    Version                   Build  Channel
bzip2                     1.0.8                h99b78c6_7    conda-forge
ca-certificates           2024.8.30            hf0a4a13_0    conda-forge
gdal                      3.9.2                    pypi_0    pypi
libexpat                  2.6.4                h286801f_0    conda-forge
libffi                    3.4.2                h3422bc3_5    conda-forge
libsqlite                 3.47.0               hbaaea75_1    conda-forge
libzlib                   1.3.1                h8359307_2    conda-forge
ncurses                   6.5                  h7bae524_1    conda-forge
openssl                   3.4.0                h39f12f2_0    conda-forge
pip                       24.3.1             pyh8b19718_0    conda-forge
pyarrow                   18.0.0                   pypi_0    pypi
python                    3.12.7          h739c21a_0_cpython    conda-forge
readline                  8.2                  h92ec313_1    conda-forge
setuptools                75.3.0             pyhd8ed1ab_0    conda-forge
tk                        8.6.13               h5083fa2_1    conda-forge
tzdata                    2024b                hc8b5060_0    conda-forge
wheel                     0.45.0             pyhd8ed1ab_0    conda-forge
xz                        5.2.6                h57fd34a_0    conda-forge

@bkietz
Member

bkietz commented Nov 12, 2024

If you're using conda, that will put libarrow.so in $CONDA_PREFIX/lib/

Brew is configurable. IIUC the default prefix is /opt/homebrew on Apple Silicon (/usr/local on Intel Macs), but you can get the exact location for a package with brew --prefix gdal

I expect you will see libarrow in both of those places in your error case
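A minimal sketch of that on-disk check, assuming the usual default locations: it looks for files named libarrow* in $CONDA_PREFIX/lib, in the default Homebrew lib directories, and in the directories reported by pyarrow.get_library_dirs() when pyarrow is importable. The helper names find_libarrow and candidate_directories are hypothetical, and the Homebrew paths are defaults that may not match a customised install.

```python
import os
from pathlib import Path


def find_libarrow(directories):
    """Return sorted paths of files named libarrow* directly inside the given directories."""
    hits = []
    for directory in directories:
        root = Path(directory)
        if root.is_dir():  # silently skip locations that do not exist
            hits.extend(str(p) for p in root.glob("libarrow*"))
    return sorted(hits)


def candidate_directories():
    """Collect likely library directories (assumed defaults, adjust as needed)."""
    dirs = []
    conda_prefix = os.environ.get("CONDA_PREFIX")
    if conda_prefix:
        dirs.append(os.path.join(conda_prefix, "lib"))
    # Default Homebrew prefixes: Apple Silicon first, then Intel.
    dirs += ["/opt/homebrew/lib", "/usr/local/lib"]
    try:
        import pyarrow
        dirs += pyarrow.get_library_dirs()  # where a pip wheel keeps its bundled libraries
    except ImportError:
        pass
    return dirs
```

Calling find_libarrow(candidate_directories()) inside the affected environment should list every copy found; hits in more than one directory would support the multiple-libarrow hypothesis.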

On Linux you can see which dynamic libraries are loaded at runtime with

>>> import pathlib
>>> print(pathlib.Path("/proc/self/maps").read_text())

... but I don't know what the equivalent for macOS would be
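One possible way to do the same check from Python on both platforms is sketched here. The Linux branch parses /proc/self/maps as in the snippet above. The macOS branch is an assumption, not something verified in this thread: it calls dyld's _dyld_image_count and _dyld_get_image_name through ctypes. The function names parse_proc_maps, loaded_libraries and arrow_libraries are hypothetical.

```python
import ctypes
import sys
from pathlib import Path


def parse_proc_maps(text):
    """Extract the unique file paths mapped into a process from /proc/<pid>/maps text."""
    paths = set()
    for line in text.splitlines():
        # Columns: address perms offset dev inode [pathname]
        fields = line.split(None, 5)
        if len(fields) == 6 and fields[5].startswith("/"):
            paths.add(fields[5])
    return sorted(paths)


def loaded_libraries():
    """List the files currently loaded into this Python process."""
    if sys.platform.startswith("linux"):
        return parse_proc_maps(Path("/proc/self/maps").read_text())
    if sys.platform == "darwin":
        # Assumption: the dyld image-listing functions are reachable via dlopen(NULL).
        dyld = ctypes.CDLL(None)
        dyld._dyld_image_count.restype = ctypes.c_uint32
        dyld._dyld_get_image_name.restype = ctypes.c_char_p
        dyld._dyld_get_image_name.argtypes = [ctypes.c_uint32]
        names = (dyld._dyld_get_image_name(i) for i in range(dyld._dyld_image_count()))
        return sorted({n.decode() for n in names if n})
    raise NotImplementedError(f"no library listing implemented for {sys.platform}")


def arrow_libraries(paths):
    """Keep only the paths whose file name starts with 'libarrow'."""
    return [p for p in paths if Path(p).name.startswith("libarrow")]
```

Running print(arrow_libraries(loaded_libraries())) right after the osgeo and pyarrow imports should show which libarrow files are actually loaded. A copy linked statically into another library would not appear under its own name, so an empty or single-entry result does not rule out a second copy.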

@paleolimbot
Member

I think in your case you're getting a statically-linked Arrow (via pyarrow, since it looks like it was installed from pypi) and a dynamically-linked Arrow (via pip install gdal, which I believe builds against your homebrew Arrow). I am wondering if that error would also occur if the versions were identical because of the static/dynamic mismatch.

Also possibly related: OSGeo/gdal#10539

@ludwick

ludwick commented Nov 14, 2024

The basic question: Is it possible to install pyarrow in a way that doesn't include libarrow as a static bundle or otherwise tell it to use a system installed one?

Background: I am on a Mac, using R and RStudio. Our R code uses many packages that depend on gdal, and on macOS the natural way to get that is a Homebrew install, which also installs the Homebrew package apache-arrow, which includes libarrow. I'm also working with Python code that calls out to R packages (using rpy2), which indirectly loads the Homebrew version of libarrow. I'm also using geopandas, which requires pyarrow in order to write GeoDataFrame objects to parquet files. And thus I hit this issue.

I can work around it in a number of ways:

  • round trip geopandas dataframes into R, then back to python pandas data frames and use pandas.DataFrame.to_parquet with engine="fastparquet" (this maps the geometry column into WKT)
  • within python convert geopandas dataframes into pandas versions, manually converting the geometry column and then use pandas to_parquet (as above).
  • rewrite any code that needs to write geopandas dataframes to parquet to avoid import of rpy2 (thus avoiding the R environment being loaded and thus loading system / homebrew libarrow).

But given that libarrow installed as part of pyarrow on disk is already 50MB and I have another one installed in the homebrew setup (same version even!) it would be nice to just have one installed.
